[
https://issues.apache.org/jira/browse/SQOOP-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258543#comment-14258543
]
Veena Basavaraj commented on SQOOP-1957:
----------------------------------------
This ticket does not make much sense any more, since I understand the problem
better!
In the case of CSVIDF, the data is the CSV text, so we proactively construct it
when the object array is set. For instance, setObjectData(..) proactively
constructs the csvText (represented by data), and everything else is lazy, i.e.
in CSVIDF the object array is constructed lazily from the csvText (i.e. data).
We can remove either data or csvText and hold the value in one place. I now
think this needs to be cleaned up in this RB itself; there is no point in
holding the same thing in two variables.
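To make the "hold it in one place" idea concrete, here is a minimal sketch. It
is not the actual CSVIDF code: the class name SingleFieldCsvIdf is hypothetical,
and the CSV handling is a naive comma join/split with no quoting or escaping.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical stand-in for CSVIDF, not the actual Sqoop class.
class SingleFieldCsvIdf {
    // The only stored state: for a CSV IDF, the CSV text is the data.
    private String csvText;

    // Eagerly builds the CSV text when the object array is set.
    void setObjectData(Object[] values) {
        // Naive join: no quoting or escaping (assumption for brevity).
        this.csvText = Arrays.stream(values)
                .map(v -> String.valueOf(v))
                .collect(Collectors.joining(","));
    }

    void setCSVText(String csv) {
        this.csvText = csv;
    }

    String getCSVText() {
        return csvText;
    }

    // Lazily derives the object array from the CSV text on every call.
    Object[] getObjectData() {
        // Naive split: every field comes back as a String.
        return csvText.split(",", -1);
    }
}
```

With a single field there is nothing to keep in sync; the cost is a re-parse on
each getObjectData call.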
In the case of JSONIDF, data is a JSONObject (the same question applies to any
other IDF, say AvroIDF), so what should be the source of truth? Should it be the
JSON object alone, i.e. we hold only the JSONObject and lazily construct both
the CSV and the object array? Or should the CSV always be constructed?
So here are the two options:
Option #1 - Do not store anything but data
Store the source of truth in data.
When setData is called, store the JSON in data and nothing else; everything
else is lazy.
When setObjectData is called, construct the JSON object from it; do not store
any CSVText or objectArray.
When setCSVText is called, also construct the JSON object from it; do not store
any CSVText or objectArray.
When getObjectData is called, convert from JSON to objectArray, so this is on
demand.
When getCSVText is called, convert from JSON to CSVText, so this is on demand.
When getData is called, return the JSON.
Cons:
Of course we won't store anything extra, so there is not much in memory. But it
depends on how JSONIDF will be used: if the FROM connector uses JSONIDF and
stores data as "JSON", the TO connector might be HDFS and want CSV, so in this
case we do the CSV conversion on demand for every single record. It still makes
sense, though.
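Option #1 could be sketched roughly as follows. This is an illustration only:
LazyJsonIdf is a hypothetical class, a LinkedHashMap stands in for the
JSONObject, the column names stand in for the schema, and the CSV handling is a
naive comma join/split without quoting or escaping.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of Option #1: only the JSON-like data is stored.
class LazyJsonIdf {
    private final String[] columns;     // stand-in for the schema
    private Map<String, Object> data;   // single source of truth

    LazyJsonIdf(String... columns) {
        this.columns = columns;
    }

    void setData(Map<String, Object> json) {
        this.data = new LinkedHashMap<>(json);
    }

    Map<String, Object> getData() {
        return data;
    }

    // Builds the JSON from the array; the array itself is not kept.
    void setObjectData(Object[] values) {
        if (values.length != columns.length) {
            throw new IllegalArgumentException("column count mismatch");
        }
        Map<String, Object> json = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            json.put(columns[i], values[i]);
        }
        this.data = json;
    }

    // Builds the JSON from the CSV text; the text itself is not kept.
    void setCSVText(String csv) {
        // Naive split: all values stay Strings (assumption for brevity).
        setObjectData(csv.split(",", -1));
    }

    // On demand: JSON -> object array.
    Object[] getObjectData() {
        Object[] out = new Object[columns.length];
        for (int i = 0; i < columns.length; i++) {
            out[i] = data.get(columns[i]);
        }
        return out;
    }

    // On demand: JSON -> CSV text, recomputed on every call.
    String getCSVText() {
        return Arrays.stream(getObjectData())
                .map(v -> String.valueOf(v))
                .collect(Collectors.joining(","));
    }
}
```

The per-record cost mentioned under Cons shows up in getCSVText: a TO
connector that reads CSV pays for the conversion on each record.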
Option #2 - Store the data and the CSV
The only reason for this is that we know HDFS is one of the major use cases and
it might use CSV. But that is no longer true looking at the latest code in the
HDFS connector: it does readText in some cases and readArray in others.
Source of truth is JSON and CSV.
When setData is called, store the JSON in data and also convert the JSON to
CSVText and store the CSVText.
When setObjectData is called, construct the CSV and store it in CSVText.
When setCSVText is called, construct the JSON, and store the CSVText.
When getObjectData is called, use the CSVText and return the objectArray, so we
can share the code with CSVIDF, and hence my moving this logic to the base
class is justified.
When getCSVText is called, just return the stored CSVText, so this is a no-op.
When getData is called, return the JSON.
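A rough sketch of Option #2, under the same assumptions as before:
EagerCsvJsonIdf is a hypothetical class, a LinkedHashMap stands in for the
JSONObject, and CSV handling is a naive comma join/split. The list above does
not say whether setObjectData also builds the JSON; the sketch assumes it does
so that getData stays consistent.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of Option #2: JSON and CSV text are both stored.
class EagerCsvJsonIdf {
    private final String[] columns;     // stand-in for the schema
    private Map<String, Object> data;   // JSON-like source of truth
    private String csvText;             // eagerly maintained CSV copy

    EagerCsvJsonIdf(String... columns) {
        this.columns = columns;
    }

    void setData(Map<String, Object> json) {
        this.data = new LinkedHashMap<>(json);
        // Convert to CSV right away, in column order.
        this.csvText = Arrays.stream(columns)
                .map(c -> String.valueOf(json.get(c)))
                .collect(Collectors.joining(","));
    }

    void setObjectData(Object[] values) {
        this.csvText = Arrays.stream(values)
                .map(v -> String.valueOf(v))
                .collect(Collectors.joining(","));
        this.data = toJson(values);     // assumption: keep JSON in sync
    }

    void setCSVText(String csv) {
        this.csvText = csv;
        this.data = toJson(csv.split(",", -1));
    }

    // Derived from the stored CSV text, as a CSV IDF would do it.
    Object[] getObjectData() {
        return csvText.split(",", -1);
    }

    // No conversion: returns the stored text.
    String getCSVText() {
        return csvText;
    }

    Map<String, Object> getData() {
        return data;
    }

    private Map<String, Object> toJson(Object[] values) {
        if (values.length != columns.length) {
            throw new IllegalArgumentException("column count mismatch");
        }
        Map<String, Object> json = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            json.put(columns[i], values[i]);
        }
        return json;
    }
}
```

Compared with Option #1, every setter does more work and two fields must be
kept in sync, but getCSVText becomes free.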
> Does CSVIDF implementation need to store data field and CSV text
> ----------------------------------------------------------------
>
> Key: SQOOP-1957
> URL: https://issues.apache.org/jira/browse/SQOOP-1957
> Project: Sqoop
> Issue Type: Sub-task
> Components: sqoop2-framework
> Reporter: Veena Basavaraj
> Assignee: Veena Basavaraj
> Fix For: no-release
>
>
> can we clean it up? SQOOP-1901 related comment
> https://reviews.apache.org/r/29346/#comment109333
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)