[jira] [Commented] (SQOOP-1957) Does CSVIDF implementation need to store data field and CSV text

Veena Basavaraj (JIRA) Wed, 24 Dec 2014 13:34:23 -0800

    [ 
https://issues.apache.org/jira/browse/SQOOP-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258543#comment-14258543
 ]


Veena Basavaraj commented on SQOOP-1957:
----------------------------------------

This ticket does not make much sense any more, since I understand the problem 
better!

In the case of CSVIDF, data is the CSV text, so we construct it eagerly when 
the object array is set: setObjectData(..) eagerly constructs the csvText 
(represented by data), and everything else is LAZY, i.e. the object array is 
constructed lazily from the csvText (i.e. data). We can remove one of data / 
csvText and hold the value in one place. I now think this needs to be cleaned 
up in this RB itself; there is no point in holding the same thing in 2 
variables.
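
A minimal sketch of what holding the CSV text in one place could look like. 
The class name SingleFieldCsvIdf and the naive comma-joined encoding (no 
quoting, no escaping, every decoded value a String) are assumptions made for 
illustration; this is not the actual CSVIDF code.

```java
// Hypothetical, simplified stand-in for CSVIDF with a single stored field.
class SingleFieldCsvIdf {
    // The only stored representation; there is no separate "data" field.
    private String csvText;

    void setCsvText(String text) {
        this.csvText = text;
    }

    String getCsvText() {
        return csvText;
    }

    // Eager: the CSV text is built as soon as the object array is set.
    void setObjectData(Object[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]); // naive encoding, assumption of this sketch
        }
        this.csvText = sb.toString();
    }

    // Lazy: the object array is rebuilt from the CSV text only when asked for.
    Object[] getObjectData() {
        return csvText == null ? null : csvText.split(",", -1);
    }
}
```

With this shape the object array is never stored, so there is nothing that 
can drift out of sync with the CSV text.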

In the case of JSONIDF, data is a JSONObject (the same question applies to any 
other IDF, say AvroIDF), so what should be the source of truth? Should it be 
the JSON object alone, i.e. we hold only the JSONObject and lazily construct 
both the CSV and the object array? Or should the CSV always be constructed?

So here are the two options:

Option #1 - Do not store anything but data

Store the source of truth in data.
When setData is called, store the JSON in data and nothing else; everything 
else is lazy.
When setObjectData is called, construct the JSON object from it; do not store 
any CSVText or objectArray.
When setCsvText is called, also construct the JSON object from it; do not store 
any CSVText or objectArray.
When getObjectData is called, convert from JSON to objectArray on demand.
When getCSVText is called, convert from JSON to CSVText on demand.
When getData is called, return the JSON.
Cons:
Of course we won't store anything extra, so not much is held in memory. But 
depending on how JSONIDF is used, e.g. the FROM connector uses JSONIDF and 
stores data as "JSON" while the TO connector is HDFS and wants CSV, we do the 
CSV conversion on demand for every single record. It still makes sense, though.
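
Option #1 could be sketched roughly as follows. Everything here is an 
assumption for illustration: the class name JsonOnlyIdf, the use of a 
Map<String, Object> as a stand-in for JSONObject, the column-name array 
standing in for a schema, and the naive comma-joined CSV encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of Option #1: only the JSON-like data is stored.
class JsonOnlyIdf {
    private final String[] columns;   // assumed schema: column names in order
    private Map<String, Object> data; // stand-in for JSONObject, the only state

    JsonOnlyIdf(String... columns) {
        this.columns = columns;
    }

    void setData(Map<String, Object> json) {
        this.data = json;
    }

    Map<String, Object> getData() {
        return data;
    }

    // Build the JSON-like map from the array; the array itself is not kept.
    void setObjectData(Object[] values) {
        Map<String, Object> json = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            json.put(columns[i], values[i]);
        }
        this.data = json;
    }

    // Build the JSON-like map from the CSV text; the text itself is not kept.
    void setCsvText(String csv) {
        setObjectData(csv.split(",", -1));
    }

    // On demand: JSON -> object array, recomputed on every call.
    Object[] getObjectData() {
        Object[] out = new Object[columns.length];
        for (int i = 0; i < columns.length; i++) {
            out[i] = data.get(columns[i]);
        }
        return out;
    }

    // On demand: JSON -> CSV text, recomputed on every call.
    String getCsvText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(data.get(columns[i]));
        }
        return sb.toString();
    }
}
```

The per-record cost mentioned under Cons shows up in getCsvText(): a TO side 
that reads CSV pays for the conversion on every call.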

Option #2 - Store the data and the CSV, only because we know HDFS is one of 
the major use cases and it might use CSV. But that is no longer true looking 
at the latest code in the HDFS connector: it does readText in some cases and 
readArray in others.

Source of truth is JSON and CSV.
When setData is called, store the JSON in data, and also convert the JSON to 
CSVText and store the CSVText.
When setObjectData is called, construct the CSV and store it as the CSVText.
When setCSVText is called, construct the JSON, and store the csvText.
When getObjectData is called, use the CSVText and return the objectArray, so we 
can share the code in CSVIDF; hence my moving this logic to the base class is 
justified.
When getCSVText is called, just return the stored CSVText, so this is a no-op.
When getData is called, return the JSON.
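
A rough sketch of Option #2, with the CSV-to-array logic in a shared base 
class. The names CsvBackedIdf and JsonAndCsvIdf, the Map<String, Object> 
stand-in for JSONObject, the column-name array and the naive CSV encoding are 
all assumptions. The sketch also assumes setObjectData rebuilds the JSON so 
that getData stays consistent, which the list above does not spell out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical base class: owns the stored CSV text and the CSV -> array logic.
abstract class CsvBackedIdf {
    protected String csvText;

    String getCsvText() {
        return csvText; // already stored, nothing to convert
    }

    Object[] getObjectData() {
        // Shared by all subclasses: decode the stored CSV text.
        return csvText == null ? null : csvText.split(",", -1);
    }

    // Naive encoding (assumption): comma-joined, no quoting or escaping.
    static String toCsv(Object[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }
}

// Hypothetical sketch of Option #2: both the JSON-like data and the CSV are stored.
class JsonAndCsvIdf extends CsvBackedIdf {
    private final String[] columns;   // assumed schema: column names in order
    private Map<String, Object> data; // stand-in for JSONObject

    JsonAndCsvIdf(String... columns) {
        this.columns = columns;
    }

    void setData(Map<String, Object> json) {
        this.data = json;
        Object[] values = new Object[columns.length];
        for (int i = 0; i < columns.length; i++) {
            values[i] = json.get(columns[i]);
        }
        this.csvText = toCsv(values); // eager JSON -> CSV
    }

    void setObjectData(Object[] values) {
        this.csvText = toCsv(values);
        this.data = toJson(values); // assumption: keep getData() in sync
    }

    void setCsvText(String csv) {
        this.csvText = csv;
        this.data = toJson(csv.split(",", -1));
    }

    Map<String, Object> getData() {
        return data;
    }

    private Map<String, Object> toJson(Object[] values) {
        Map<String, Object> json = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            json.put(columns[i], values[i]);
        }
        return json;
    }
}
```

Compared with Option #1, every setter does two conversions up front, and in 
exchange getCsvText() becomes a plain field read.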

> Does CSVIDF implementation need to store data field and CSV text
> ----------------------------------------------------------------
>
>                 Key: SQOOP-1957
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1957
>             Project: Sqoop
>          Issue Type: Sub-task
>          Components: sqoop2-framework
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: no-release
>
>
> can we clean it up? SQOOP-1901 related comment
> https://reviews.apache.org/r/29346/#comment109333



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
