On Nov. 19, 2014, 1 a.m., Veena Basavaraj wrote:
> > Jarcec
> 
> Veena Basavaraj wrote:
>     please see the latest revision; it uses JSON. It is pretty universal,
>     so I would rather use JSON.
> 
> Jarek Cecho wrote:
>     Why JSON for the intermediate data format? I would much rather use
>     something that is used by pg_dump and mysqldump (or Hive) rather than
>     JSON. The IDF is just an intermediate format; we should be as efficient
>     here as possible and should not be doing conversions from format to
>     format.
> 
> Veena Basavaraj wrote:
>     It so happens that they use the same format too. I would prefer you look
>     at the patch to see why we are using an encoding such as JSON to
>     represent arrays and maps. Imagine a case like [[a,b] [z,q] [w,e]] - it
>     will simply be represented as a JSON string, so parsing it back into
>     actual array-of-arrays objects is a no-brainer: we just use the JSON
>     parse method. The code also tells the CSV IDF to ignore any commas
>     within the JSON string.
> 
> Jarek Cecho wrote:
>     As long as pg_dump and Hive are using the same format, I'm fine. But I
>     don't want to introduce a different format than those two without a good
>     reason.
> 
> Veena Basavaraj wrote:
>     I don't know about Hive, and if there is no standard they all use, then
>     we cannot do much.
>     And unless we have Hive / Postgres connectors using this code, I would
>     not over-optimize for it. JSON serializers are pretty fast, especially
>     Jackson, which we should switch to in many places anyway; there is a
>     ticket for that as well.
>     I would cross that bridge when we get there; if performance is a concern
>     for certain data sources, changing this to any other format should be
>     straightforward.
> 
> Jarek Cecho wrote:
>     I believe that part of this JIRA is to go and explore how Hive and
>     pg_dump serialize data, so that we can provide the best approach in
>     Sqoop 2. That is what we did when designing the CSV IDF, and we should
>     keep doing that. The CSV IDF is not internal, as it's exposed as a
>     standard to other connectors, and we should not change it in the future
>     in a backward-incompatible way. Hence we should be careful when
>     introducing new syntax and rules, and do the due diligence of
>     investigating what is best.
> 
> Veena Basavaraj wrote:
>     I am still not sure what the concern is; how we store the IDF is, and
>     should be, internal to Sqoop. I will let Abe weigh in as well. The fact
>     that it is called an IDF means Sqoop is responsible for reading from the
>     FROM side, storing the data in its IDF, and then, when it passes the
>     data to the output committer / format (in the case of MR), converting it
>     back into Java objects. So if I change this internal representation from
>     CSV to something else, it should not affect any connector developer.
>     
>     Of course, if I change what an Array means, then it will affect them.
> 
> Abraham Elmahrek wrote:
>     I agree with Jarcec on this. JSON was a suggestion that I gave, but I do
>     think that we should try to align with pg_dump, mysqldump, Oracle export
>     to CSV, Hive, etc. If there is no standard, we should choose the most
>     generic and frequently used one.
> 
> Veena Basavaraj wrote:
>     There are two different things going on in this thread. I am not sure
>     which one you are agreeing with.
>     
>     If you agree on using the standard that is used by most, then so do I -
>     and it happens to be JSON.
> 
> Veena Basavaraj wrote:
>     So, after a bit more discussion with Abe, here is a detailed summary of
>     why I think we should standardize on what setText should accept and what
>     getText will return in the IDF, and that should apply to every column
>     type we support, not just arrays.
>     
>     Let's all be on the same page about the terminology we use.
>     
>     1. IDF is an abstract class that supports serializing / deserializing
>     data in Sqoop. It exposes many methods in its API; the relevant ones are
>     (see the sketch after this list):
>      - setTextData / getTextData - deal with the entire row as one string.
>      - setObjectData / getObjectData - deal with the entire row as an object
>     array.
>      - setSchema (with column types) - very relevant when we use
>     setObjectData, since we want to know what every object in that array
>     really is: a string, a number, a map, a list, an enum, etc. This
>     information about the types of the columns within the object array is
>     very much required when we serialize/deserialize between the two
>     representations. Read below to see when the column type comes in handy.
>     2. CSV happens to be one such implementation of IDF. But what really
>     makes the implementations different is how they treat the object data
>     internally. Read below to understand why I say this.
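>     
>     To make this concrete, here is a rough sketch of the API shape I am
>     describing. This is simplified and illustrative only - the real class is
>     org.apache.sqoop.connector.idf.IntermediateDataFormat, and its exact
>     signatures may differ:
>     
>     // Illustrative sketch, not the actual Sqoop source.
>     public abstract class IntermediateDataFormat<T> {
>     
>       protected Schema schema;    // column types: TEXT, MAP, ARRAY, ENUM, ...
>     
>       public void setSchema(Schema schema) { this.schema = schema; }
>     
>       // The entire row as one encoded string.
>       public abstract void setTextData(String text);
>       public abstract String getTextData();
>     
>       // The entire row as an Object[]; element types follow the schema.
>       public abstract void setObjectData(Object[] data);
>       public abstract Object[] getObjectData();
>     }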
>     
>     
>     Note that the getText and setText method implementations do not vary per
>     IDF implementation at all. They are dumb methods that put all the burden
>     on the connector to provide a correctly formatted string that the
>     corresponding TO side can use to infer what the data is. HDFS to HDFS is
>     one such trivial use case: no matter which IDF implementation I use, it
>     does not matter, since I write as text and retrieve as text.
>     
>     Since Sqoop does not even validate what format the text is in, it is up
>     to the HDFS connector to store it in any format.
>     
>     The more interesting use case is this:
>     
>     Say the FROM data source writes the data as text and the TO data source
>     now wants to read it as an object array to load it into its database.
>     Now the schema and column types become very important, since we are
>     dealing with two different representations.
>     
>     When the FROM side sets the text data, we expect it to really be in a
>     format Sqoop can parse into objects.
>     
>     Let's take an example:
>     
>     Say I have a table "users" with the fields "name" and "description",
>     both varchar. In the schema I map these varchars to the IDF's TEXT type.
>     All good so far.
>     
>     Now the data for the name is (vee,na) and the description is (she is an
>     engineer). Yes, the name has a comma within it - totally possible - and
>     it could just as well contain a quote or any other special character...
>     
>     Now we are hoping that when vee,na is set, it is properly escaped for
>     the comma and then quoted.
>     
>     This means the connector developers now have to be aware of all these
>     rules. They cannot just dump the data.
>     
>     You may ask why escaping is required.
>     
>     If they did not escape the ",", then when Sqoop stores this in the text
>     representation it will store it exactly as it was given (say the
>     connector developer decided to store it comma-separated; they could just
>     as well decide to call toJSONString and give us that string, since we
>     clearly have not documented what we expect - but that's a different
>     story)... so let's say they do CSV:
>     
>     vee,na,she is an engineer
>     
>     On the TO side, as I said before, say the data source wants all of this
>     data as an object array.
>     
>     The CSV implementation of the IDF will use the "," as a separator,
>     
>     so it will create 3 array elements, since it splits on the ",":
>     
>     object[0] = vee
>     object[1] = na
>     object[2] = she is an engineer
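>     
>     In code, the failure mode is essentially the following (an illustrative
>     fragment only, not the actual CSV IDF implementation):
>     
>     // A naive comma split loses the column boundary inside "vee,na".
>     String text = "vee,na,she is an engineer";
>     String[] fields = text.split(",");
>     // fields = ["vee", "na", "she is an engineer"] - three fields, but the
>     // schema only has two columns (name, description).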
>     
>     This is clearly not what we wanted. That was a very simple use case with
>     strings, and it already fails. Should we just bail out and say the
>     connector did not encode the data well, or rather be more proactive and
>     have a standardized format that we use to encode every element within
>     the CSV?
>     
>     Now imagine the use cases with numbers, floating-point values, maps,
>     arrays, and nested arrays. Commas will appear if proper encoding is not
>     used.
>     
>     
>     So if we standardize and say that all text data should be encoded as
>     JSON, and that any text data you get back will be JSON, then it's a
>     pretty simple rule to follow: there is no need for custom regex,
>     escaping, unescaping, and decoding rules in either the connectors or
>     Sqoop when we transfer data between the text and object formats and vice
>     versa.
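>     
>     For illustration (using json-simple purely as an example library here -
>     Jackson or anything else would work the same way), the encode/decode
>     round trip for the row above would look roughly like this:
>     
>     import org.json.simple.JSONArray;
>     import org.json.simple.JSONValue;
>     
>     public class EncodingExample {
>         public static void main(String[] args) {
>             JSONArray row = new JSONArray();
>             row.add("vee,na");                    // name, embedded comma and all
>             row.add("she is an engineer");        // description
>             String encoded = row.toJSONString();  // ["vee,na","she is an engineer"]
>     
>             // Parsing it back recovers exactly two elements; no custom
>             // escaping or regex rules are needed anywhere.
>             JSONArray parsed = (JSONArray) JSONValue.parse(encoded);
>             System.out.println(parsed.size());    // prints 2
>         }
>     }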
>     
>     
>     After writing this, I feel that calling the IDF "internal" is completely
>     misleading.
>     
>     For the use case I talked about, where FROM does setObjectData(Object[])
>     and TO does a getText(), the internal format is totally exposed, so it
>     is not internal anymore.
>     
>     But if we say the two supported use cases are setText/getText and
>     setObject/getObject, then the internal representation is truly internal.
>     
>     But the former is more of a reality than the latter, unless we mandate
>     using an Object[] array for both the extract and load phases.

Thank you for the example, Veena. This scenario actually should not happen, as 
the returned "text" format is documented on our wiki [1]. All connectors are 
required to use this format for the text representation, and that is why it's 
so important to define it clearly. In other words, the text format is not 
internal to Sqoop; it's exposed to the connector developers.

The reason for that is that most of the high-performance data transfer 
mechanisms (mysqldump, pg_dump, Netezza external tables) will give us the data 
as text, and we want to take it and, with the smallest CPU burden possible, 
move it through Sqoop's internals to the target destination. The most 
performant example would be database to text on HDFS, and in this case we 
don't want to do any conversions at all - in particular, we don't want to read 
and construct JSON objects. That is why we've defined the IDF text format on 
the wiki by following what is already out there, and I would like to stick to 
it.
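
To make the pass-through point concrete, the fast path is essentially the
following (the variable names here are hypothetical, just to illustrate the
flow - this is not actual connector code):

// FROM side: hand the mysqldump/pg_dump output line to the IDF as-is.
idf.setTextData(dumpLine);

// TO side: write the same string straight to the HDFS file - no per-row
// parsing and no JSON object construction in between.
hdfsWriter.write(idf.getTextData());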

Links:
1: 
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal


- Jarek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28139/#review62081
-----------------------------------------------------------


On Nov. 19, 2014, 5:08 p.m., Veena Basavaraj wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28139/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2014, 5:08 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-1749
>     https://issues.apache.org/jira/browse/SQOOP-1749
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> see JIRA
> 
> Supporting nested arrays is going to be dodgy logic.
> 
> Some of the assumptions made here in parsing need more documentation,
> especially the use of special characters to parse nested arrays.
> 
> Feedback on how I can improve this further is welcome.
> 
> I am sure more tests can be added.
> 
> 
> NOTE: Whitespace issues will be addressed once the patch has been reviewed
> for functionality, and this needs a rebase since I have other patches
> pending review that overlap.
> 
> 
> Diffs
> -----
> 
>   
> connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java
>  39a01c1b5b290903a6c93dbb8427af525515e2b2 
>   
> connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/IntermediateDataFormat.java
>  5ef6fc64ec695fbdf2f9b6acec1eb0235fd675be 
>   
> connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/IntermediateDataFormatError.java
>  4d41679d0b4b9e04d05eef4dd5280f201bbe58e2 
>   
> connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/TestCSVIntermediateDataFormat.java
>  fcf6c3c15c6558cf21a0d82c78a51903d8fb371c 
> 
> Diff: https://reviews.apache.org/r/28139/diff/
> 
> 
> Testing
> -------
> 
> Yes, unit tests were added. 
> 
> 
> File Attachments
> ----------------
> 
> SQOOP-1749-v2.patch
>   
> https://reviews.apache.org/media/uploaded/files/2014/11/19/cfdfc0ed-57c9-40d3-959b-f482f1600c03__SQOOP-1749-v2.patch
> 
> 
> Thanks,
> 
> Veena Basavaraj
> 
>
