[ https://issues.apache.org/jira/browse/SQOOP-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224824#comment-14224824 ]
Veena Basavaraj edited comment on SQOOP-1771 at 11/25/14 5:48 PM:
------------------------------------------------------------------
I have moved the #1 discussion to https://issues.apache.org/jira/browse/SQOOP-1811 since it is unrelated to this original JIRA.

was (Author: vybs):
To make it more concrete:

#1. The following code can move to the base class and be made final, so there is no way to override it:
{code}
// hold the string
public final String getCSVTextData() {
  return text;
}

public final void setCSVTextData(String text) {
  this.text = text;
}
{code}
There is code in the CSV IDF implementation that holds the rules for CSV parsing; it can be pulled out into a CSV utils class so that the connectors can use it as well.

Second, the T in the CSV IDF happens to be String, which is just a coincidence. If I write a new IDF implementation, T can be a custom object that encapsulates the whole row.

Third, getData and setData can have custom implementations, so they can be overridden to return the generic type T. (Illustrative sketches of these points follow the quoted description below.)

> Investigation CSV IDF FORMAT of the Array/NestedArray/Set/Map in Postgres and HIVE
> -----------------------------------------------------------------------------------
>
>          Key: SQOOP-1771
>          URL: https://issues.apache.org/jira/browse/SQOOP-1771
>      Project: Sqoop
>   Issue Type: Sub-task
>   Components: sqoop2-framework
>     Reporter: Veena Basavaraj
>      Fix For: 1.99.5
>
>
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> The above document does not explicitly state the design goals for choosing the CSV IDF format for the different types, but based on the conversation on one of the related tickets (RB: https://reviews.apache.org/r/28139/diff/#), here are the considerations.
> The Intermediate Data Format is most relevant when we transfer data between the FROM and TO sides and the two do not agree on the same form for the data as it is transferred via Sqoop.
> The IDF API as of today exposes three types of setters and getters: one for a generic type T, one for Text/String, and one for an Object array.
> {code}
> /**
>  * Set one row of data. If validate is set to true, the data is validated
>  * against the schema.
>  * @param data - A single row of data to be moved.
>  */
> public void setData(T data) {
>   this.data = data;
> }
>
> /**
>  * Get one row of data.
>  *
>  * @return - One row of data, represented in the internal/native format of
>  *           the intermediate data format implementation.
>  */
> public T getData() {
>   return data;
> }
>
> /**
>  * Get one row of data as CSV.
>  *
>  * @return - String representing the data in CSV, according to the "FROM" schema.
>  *           No schema conversion is done on textData, to keep it as "high performance" option.
>  */
> public abstract String getTextData();
>
> /**
>  * Set one row of data as CSV.
>  */
> public abstract void setTextData(String text);
>
> /**
>  * Get one row of data as an Object array.
>  *
>  * @return - String representing the data as an Object array.
>  *           If FROM and TO schema exist, we will use SchemaMatcher to get the data according to the "TO" schema.
>  */
> public abstract Object[] getObjectData();
>
> /**
>  * Set one row of data as an Object array.
>  */
> public abstract void setObjectData(Object[] data);
> {code}
> NOTE: the javadocs are not completely accurate; there is really no validation happening :).
> Second, CSV is one way the IDF can be represented, namely when it is TEXT. There can be other implementations of the IDF as well, such as AVRO or JSON, very similar to the SerDe interface in HIVE that allows custom ways to store data; in Sqoop it is custom ways to represent data as it flows via Sqoop.
> Another javadoc fix: "String representing the data in CSV, according to the "FROM" schema. No schema conversion is done on textData, to keep it as "high performance" option." is also not accurate. The CSV format is a standard enforced by the Sqoop implementation; there is no one STANDARD CSV for all data types, especially with nested types. The FROM schema does not enforce any standard.
> Anyway, the design considerations for the CSV IDF implementation seem to be the following. As I said before, other IDF implementations can have other design goals and can be chosen by a particular connector to move data in and out of itself most effectively.
> 1. setText/getText are supposed to allow FROM and TO to talk the same language and hence should involve very minimal transformations as the data flows through Sqoop. This means that both FROM and TO agree to hand over data in the CSV IDF format standardized in the wiki/spec/docs and to read data back in the same format. Transformations may have to happen before setText() or after getText(), but nothing happens in between while the data flows through Sqoop. If the FROM side does a setText and the TO side does a getObject, then time is spent converting the elements within the CSV string into actual Java objects; that means parsing and unescaping/decoding happening inside Sqoop.
> 2. The current proposal seems to recommend the formats that are most prominent in the databases explored in the list, but that is not really a complete set of all the data sources/connectors Sqoop may have in the future. Most of the emphasis is on relational DB stores, since historically Sqoop 1 only supported those as the FROM source:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> But overall the goal seems to be more on the side of SQL dump and pg dump, which use a CSV format, and the hope is that such transfers will happen more often in Sqoop.
> 3. To avoid spending CPU cycles, no validation is done to make sure the data adheres to the CSV format. It is a trust-based system: the incoming data is expected to follow the CSV rules as described in the link above:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> Next, having established these design goals, the format to encode the nested arrays and maps can be chosen in a few ways.
> Two examples were explored below, HIVE and Postgres; details are given below in the comments. One of the simplest ways is to use the universal JSON Jackson API for nested arrays and maps.
> The Postgres format is very similar to that, but needs more hand-rolling instead of relying on a standard JSON library. For both arrays and maps this format can be used as a standard, and between it and actually using the Jackson ObjectMapper the performance is highly unlikely to differ.
> I would still prefer using a standard JSON library for encoding maps and nested arrays, so that the connectors can use the same standard as well.
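To make the base-class point in the edited comment above concrete, here is a minimal sketch of what the shared abstract class could look like, assuming a generic IntermediateDataFormat<T> along the lines of the API quoted in the description. The field names and method set are illustrative only, not the actual Sqoop 2 source:
{code}
// Illustrative sketch only: the CSV text accessors are pulled up into the base
// class and declared final so no subclass can override them, while the generic
// getData()/setData() stay overridable for a custom row type T.
public abstract class IntermediateDataFormat<T> {

  protected volatile T data;
  protected volatile String text;

  // Overridable: an implementation may keep the row as any custom object T.
  public T getData() {
    return data;
  }

  public void setData(T data) {
    this.data = data;
  }

  // Final: every IDF exposes the row as CSV text in exactly the same way.
  public final String getCSVTextData() {
    return text;
  }

  public final void setCSVTextData(String text) {
    this.text = text;
  }

  // Each implementation still defines its own Object[] view of the row.
  public abstract Object[] getObjectData();

  public abstract void setObjectData(Object[] data);
}
{code}
A CSV-based implementation would then only have to keep text and data in sync, while a different IDF (for example an Avro one) could use a record object for T without touching the CSV text contract.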
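For the point about pulling the CSV parsing rules out into a shared utility, a rough sketch of the kind of helper connectors could reuse is below. The escaping convention shown here (single-quoted strings with backslash escapes) is only an assumption made for the example; the normative rules are the ones on the wiki page linked in the description:
{code}
// Hypothetical utility class, not part of the Sqoop 2 code base: illustrates
// moving the per-field CSV encode/decode rules out of the CSV IDF so that
// connectors can call the same code. The escaping rules are assumed here.
public final class CsvFieldUtils {

  private CsvFieldUtils() {
    // static helpers only
  }

  // Wrap a string column in single quotes, escaping backslashes and quotes.
  public static String encodeString(String value) {
    if (value == null) {
      return "NULL";
    }
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
  }

  // Reverse of encodeString(): strip the quotes and undo the escapes.
  public static String decodeString(String field) {
    if ("NULL".equals(field)) {
      return null;
    }
    String body = field.substring(1, field.length() - 1);
    return body.replace("\\'", "'").replace("\\\\", "\\");
  }
}
{code}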
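And for the preference stated at the end of the description, a minimal sketch of how nested arrays and maps could be encoded with a standard JSON library (Jackson's ObjectMapper) before being embedded as a single CSV column. The class and method names are hypothetical, chosen just for the illustration:
{code}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper: serialize nested Array/Map values to a JSON string that
// becomes one CSV column, and parse it back on the other side with readValue().
public class NestedTypeJsonCodec {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static String encode(Object nestedValue) throws Exception {
    return MAPPER.writeValueAsString(nestedValue);
  }

  public static List<?> decodeArray(String json) throws Exception {
    return MAPPER.readValue(json, List.class);
  }

  public static Map<?, ?> decodeMap(String json) throws Exception {
    return MAPPER.readValue(json, Map.class);
  }

  public static void main(String[] args) throws Exception {
    List<Object> nestedArray = Arrays.asList(1, Arrays.asList(2, 3), "x");
    Map<String, Object> map = new LinkedHashMap<>();
    map.put("a", 1);
    map.put("b", Arrays.asList(2, 3));

    System.out.println(encode(nestedArray)); // [1,[2,3],"x"]
    System.out.println(encode(map));         // {"a":1,"b":[2,3]}
  }
}
{code}
Whether connectors hand-roll the Postgres-style text or go through a JSON library like this, the per-row cost is dominated by string building either way, which is consistent with the comment in the description that the performance difference is unlikely to matter.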
-- This message was sent by Atlassian JIRA (v6.3.4#6332)