[ https://issues.apache.org/jira/browse/SQOOP-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224824#comment-14224824 ]
Veena Basavaraj edited comment on SQOOP-1771 at 11/25/14 5:48 PM:
------------------------------------------------------------------
I have moved the #1 discussion to https://issues.apache.org/jira/browse/SQOOP-1811 since it is unrelated to this original JIRA.

was (Author: vybs):
To make it more concrete:

#1. The following code can move to the base class and be made final, so there is no way to override it:
{code}
// hold the string
public final String getCSVTextData() {
  return text;
}

public final void setCSVTextData(String text) {
  this.text = text;
}
{code}
There is code in the CSV IDF implementation that holds the rules for CSV parsing; it can be pulled out into a CSV utils class so that the connectors can use it as well.

Second, the T in the CSV IDF happens to be String, which is just a coincidence. If I write a new IDF implementation, T can be a custom object that encapsulates the whole row.

Third, getData and setData can have custom implementations, so they can be overridden to return the generic type T. (Illustrative sketches of these points follow the quoted description below.)

> Investigation CSV IDF FORMAT of the Array/NestedArray/Set/Map in Postgres and HIVE
> -----------------------------------------------------------------------------------
>
>          Key: SQOOP-1771
>          URL: https://issues.apache.org/jira/browse/SQOOP-1771
>      Project: Sqoop
>   Issue Type: Sub-task
>   Components: sqoop2-framework
>     Reporter: Veena Basavaraj
>      Fix For: 1.99.5
>
>
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> The above document does not explicitly state the design goals for choosing the CSV IDF format for the different types, but based on the conversation on one of the related tickets (RB: https://reviews.apache.org/r/28139/diff/#), here are the considerations.
> The Intermediate Data Format is most relevant when we transfer data between the FROM and TO sides and the two do not agree on the same form for the data as it is transferred via Sqoop.
> The IDF API as of today exposes three types of setters and getters: one for a generic type T, one for Text/String, and one for an Object array.
> {code}
> /**
>  * Set one row of data. If validate is set to true, the data is validated
>  * against the schema.
>  * @param data - A single row of data to be moved.
>  */
> public void setData(T data) {
>   this.data = data;
> }
>
> /**
>  * Get one row of data.
>  *
>  * @return - One row of data, represented in the internal/native format of
>  *           the intermediate data format implementation.
>  */
> public T getData() {
>   return data;
> }
>
> /**
>  * Get one row of data as CSV.
>  *
>  * @return - String representing the data in CSV, according to the "FROM" schema.
>  *           No schema conversion is done on textData, to keep it as "high performance" option.
>  */
> public abstract String getTextData();
>
> /**
>  * Set one row of data as CSV.
>  */
> public abstract void setTextData(String text);
>
> /**
>  * Get one row of data as an Object array.
>  *
>  * @return - String representing the data as an Object array.
>  *           If FROM and TO schema exist, we will use SchemaMatcher to get the data according to the "TO" schema.
>  */
> public abstract Object[] getObjectData();
>
> /**
>  * Set one row of data as an Object array.
>  */
> public abstract void setObjectData(Object[] data);
> {code}
> NOTE: the javadocs are not completely accurate; there is really no validation happening :).
> Second, CSV is one way the IDF can be represented, namely when it is TEXT. There can be other implementations of the IDF as well, such as AVRO or JSON, very similar to the SerDe interface in HIVE that allows custom ways to store data; in Sqoop it is custom ways to represent data as it flows via Sqoop.
> Another javadoc fix: "String representing the data in CSV, according to the "FROM" schema. No schema conversion is done on textData, to keep it as "high performance" option." is also not accurate. The CSV format is a standard enforced by the Sqoop implementation; there is no one STANDARD CSV for all data types, especially with nested types. The FROM schema does not enforce any standard.
> Anyway, the design considerations for the CSV IDF implementation seem to be the following. As I said before, other IDF implementations can have other design goals and can be chosen by a particular connector to move data in and out of itself most effectively.
> 1. setText/getText are supposed to allow FROM and TO to talk the same language and hence should involve very minimal transformations as the data flows through Sqoop. This means that both FROM and TO agree to hand over data in the CSV IDF format standardized in the wiki/spec/docs and to read data back in the same format. Transformations may have to happen before setText() or after getText(), but nothing happens in between while the data flows through Sqoop. If the FROM side does a setText and the TO side does a getObject, then time is spent converting the elements within the CSV string into actual Java objects; that means parsing and unescaping/decoding happening inside Sqoop.
> 2. The current proposal seems to recommend the formats that are most prominent in the databases explored in the list, but that is not really a complete set of all the data sources/connectors Sqoop may have in the future. Most of the emphasis is on relational DB stores, since historically Sqoop 1 only supported those as the FROM source:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> But overall the goal seems to be more on the side of SQL dump and pg dump, which use a CSV format, and the hope is that such transfers will happen more often in Sqoop.
> 3. To avoid spending CPU cycles, no validation is done to make sure the data adheres to the CSV format. It is a trust-based system: the incoming data is expected to follow the CSV rules as described in the link above:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> Next, having established these design goals, the format to encode the nested arrays and maps can be chosen in a few ways.
> Two examples were explored below, HIVE and Postgres; details are given below in the comments. One of the simplest ways is to use the universal JSON Jackson API for nested arrays and maps.
> The Postgres format is very similar to that, but needs more hand-rolling instead of relying on a standard JSON library. For both arrays and maps this format can be used as a standard, and between it and actually using the Jackson ObjectMapper the performance is highly unlikely to differ.
> I would still prefer using a standard JSON library for encoding maps and nested arrays, so that the connectors can use the same standard as well.
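To make the base-class point in the edited comment above concrete, here is a minimal sketch of what the shared abstract class could look like, assuming a generic IntermediateDataFormat<T> along the lines of the API quoted in the description. The field names and method set are illustrative only, not the actual Sqoop 2 source:
{code}
// Illustrative sketch only: the CSV text accessors are pulled up into the base
// class and declared final so no subclass can override them, while the generic
// getData()/setData() stay overridable for a custom row type T.
public abstract class IntermediateDataFormat<T> {

  protected volatile T data;
  protected volatile String text;

  // Overridable: an implementation may keep the row as any custom object T.
  public T getData() {
    return data;
  }

  public void setData(T data) {
    this.data = data;
  }

  // Final: every IDF exposes the row as CSV text in exactly the same way.
  public final String getCSVTextData() {
    return text;
  }

  public final void setCSVTextData(String text) {
    this.text = text;
  }

  // Each implementation still defines its own Object[] view of the row.
  public abstract Object[] getObjectData();

  public abstract void setObjectData(Object[] data);
}
{code}
A CSV-based implementation would then only have to keep text and data in sync, while a different IDF (for example an Avro one) could use a record object for T without touching the CSV text contract.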
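For the point about pulling the CSV parsing rules out into a shared utility, a rough sketch of the kind of helper connectors could reuse is below. The escaping convention shown here (single-quoted strings with backslash escapes) is only an assumption made for the example; the normative rules are the ones on the wiki page linked in the description:
{code}
// Hypothetical utility class, not part of the Sqoop 2 code base: illustrates
// moving the per-field CSV encode/decode rules out of the CSV IDF so that
// connectors can call the same code. The escaping rules are assumed here.
public final class CsvFieldUtils {

  private CsvFieldUtils() {
    // static helpers only
  }

  // Wrap a string column in single quotes, escaping backslashes and quotes.
  public static String encodeString(String value) {
    if (value == null) {
      return "NULL";
    }
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
  }

  // Reverse of encodeString(): strip the quotes and undo the escapes.
  public static String decodeString(String field) {
    if ("NULL".equals(field)) {
      return null;
    }
    String body = field.substring(1, field.length() - 1);
    return body.replace("\\'", "'").replace("\\\\", "\\");
  }
}
{code}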
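And for the preference stated at the end of the description, a minimal sketch of how nested arrays and maps could be encoded with a standard JSON library (Jackson's ObjectMapper) before being embedded as a single CSV column. The class and method names are hypothetical, chosen just for the illustration:
{code}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper: serialize nested Array/Map values to a JSON string that
// becomes one CSV column, and parse it back on the other side with readValue().
public class NestedTypeJsonCodec {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static String encode(Object nestedValue) throws Exception {
    return MAPPER.writeValueAsString(nestedValue);
  }

  public static List<?> decodeArray(String json) throws Exception {
    return MAPPER.readValue(json, List.class);
  }

  public static Map<?, ?> decodeMap(String json) throws Exception {
    return MAPPER.readValue(json, Map.class);
  }

  public static void main(String[] args) throws Exception {
    List<Object> nestedArray = Arrays.asList(1, Arrays.asList(2, 3), "x");
    Map<String, Object> map = new LinkedHashMap<>();
    map.put("a", 1);
    map.put("b", Arrays.asList(2, 3));

    System.out.println(encode(nestedArray)); // [1,[2,3],"x"]
    System.out.println(encode(map));         // {"a":1,"b":[2,3]}
  }
}
{code}
Whether connectors hand-roll the Postgres-style text or go through a JSON library like this, the per-row cost is dominated by string building either way, which is consistent with the comment in the description that the performance difference is unlikely to matter.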
-- This message was sent by Atlassian JIRA (v6.3.4#6332)