Sorry, I missed the wiki previously. It's very comprehensive, so I can relax now :)
On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> Gwen,
> we've investigated mysqldump, pg_dump, and a few others already; the results
> are on the wiki [1]. The resulting CSV-ish specification follows those two
> very closely.
>
> In the MySQL case specifically, I've looked into mysqldump output rather than
> the "LOAD DATA"/"SELECT INTO OUTFILE" statements, because "LOAD DATA" requires
> the file to exist on the database machine, whereas mysqldump/mysqlimport
> allows us to import data into the database from any machine on the Hadoop
> cluster.
>
> Jarcec
>
> Links:
> 1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>
>> On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]> wrote:
>>
>> Agreed. I hope we'll have at least one direct connector real soon now
>> to prove it.
>>
>> Reading this:
>> http://dev.mysql.com/doc/refman/5.6/en/load-data.html
>> was a bit discouraging...
>>
>> On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]> wrote:
>>> My understanding is that MySQL and PostgreSQL can output to CSV in the
>>> suggested format.
>>>
>>> NOTE: the getTextData() and setTextData() APIs are effectively useless if
>>> reduced-processing load is not possible.
>>>
>>> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote:
>>>
>>>> (hijacking the thread a bit for a related point)
>>>>
>>>> I have some misgivings about how we manage the IDF now.
>>>>
>>>> We went with a pretty specific CSV in order to avoid extra processing
>>>> for the MySQL/Postgres direct connectors. I think the intent is to allow
>>>> running LOAD DATA without any processing. Therefore we need to research
>>>> and document the specific formats required by MySQL and Postgres. Both
>>>> DBs have pretty specific (and often funky) formatting requirements (if
>>>> escaping is not used, NULL is written as the word NULL; otherwise as
>>>> \N...).
>>>>
>>>> If zero-processing load is not feasible, I'd reconsider the IDF and
>>>> lean toward a more structured format (Avro?). If the connectors need
>>>> to parse the CSV and modify it, we are not gaining anything here. Or,
>>>> at the very least, benchmark to validate that CSV+processing is still
>>>> the fastest / least-CPU option.
>>>>
>>>> Gwen
>>>>
>>>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>> Indeed. I created SQOOP-1678 to address #1. Let me re-define it...
>>>>>
>>>>> Also, for #2... There are a few ways of generating output. It seems NULL
>>>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes
>>>>> sense.
>>>>>
>>>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>>>
>>>>>> I share Gwen's point of view. The CSV format for the IDF is very strict
>>>>>> so that we have a minimal surface area for inconsistencies between
>>>>>> multiple connectors. This is because the IDF is an agreed-upon exchange
>>>>>> format when transferring data from one connector to the other. That,
>>>>>> however, shouldn't stop one connector (such as HDFS) from offering ways
>>>>>> to save the resulting CSV differently.
>>>>>>
>>>>>> We had a similar discussion about separator and quote characters in
>>>>>> SQOOP-1522 that seems relevant to the NULL discussion here.
>>>>>>
>>>>>> Jarcec
>>>>>>
>>>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
>>>>>>>
>>>>>>> I think it's two different things:
>>>>>>>
>>>>>>> 1. The HDFS connector should give more control over the formatting of
>>>>>>> the data in text files (nulls, escaping, etc.).
>>>>>>> 2. The IDF should give NULLs in a format that is optimized for the
>>>>>>> MySQL/Postgres direct connectors (since that's one of the IDF design
>>>>>>> goals).
>>>>>>>
>>>>>>> Gwen
>>>>>>>
>>>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> Any thoughts on where configurable NULL values should live? In the
>>>>>>>> IDF or the HDFS connector?
>>>>>>>>
>>>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>>>>>>>>
>>>>>>>> -Abe
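For readers of the archive, a minimal self-contained sketch of the NULL-representation divergence discussed in the thread. This is plain Java written for illustration only, not Sqoop2 or MySQL code; the NullStyle enum, the render() helper, and the sample record are all invented here. It shows why a direct connector that expects one convention (e.g. \N, as the thread describes for escaped mysqldump/LOAD DATA output) would have to parse and rewrite IDF text that uses another (the literal word NULL, or an empty field), which is exactly the extra processing pass Gwen is warning about.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only -- not Sqoop2 or MySQL code. Renders one record to CSV
// under the different NULL conventions mentioned in the thread ("\N", the
// literal word NULL, or an empty field).
public class NullStyleDemo {

    enum NullStyle { BACKSLASH_N, LITERAL_NULL, EMPTY }

    static String nullToken(NullStyle style) {
        switch (style) {
            case BACKSLASH_N:  return "\\N";  // escaped convention described in the thread
            case LITERAL_NULL: return "NULL"; // unescaped convention
            default:           return "";     // some tools simply leave the field empty
        }
    }

    static String render(List<String> record, NullStyle style) {
        return record.stream()
                .map(f -> f == null ? nullToken(style) : "'" + f + "'")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> record = Arrays.asList("1", null, "2014-12-01");
        for (NullStyle style : NullStyle.values()) {
            System.out.println(style + " -> " + render(record, style));
        }
    }
}

If the IDF standardizes on one token while the target database tool expects another, the direct connector ends up doing roughly this kind of field-by-field rewrite on every record, which is the cost the thread weighs against switching to a more structured format such as Avro.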
