Re: Configurable NULL in IDF or Connector?

Veena Basavaraj Mon, 01 Dec 2014 15:21:06 -0800

Jarcec,

It would be great if we were more metrics-driven, with some tests and/or
benchmarks to prove how much faster this would be. Just a suggestion.


Gwen probably meant the same as well.

Best,
*./Vee*

On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]>
wrote:

> Gwen,
> we’ve already investigated mysqldump, pg_dump and a few others; the results
> are on the wiki [1]. The resulting CSV-ish specification follows those
> two very closely.
>
> In the MySQL case specifically, I’ve looked into mysqldump output rather than
> the “LOAD DATA”/“SELECT INTO OUTFILE” statements because “LOAD DATA” requires
> the file to exist on the database machine, whereas mysqldump/mysqlimport
> allows us to import data to the database from any machine on the Hadoop
> cluster.
>
> Jarcec
>
> Links:
> 1:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>
> > On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]> wrote:
> >
> > Agreed. I hope we'll have at least one direct connector real soon now
> > to prove it.
> >
> > Reading this:
> > http://dev.mysql.com/doc/refman/5.6/en/load-data.html
> > was a bit discouraging...
> >
> > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]> wrote:
> >> My understanding is that MySQL and PostgreSQL can output to CSV in the
> >> suggested format.
> >>
> >> NOTE: getTextData() and setTextData() APIs are effectively useless if
> >> reduced processing load is not possible.
> >>
> >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote:
> >>
> >>> (hijacking the thread a bit for a related point)
> >>>
> >>> I have some misgivings around how we manage the IDF now.
> >>>
> >>> We go with a pretty specific CSV in order to avoid extra processing
> >>> for MySQL/Postgres direct connectors.
> >>> I think the intent is to allow running LOAD DATA without any processing.
> >>> Therefore we need to research and document the specific formats
> >>> required by MySQL and Postgres. Both DBs have pretty specific (and
> >>> often funky) formatting they need (if escaping is not used then NULL
> >>> is written as the word NULL, otherwise as \N...)
> >>>
> >>> If zero-processing load is not feasible, I'd reconsider the IDF and
> >>> lean toward a more structured format (Avro?). If the connectors need
> >>> to parse the CSV and modify it, we are not gaining anything here. Or,
> >>> at the very least, we should benchmark to validate that CSV+processing
> >>> is still the fastest / least CPU option.
> >>>
> >>> Gwen
> >>>
> >>>
> >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]>
> >>> wrote:
> >>>> Indeed. I created SQOOP-1678, which is intended to address #1. Let me
> >>>> re-define it...
> >>>>
> >>>> Also, for #2... There are a few ways of generating output. It seems NULL
> >>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes sense.
> >>>>
> >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I share Gwen's point of view. The CSV format for the IDF is very
> >>>>> strict so that we have minimal surface area for inconsistencies between
> >>>>> multiple connectors. This is because the IDF is an agreed-upon exchange
> >>>>> format when transferring data from one connector to another. That, however,
> >>>>> shouldn't stop a connector (such as HDFS) from offering ways to save the
> >>>>> resulting CSV differently.
> >>>>>
> >>>>> We had a similar discussion about separator and quote characters in
> >>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
> >>>>>
> >>>>> Jarcec
> >>>>>
> >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
> >>>>>>
> >>>>>> I think it's two different things:
> >>>>>>
> >>>>>> 1. HDFS connector should give more control over the formatting of the
> >>>>>> data in text files (nulls, escaping, etc.)
> >>>>>> 2. IDF should give NULLs in a format that is optimized for
> >>>>>> MySQL/Postgres direct connectors (since that's one of the IDF design
> >>>>>> goals).
> >>>>>>
> >>>>>> Gwen
> >>>>>>
> >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> wrote:
> >>>>>>> Hey guys,
> >>>>>>>
> >>>>>>> Any thoughts on where configurable NULL values should be? Either the
> >>>>>>> IDF or HDFS connector?
> >>>>>>>
> >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
> >>>>>>>
> >>>>>>> -Abe
> >>>>>
> >>>>>
> >>>
>
>
