Ah, please do add these details as a comment to the wiki, Gwen. I am glad we
discussed this.

Also, the data set size, the structure of the data set (nested types or not),
the environment (machines), and various other things matter when we say
writing CSV is 20% faster than Avro. But it is a tradeoff one should be able
to make, and choose what works best for them. In some cases, one might be
willing to trade off some speed for a structured format such as Avro; it
might also be the case that the destination format (the TO) expects the data
to be written in Avro.
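
For what it's worth, here is a rough sketch (plain Java, not Sqoop code; the
class name, schema, row count and file names are made up just for
illustration) of the kind of micro-benchmark I have in mind for comparing the
two write paths:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical micro-benchmark: write the same rows as plain CSV text and as
// an Avro container file, then compare wall-clock times. Schema, row count
// and file names are illustrative only.
public class CsvVsAvroWriteBench {

  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}";

  public static void main(String[] args) throws IOException {
    final int rows = 1000000;

    // CSV path: simple text writes, no schema enforcement.
    long csvStart = System.nanoTime();
    BufferedWriter csv = new BufferedWriter(new FileWriter("rows.csv"));
    try {
      for (long i = 0; i < rows; i++) {
        csv.write(i + ",'name-" + i + "'\n");
      }
    } finally {
      csv.close();
    }
    long csvMs = (System.nanoTime() - csvStart) / 1000000L;

    // Avro path: schema-aware binary container file.
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    long avroStart = System.nanoTime();
    DataFileWriter<GenericRecord> avro =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    try {
      avro.create(schema, new File("rows.avro"));
      for (long i = 0; i < rows; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", i);
        record.put("name", "name-" + i);
        avro.append(record);
      }
    } finally {
      avro.close();
    }
    long avroMs = (System.nanoTime() - avroStart) / 1000000L;

    System.out.println("CSV:  " + csvMs + " ms");
    System.out.println("Avro: " + avroMs + " ms");
  }
}

The absolute numbers from a toy like this would of course still depend on the
data set shape and the machines, but it would at least make the "CSV is 20%
faster" claim reproducible.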




Best,
*./Vee*

On Mon, Dec 1, 2014 at 3:31 PM, Gwen Shapira <[email protected]> wrote:

> Performance numbers would be sweet at some point for sure.
> Based on some rough tests we did in the field (on another project),
> Avro serialization does have significant overhead (I think Hive
> writing CSV was 20% faster than writing Avro; I can dig up my results
> later). It may be even worse for Sqoop, since Hive does serialization
> in batches.
>
> This is not completely scientific, but leads me to believe that as
> much as I love Avro, we'll need a good reason to use it internally.
>
> On Mon, Dec 1, 2014 at 3:19 PM, Veena Basavaraj <[email protected]>
> wrote:
> > Jarcec,
> >
> > If we were more metrics driven, with some tests and/or benchmarks to
> > prove how much faster this would be, it would have been great. Just a
> > suggestion.
> >
> > Gwen probably meant the same as well.
> >
> >
> >
> >
> >
> >
> > Best,
> > *./Vee*
> >
> > On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]>
> > wrote:
> >
> >> Gwen,
> >> we’ve investigated mysqldump, pg_dump, and a few others already; the
> >> results are on the wiki [1]. The resulting CSV-ish specification follows
> >> those two very closely.
> >>
> >> In the MySQL case specifically, I’ve looked into mysqldump output rather
> >> than the “LOAD DATA”/“SELECT INTO OUTFILE” statements, because “LOAD DATA”
> >> requires the file to exist on the database machine, whereas
> >> mysqldump/mysqlimport allows us to import data to the database from any
> >> machine on the Hadoop cluster.
> >>
> >> Jarcec
> >>
> >> Links:
> >> 1:
> >>
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
> >>
> >> > On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]>
> wrote:
> >> >
> >> > Agreed. I hope we'll have at least one direct connector real soon now
> >> > to prove it.
> >> >
> >> > Reading this:
> >> > http://dev.mysql.com/doc/refman/5.6/en/load-data.html
> >> > was a bit discouraging...
> >> >
> >> > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]>
> >> wrote:
> >> >> My understanding is that MySQL and PostgreSQL can output to CSV in
> >> >> the suggested format.
> >> >>
> >> >> NOTE: the getTextData() and setTextData() APIs are effectively useless
> >> >> if reduced processing load is not possible.
> >> >>
> >> >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]
> >
> >> wrote:
> >> >>
> >> >>> (hijacking the thread a bit for a related point)
> >> >>>
> >> >>> I have some misgivings around how we manage the IDF now.
> >> >>>
> >> >>> We go with a pretty specific CSV in order to avoid extra processing
> >> >>> for MySQL/Postgres direct connectors.
> >> >>> I think the intent is to allow running LOAD DATA without any
> >> >>> processing.
> >> >>> Therefore we need to research and document the specific formats
> >> >>> required by MySQL and Postgres. Both DBs have pretty specific (and
> >> >>> often funky) formatting requirements (if escaping is not used then
> >> >>> NULL is null, otherwise \N...).
> >> >>>
> >> >>> If zero-processing load is not feasible, I'd reconsider the IDF and
> >> >>> lean toward a more structured format (Avro?). If the connectors need
> >> >>> to parse the CSV and modify it, we are not gaining anything here. Or
> >> >>> at the very least, benchmark to validate that CSV+processing is still
> >> >>> the fastest / least-CPU option.
> >> >>>
> >> >>> Gwen
> >> >>>
> >> >>>
> >> >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]
> >
> >> >>> wrote:
> >> >>>> Indeed. I created SQOOP-1678 to address #1. Let me re-define
> >> >>>> it...
> >> >>>>
> >> >>>> Also, for #2... There are a few ways of generating output. It seems
> >> >>>> NULL values range from "\N" to 0x0 to "NULL". I think keeping NULL
> >> >>>> makes sense.
> >> >>>>
> >> >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <
> >> [email protected]>
> >> >>>> wrote:
> >> >>>>
> >> >>>>> I do share the same point of view as Gwen. The CSV format for the
> >> >>>>> IDF is very strict so that we have a minimal surface area for
> >> >>>>> inconsistencies between multiple connectors. This is because the IDF
> >> >>>>> is an agreed-upon exchange format when transferring data from one
> >> >>>>> connector to the other. That however shouldn't stop one connector
> >> >>>>> (such as HDFS) from offering ways to save the resulting CSV
> >> >>>>> differently.
> >> >>>>>
> >> >>>>> We had a similar discussion about separator and quote characters in
> >> >>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
> >> >>>>>
> >> >>>>> Jarcec
> >> >>>>>
> >> >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]
> >
> >> >>> wrote:
> >> >>>>>>
> >> >>>>>> I think it's two different things:
> >> >>>>>>
> >> >>>>>> 1. The HDFS connector should give more control over the formatting
> >> >>>>>> of the data in text files (nulls, escaping, etc.)
> >> >>>>>> 2. The IDF should give NULLs in a format that is optimized for
> >> >>>>>> MySQL/Postgres direct connectors (since that's one of the IDF
> >> >>>>>> design goals).
> >> >>>>>>
> >> >>>>>> Gwen
> >> >>>>>>
> >> >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <
> [email protected]>
> >> >>>>> wrote:
> >> >>>>>>> Hey guys,
> >> >>>>>>>
> >> >>>>>>> Any thoughts on where configurable NULL values should be? Either
> >> the
> >> >>>>> IDF or
> >> >>>>>>> HDFS connector?
> >> >>>>>>>
> >> >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
> >> >>>>>>>
> >> >>>>>>> -Abe
> >> >>>>>
> >> >>>>>
> >> >>>
> >>
> >>
>
