100% agree on the trade-offs. That's why I said "We'll need a good reason" — basically, connectors or features that will leverage Avro's benefits.
Will add my notes to the wiki once I find my CSV vs. Avro numbers.

On Mon, Dec 1, 2014 at 3:37 PM, Veena Basavaraj <[email protected]> wrote:
> Ah, please do add these details as a comment to the wiki, Gwen. I am glad
> we discussed this.
>
> Also, the data set size, the structure of the data set (nested types or
> not), the environment (machines), and various other things matter when we
> say writing CSV is 20% faster than Avro. But it is a tradeoff one should
> be able to make, choosing what works best for them. In some cases, one
> might be willing to trade some speed for a structured format such as
> Avro; it might also be the case that the destination (the TO) expects the
> data to be written in Avro.
>
> Best,
> *./Vee*
>
> On Mon, Dec 1, 2014 at 3:31 PM, Gwen Shapira <[email protected]> wrote:
>
>> Performance numbers would be sweet at some point for sure.
>> Based on some rough tests we did in the field (on another project),
>> Avro serialization does have significant overhead (I think Hive writing
>> CSV was 20% faster than writing Avro; I can dig up my results later).
>> It may be even worse for Sqoop, since Hive does serialization in
>> batches.
>>
>> This is not completely scientific, but it leads me to believe that as
>> much as I love Avro, we'll need a good reason to use it internally.
>>
>> On Mon, Dec 1, 2014 at 3:19 PM, Veena Basavaraj <[email protected]>
>> wrote:
>>> Jarcec,
>>>
>>> If we were more metrics-driven, with some tests and/or benchmarks to
>>> prove how much faster this would be, it would have been great. Just a
>>> suggestion.
>>>
>>> Gwen probably meant the same as well.
>>>
>>> Best,
>>> *./Vee*
>>>
>>> On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]>
>>> wrote:
>>>
>>>> Gwen,
>>>> we’ve investigated mysqldump, pg_dump, and a few others already; the
>>>> results are on the wiki [1]. The resulting CSV-ish specification
>>>> follows those two very closely.
>>>>
>>>> In the MySQL case specifically, I’ve looked into mysqldump output
>>>> rather than the “LOAD DATA”/“SELECT INTO OUTFILE” statements, because
>>>> “LOAD DATA” requires the file to exist on the database machine,
>>>> whereas mysqldump/mysqlimport allows us to import data into the
>>>> database from any machine on the Hadoop cluster.
>>>>
>>>> Jarcec
>>>>
>>>> Links:
>>>> 1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>>>>
>>>>> On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]> wrote:
>>>>>
>>>>> Agreed. I hope we'll have at least one direct connector real soon
>>>>> now to prove it.
>>>>>
>>>>> Reading this:
>>>>> http://dev.mysql.com/doc/refman/5.6/en/load-data.html
>>>>> was a bit discouraging...
>>>>>
>>>>> On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]>
>>>>> wrote:
>>>>>> My understanding is that MySQL and PostgreSQL can output CSV in the
>>>>>> suggested format.
>>>>>>
>>>>>> NOTE: the getTextData() and setTextData() APIs are effectively
>>>>>> useless if reduced-processing load is not possible.
>>>>>>
>>>>>> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> (hijacking the thread a bit for a related point)
>>>>>>>
>>>>>>> I have some misgivings about how we manage the IDF now.
>>>>>>>
>>>>>>> We go with a pretty specific CSV in order to avoid extra
>>>>>>> processing for the MySQL/Postgres direct connectors.
>>>>>>> I think the intent is to allow running LOAD DATA without any
>>>>>>> processing. Therefore we need to research and document the
>>>>>>> specific formats required by MySQL and Postgres. Both DBs have
>>>>>>> pretty specific (and often funky) formatting they need (if
>>>>>>> escaping is not used, then "NULL" is null; otherwise \N...).
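[Editor's note: the "funky formatting" point above can be made concrete. This is a minimal sketch in plain Python with hypothetical helper names — not Sqoop's actual IDF code — of what a writer targeting MySQL's default LOAD DATA text format has to do with NULLs and escaping:]

```python
def to_mysql_field(value):
    """Render one field for MySQL's default LOAD DATA text format.

    MySQL's default format represents NULL as the two-character
    sequence \\N, and escapes backslash, tab (the default field
    terminator), and newline (the default line terminator) with a
    backslash. Simplified sketch only; real connectors must honor the
    session's FIELDS/LINES clauses.
    """
    if value is None:
        return r"\N"
    s = str(value)
    # Escape backslash first so later escapes aren't double-escaped.
    for ch, esc in (("\\", "\\\\"), ("\t", "\\t"), ("\n", "\\n")):
        s = s.replace(ch, esc)
    return s

def to_mysql_row(row):
    # Tab-separated, matching LOAD DATA's default field terminator.
    return "\t".join(to_mysql_field(v) for v in row)

print(to_mysql_row([1, None, "a\tb"]))
```

The subtlety the thread is circling: without an escape character, MySQL treats the literal string NULL as null; with escaping enabled, only \N is null — so the IDF's choice of NULL token directly decides whether zero-processing load is possible.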
>>>>>>>
>>>>>>> If zero-processing load is not feasible, I'd reconsider the IDF
>>>>>>> and lean toward a more structured format (Avro?). If the
>>>>>>> connectors need to parse the CSV and modify it, we are not gaining
>>>>>>> anything here. Or, at the very least, benchmark to validate that
>>>>>>> CSV+processing is still the fastest / least-CPU option.
>>>>>>>
>>>>>>> Gwen
>>>>>>>
>>>>>>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]>
>>>>>>> wrote:
>>>>>>>> Indeed. I created SQOOP-1678, which is intended to address #1.
>>>>>>>> Let me re-define it...
>>>>>>>>
>>>>>>>> Also, for #2... There are a few ways of generating output. It
>>>>>>>> seems NULL values range from "\N" to 0x0 to "NULL". I think
>>>>>>>> keeping NULL makes sense.
>>>>>>>>
>>>>>>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I share the same point of view as Gwen. The CSV format for the
>>>>>>>>> IDF is very strict so that we have minimal surface area for
>>>>>>>>> inconsistencies between multiple connectors. This is because the
>>>>>>>>> IDF is an agreed-upon exchange format when transferring data
>>>>>>>>> from one connector to the other. That, however, shouldn't stop
>>>>>>>>> one connector (such as HDFS) from offering ways to save the
>>>>>>>>> resulting CSV differently.
>>>>>>>>>
>>>>>>>>> We had a similar discussion about separator and quote characters
>>>>>>>>> in SQOOP-1522 that seems relevant to the NULL discussion here.
>>>>>>>>>
>>>>>>>>> Jarcec
>>>>>>>>>
>>>>>>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I think it's two different things:
>>>>>>>>>>
>>>>>>>>>> 1. The HDFS connector should give more control over the
>>>>>>>>>> formatting of the data in text files (nulls, escaping, etc.).
>>>>>>>>>> 2. The IDF should give NULLs in a format that is optimized for
>>>>>>>>>> the MySQL/Postgres direct connectors (since that's one of the
>>>>>>>>>> IDF design goals).
>>>>>>>>>>
>>>>>>>>>> Gwen
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> Hey guys,
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on where configurable NULL values should be?
>>>>>>>>>>> Either the IDF or the HDFS connector?
>>>>>>>>>>>
>>>>>>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>>>>>>>>>>>
>>>>>>>>>>> -Abe
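
[Editor's note: the thread repeatedly asks for a serialization benchmark to back the "CSV is ~20% faster than Avro" claim. A rough sketch of the shape such a benchmark could take, stdlib only — json stands in for Avro here since an Avro library isn't assumed to be installed, so the absolute numbers are meaningless and only the methodology transfers:]

```python
# Sketch: serialize the same rows in two text formats and compare wall time.
# Hypothetical data; a real Sqoop benchmark would use the IDF's own types,
# larger data sets, multiple runs, and the actual Avro encoder.
import csv
import io
import json
import time

rows = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(10_000)]
fields = ["id", "name", "score"]

def bench(fn):
    """Return wall-clock seconds for one call of fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def write_csv():
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writerows(rows)           # no header: mimics a raw data file
    return buf.getvalue()

def write_json():
    buf = io.StringIO()
    for row in rows:                 # one record per line, schema'd stand-in
        buf.write(json.dumps(row))
        buf.write("\n")
    return buf.getvalue()

csv_s, json_s = bench(write_csv), bench(write_json)
print(f"csv: {csv_s:.4f}s  json-as-stand-in: {json_s:.4f}s")
```

As Veena notes above, data set size, nesting, and hardware all move the result, so any single percentage should be reported with its setup attached.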
