100% agree on the trade-offs. That's why I said "We'll need a good reason" — basically, connectors or features that will leverage Avro's benefits.
Will add my notes to the wiki once I find my CSV vs. Avro numbers.

On Mon, Dec 1, 2014 at 3:37 PM, Veena Basavaraj <[email protected]> wrote:
> Ah, please do add these details as a comment to the wiki, Gwen. I am glad
> we discussed this.
>
> Also, the data set size, the structure of the data set (nested types or
> not), the environment (machines), and various other things matter when we
> say writing CSV is 20% faster than Avro. But it is a tradeoff one should
> be able to make, choosing what works best for them. In some cases, one
> might be willing to trade some speed for a structured format such as
> Avro; it might also be the case that the destination (the TO) expects the
> data to be written in Avro.
>
> Best,
> *./Vee*
>
> On Mon, Dec 1, 2014 at 3:31 PM, Gwen Shapira <[email protected]> wrote:
>
>> Performance numbers would be sweet at some point for sure.
>> Based on some rough tests we did in the field (on another project),
>> Avro serialization does have significant overhead (I think Hive writing
>> CSV was 20% faster than writing Avro; I can dig up my results later).
>> It may be even worse for Sqoop, since Hive does serialization in
>> batches.
>>
>> This is not completely scientific, but it leads me to believe that as
>> much as I love Avro, we'll need a good reason to use it internally.
>>
>> On Mon, Dec 1, 2014 at 3:19 PM, Veena Basavaraj <[email protected]>
>> wrote:
>>> Jarcec,
>>>
>>> If we were more metrics-driven, with some tests and/or benchmarks to
>>> prove how much faster this would be, it would have been great. Just a
>>> suggestion.
>>>
>>> Gwen probably meant the same as well.
>>>
>>> Best,
>>> *./Vee*
>>>
>>> On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]>
>>> wrote:
>>>
>>>> Gwen,
>>>> we’ve investigated mysqldump, pg_dump, and a few others already; the
>>>> results are on the wiki [1]. The resulting CSV-ish specification
>>>> follows those two very closely.
>>>>
>>>> In the MySQL case specifically, I’ve looked into mysqldump output
>>>> rather than the “LOAD DATA”/“SELECT INTO OUTFILE” statements, because
>>>> “LOAD DATA” requires the file to exist on the database machine,
>>>> whereas mysqldump/mysqlimport allows us to import data into the
>>>> database from any machine on the Hadoop cluster.
>>>>
>>>> Jarcec
>>>>
>>>> Links:
>>>> 1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>>>>
>>>>> On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]> wrote:
>>>>>
>>>>> Agreed. I hope we'll have at least one direct connector real soon
>>>>> now to prove it.
>>>>>
>>>>> Reading this:
>>>>> http://dev.mysql.com/doc/refman/5.6/en/load-data.html
>>>>> was a bit discouraging...
>>>>>
>>>>> On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]>
>>>>> wrote:
>>>>>> My understanding is that MySQL and PostgreSQL can output CSV in the
>>>>>> suggested format.
>>>>>>
>>>>>> NOTE: the getTextData() and setTextData() APIs are effectively
>>>>>> useless if reduced-processing load is not possible.
>>>>>>
>>>>>> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> (hijacking the thread a bit for a related point)
>>>>>>>
>>>>>>> I have some misgivings about how we manage the IDF now.
>>>>>>>
>>>>>>> We go with a pretty specific CSV in order to avoid extra
>>>>>>> processing for the MySQL/Postgres direct connectors.
>>>>>>> I think the intent is to allow running LOAD DATA without any
>>>>>>> processing. Therefore we need to research and document the
>>>>>>> specific formats required by MySQL and Postgres. Both DBs have
>>>>>>> pretty specific (and often funky) formatting they need (if
>>>>>>> escaping is not used, then "NULL" is null; otherwise \N...).
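[Editor's note: the "funky formatting" point above can be made concrete. This is a minimal sketch in plain Python with hypothetical helper names — not Sqoop's actual IDF code — of what a writer targeting MySQL's default LOAD DATA text format has to do with NULLs and escaping:]

```python
def to_mysql_field(value):
    """Render one field for MySQL's default LOAD DATA text format.

    MySQL's default format represents NULL as the two-character
    sequence \\N, and escapes backslash, tab (the default field
    terminator), and newline (the default line terminator) with a
    backslash. Simplified sketch only; real connectors must honor the
    session's FIELDS/LINES clauses.
    """
    if value is None:
        return r"\N"
    s = str(value)
    # Escape backslash first so later escapes aren't double-escaped.
    for ch, esc in (("\\", "\\\\"), ("\t", "\\t"), ("\n", "\\n")):
        s = s.replace(ch, esc)
    return s

def to_mysql_row(row):
    # Tab-separated, matching LOAD DATA's default field terminator.
    return "\t".join(to_mysql_field(v) for v in row)

print(to_mysql_row([1, None, "a\tb"]))
```

The subtlety the thread is circling: without an escape character, MySQL treats the literal string NULL as null; with escaping enabled, only \N is null — so the IDF's choice of NULL token directly decides whether zero-processing load is possible.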
>>>>>>>
>>>>>>> If zero-processing load is not feasible, I'd reconsider the IDF
>>>>>>> and lean toward a more structured format (Avro?). If the
>>>>>>> connectors need to parse the CSV and modify it, we are not gaining
>>>>>>> anything here. Or, at the very least, benchmark to validate that
>>>>>>> CSV+processing is still the fastest / least-CPU option.
>>>>>>>
>>>>>>> Gwen
>>>>>>>
>>>>>>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]>
>>>>>>> wrote:
>>>>>>>> Indeed. I created SQOOP-1678, which is intended to address #1.
>>>>>>>> Let me re-define it...
>>>>>>>>
>>>>>>>> Also, for #2... There are a few ways of generating output. It
>>>>>>>> seems NULL values range from "\N" to 0x0 to "NULL". I think
>>>>>>>> keeping NULL makes sense.
>>>>>>>>
>>>>>>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I share the same point of view as Gwen. The CSV format for the
>>>>>>>>> IDF is very strict so that we have minimal surface area for
>>>>>>>>> inconsistencies between multiple connectors. This is because the
>>>>>>>>> IDF is an agreed-upon exchange format when transferring data
>>>>>>>>> from one connector to the other. That, however, shouldn't stop
>>>>>>>>> one connector (such as HDFS) from offering ways to save the
>>>>>>>>> resulting CSV differently.
>>>>>>>>>
>>>>>>>>> We had a similar discussion about separator and quote characters
>>>>>>>>> in SQOOP-1522 that seems relevant to the NULL discussion here.
>>>>>>>>>
>>>>>>>>> Jarcec
>>>>>>>>>
>>>>>>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I think it's two different things:
>>>>>>>>>>
>>>>>>>>>> 1. The HDFS connector should give more control over the
>>>>>>>>>> formatting of the data in text files (nulls, escaping, etc.).
>>>>>>>>>> 2. The IDF should give NULLs in a format that is optimized for
>>>>>>>>>> the MySQL/Postgres direct connectors (since that's one of the
>>>>>>>>>> IDF design goals).
>>>>>>>>>>
>>>>>>>>>> Gwen
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> Hey guys,
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on where configurable NULL values should be?
>>>>>>>>>>> Either the IDF or the HDFS connector?
>>>>>>>>>>>
>>>>>>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>>>>>>>>>>>
>>>>>>>>>>> -Abe
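
[Editor's note: the thread repeatedly asks for a serialization benchmark to back the "CSV is ~20% faster than Avro" claim. A rough sketch of the shape such a benchmark could take, stdlib only — json stands in for Avro here since an Avro library isn't assumed to be installed, so the absolute numbers are meaningless and only the methodology transfers:]

```python
# Sketch: serialize the same rows in two text formats and compare wall time.
# Hypothetical data; a real Sqoop benchmark would use the IDF's own types,
# larger data sets, multiple runs, and the actual Avro encoder.
import csv
import io
import json
import time

rows = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(10_000)]
fields = ["id", "name", "score"]

def bench(fn):
    """Return wall-clock seconds for one call of fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def write_csv():
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writerows(rows)           # no header: mimics a raw data file
    return buf.getvalue()

def write_json():
    buf = io.StringIO()
    for row in rows:                 # one record per line, schema'd stand-in
        buf.write(json.dumps(row))
        buf.write("\n")
    return buf.getvalue()

csv_s, json_s = bench(write_csv), bench(write_json)
print(f"csv: {csv_s:.4f}s  json-as-stand-in: {json_s:.4f}s")
```

As Veena notes above, data set size, nesting, and hardware all move the result, so any single percentage should be reported with its setup attached.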
