Jarcec,

It would be great if we were more metrics-driven, with some tests and/or benchmarks to prove how much faster this would be. Just a suggestion.
Gwen probably meant the same as well.

Best,
*./Vee*

On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]> wrote:

> Gwen,
> we’ve investigated mysqldump, pg_dump and a few others already, the results
> are on the wiki [1]. The resulting CSV-ish specification is following those
> two very closely.
>
> In MySQL case specifically, I’ve looked into mysqldump output rather than
> “LOAD DATA”/“SELECT INTO OUTFILE” statement because “LOAD DATA” requires
> the file to exist on the database machine whereas mysqldump/mysqlimport
> allows us to import data to the database from any machine on the Hadoop
> cluster.
>
> Jarcec
>
> Links:
> 1:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>
> > On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]> wrote:
> >
> > Agreed. I hope we'll have at least one direct connector real soon now
> > to prove it.
> >
> > Reading this:
> > http://dev.mysql.com/doc/refman/5.6/en/load-data.html
> > was a bit discouraging...
> >
> > On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]> wrote:
> >> My understanding is that MySQL and PostgreSQL can output to CSV in the
> >> suggested format.
> >>
> >> NOTE: getTextData() and setTextData() APIs are effectively useless if
> >> reduced processing load is not possible.
> >>
> >> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote:
> >>
> >>> (hijacking the thread a bit for a related point)
> >>>
> >>> I have some misgivings around how we manage the IDF now.
> >>>
> >>> We go with a pretty specific CSV in order to avoid extra-processing
> >>> for MySQL/Postgres direct connectors.
> >>> I think the intent is to allow running LOAD DATA without any processing.
> >>> Therefore we need to research and document the specific formats
> >>> required by MySQL and Postgres. Both DBs have pretty specific (and
> >>> often funky) formatting they need (If escaping is not used then NULL
> >>> is null, otherwise \N...)
> >>>
> >>> If zero-processing load is not feasible, I'd re-consider the IDF and
> >>> lean toward a more structured format (Avro?). If the connectors need
> >>> to parse the CSV and modify it, we are not gaining anything here. Or
> >>> at the very least benchmark to validate that CSV+processing is still
> >>> the fastest / least CPU option.
> >>>
> >>> Gwen
> >>>
> >>>
> >>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]>
> >>> wrote:
> >>>> Indeed. I created SQOOP-1678, which is intended to address #1. Let me
> >>>> re-define it...
> >>>>
> >>>> Also, for #2... There are a few ways of generating output. It seems NULL
> >>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes
> >>>> sense.
> >>>>
> >>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I do share the same point of view as Gwen. The CSV format for IDF is
> >>>>> very strict so that we have minimal surface area for inconsistencies
> >>>>> between multiple connectors. This is because the IDF is an agreed upon
> >>>>> exchange format when transferring data from one connector to the other.
> >>>>> That however shouldn't stop one connector (such as HDFS) from offering
> >>>>> ways to save the resulting CSV differently.
> >>>>>
> >>>>> We had a similar discussion about separator and quote characters in
> >>>>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
> >>>>>
> >>>>> Jarcec
> >>>>>
> >>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
> >>>>>>
> >>>>>> I think it's two different things:
> >>>>>>
> >>>>>> 1. HDFS connector should give more control over the formatting of the
> >>>>>> data in text files (nulls, escaping, etc)
> >>>>>> 2. IDF should give NULLs in a format that is optimized for
> >>>>>> MySQL/Postgres direct connectors (since that's one of the IDF design
> >>>>>> goals).
> >>>>>>
> >>>>>> Gwen
> >>>>>>
> >>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> wrote:
> >>>>>>> Hey guys,
> >>>>>>>
> >>>>>>> Any thoughts on where configurable NULL values should be? Either the
> >>>>>>> IDF or HDFS connector?
> >>>>>>>
> >>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
> >>>>>>>
> >>>>>>> -Abe
> >>>>>
> >>>>>
> >>>
> >
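
For illustration only, the NULL question discussed in this thread could be sketched roughly as below. This is not Sqoop code: `encode_field`, `encode_row`, and the `null_token` parameter are hypothetical names, and the quoting rules (single-quoted strings, backslash escaping) are an assumption about a CSV-ish format, not a statement of the IDF spec. The point is only that an unquoted NULL token stays distinguishable from a quoted string with the same text, whichever token ("NULL" or "\N") is picked.

```python
from typing import Any, Sequence


def encode_field(value: Any, null_token: str = "NULL") -> str:
    """Encode one value as CSV-ish text (hypothetical rules, see lead-in)."""
    if value is None:
        # NULL is written unquoted, so it cannot collide with a string value.
        return null_token
    if isinstance(value, str):
        # Assumed escaping: backslash first, then the single-quote delimiter.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)


def encode_row(values: Sequence[Any], null_token: str = "NULL") -> str:
    """Join encoded fields with commas; null_token is the configurable part."""
    return ",".join(encode_field(v, null_token) for v in values)


if __name__ == "__main__":
    row = [1, None, "NULL"]
    print(encode_row(row))                     # NULL token "NULL"
    print(encode_row(row, null_token="\\N"))   # NULL token "\N"
```

Whether such a token should be configurable in the IDF itself or only in the HDFS connector is exactly the open question above; the sketch takes no position on that.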
