Sorry, I missed the wiki previously. It's very comprehensive, so I can relax now :)
On Mon, Dec 1, 2014 at 3:16 PM, Jarek Jarcec Cecho <[email protected]> wrote:
> Gwen,
> we've investigated mysqldump, pg_dump, and a few others already; the results
> are on the wiki [1]. The resulting CSV-ish specification follows those two
> very closely.
>
> In the MySQL case specifically, I've looked into mysqldump output rather than
> the "LOAD DATA"/"SELECT INTO OUTFILE" statements, because "LOAD DATA" requires
> the file to exist on the database machine, whereas mysqldump/mysqlimport
> allows us to import data into the database from any machine on the Hadoop
> cluster.
>
> Jarcec
>
> Links:
> 1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
>
>> On Dec 1, 2014, at 11:55 AM, Gwen Shapira <[email protected]> wrote:
>>
>> Agreed. I hope we'll have at least one direct connector real soon now
>> to prove it.
>>
>> Reading this:
>> http://dev.mysql.com/doc/refman/5.6/en/load-data.html
>> was a bit discouraging...
>>
>> On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]> wrote:
>>> My understanding is that MySQL and PostgreSQL can output to CSV in the
>>> suggested format.
>>>
>>> NOTE: the getTextData() and setTextData() APIs are effectively useless if
>>> reduced-processing load is not possible.
>>>
>>> On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote:
>>>
>>>> (hijacking the thread a bit for a related point)
>>>>
>>>> I have some misgivings about how we manage the IDF now.
>>>>
>>>> We went with a pretty specific CSV in order to avoid extra processing
>>>> for the MySQL/Postgres direct connectors. I think the intent is to allow
>>>> running LOAD DATA without any processing. Therefore we need to research
>>>> and document the specific formats required by MySQL and Postgres. Both
>>>> DBs have pretty specific (and often funky) formatting requirements (if
>>>> escaping is not used, NULL is written as the word NULL; otherwise as
>>>> \N...).
>>>>
>>>> If zero-processing load is not feasible, I'd reconsider the IDF and
>>>> lean toward a more structured format (Avro?). If the connectors need
>>>> to parse the CSV and modify it, we are not gaining anything here. Or,
>>>> at the very least, benchmark to validate that CSV+processing is still
>>>> the fastest / least-CPU option.
>>>>
>>>> Gwen
>>>>
>>>> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>> Indeed. I created SQOOP-1678 to address #1. Let me re-define it...
>>>>>
>>>>> Also, for #2... There are a few ways of generating output. It seems NULL
>>>>> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes
>>>>> sense.
>>>>>
>>>>> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>>>
>>>>>> I share Gwen's point of view. The CSV format for the IDF is very strict
>>>>>> so that we have a minimal surface area for inconsistencies between
>>>>>> multiple connectors. This is because the IDF is an agreed-upon exchange
>>>>>> format when transferring data from one connector to the other. That,
>>>>>> however, shouldn't stop one connector (such as HDFS) from offering ways
>>>>>> to save the resulting CSV differently.
>>>>>>
>>>>>> We had a similar discussion about separator and quote characters in
>>>>>> SQOOP-1522 that seems relevant to the NULL discussion here.
>>>>>>
>>>>>> Jarcec
>>>>>>
>>>>>>> On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
>>>>>>>
>>>>>>> I think it's two different things:
>>>>>>>
>>>>>>> 1. The HDFS connector should give more control over the formatting of
>>>>>>> the data in text files (nulls, escaping, etc.).
>>>>>>> 2. The IDF should give NULLs in a format that is optimized for the
>>>>>>> MySQL/Postgres direct connectors (since that's one of the IDF design
>>>>>>> goals).
>>>>>>>
>>>>>>> Gwen
>>>>>>>
>>>>>>> On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> Any thoughts on where configurable NULL values should live? In the
>>>>>>>> IDF or the HDFS connector?
>>>>>>>>
>>>>>>>> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>>>>>>>>
>>>>>>>> -Abe
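For readers of the archive, a minimal self-contained sketch of the NULL-representation divergence discussed in the thread. This is plain Java written for illustration only, not Sqoop2 or MySQL code; the NullStyle enum, the render() helper, and the sample record are all invented here. It shows why a direct connector that expects one convention (e.g. \N, as the thread describes for escaped mysqldump/LOAD DATA output) would have to parse and rewrite IDF text that uses another (the literal word NULL, or an empty field), which is exactly the extra processing pass Gwen is warning about.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only -- not Sqoop2 or MySQL code. Renders one record to CSV
// under the different NULL conventions mentioned in the thread ("\N", the
// literal word NULL, or an empty field).
public class NullStyleDemo {

    enum NullStyle { BACKSLASH_N, LITERAL_NULL, EMPTY }

    static String nullToken(NullStyle style) {
        switch (style) {
            case BACKSLASH_N:  return "\\N";  // escaped convention described in the thread
            case LITERAL_NULL: return "NULL"; // unescaped convention
            default:           return "";     // some tools simply leave the field empty
        }
    }

    static String render(List<String> record, NullStyle style) {
        return record.stream()
                .map(f -> f == null ? nullToken(style) : "'" + f + "'")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> record = Arrays.asList("1", null, "2014-12-01");
        for (NullStyle style : NullStyle.values()) {
            System.out.println(style + " -> " + render(record, style));
        }
    }
}

If the IDF standardizes on one token while the target database tool expects another, the direct connector ends up doing roughly this kind of field-by-field rewrite on every record, which is the cost the thread weighs against switching to a more structured format such as Avro.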
