Agreed. I hope we'll have at least one direct connector real soon now to prove it.
Reading this: http://dev.mysql.com/doc/refman/5.6/en/load-data.html was a bit discouraging... On Mon, Dec 1, 2014 at 11:50 AM, Abraham Elmahrek <[email protected]> wrote: > My understanding is that MySQL and PostgreSQL can output to CSV in the > suggested format. > > NOTE: getTextData() and setTextData() APIs are effectively useless if > reduced processing load is not possible. > > On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote: > >> (hijacking the thread a bit for a related point) >> >> I have some misgivings around how we manage the IDF now. >> >> We go with a pretty specific CSV in order to avoid extra-processing >> for MySQL/Postgres direct connectors. >> I think the intent is to allow running LOAD DATA without any processing. >> Therefore we need to research and document the specific formats >> required by MySQL and Postgres. Both DBs have pretty specific (and >> often funky) formatting they need (If escaping is not used then NULL >> is null, otherwise \N...) >> >> If zero-processing load is not feasible, I'd re-consider the IDF and >> lean toward a more structured format (Avro?). If the connectors need >> to parse the CSV and modify it, we are not gaining anything here. Or >> at the very least benchmark to validate that CSV+processing is still >> the fastest / least CPU option. >> >> Gwen >> >> >> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]> >> wrote: >> > Indeed. I created SQOOP-1678 is intended to address #1. Let me re-define >> > it... >> > >> > Also, for #2... There are a few ways of generating output. It seems NULL >> > values range from "\N" to 0x0 to "NULL". I think keeping NULL makes >> sense. >> > >> > On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]> >> > wrote: >> > >> >> I do share the same point of view as Gwen. The CSV format for UDF is >> very >> >> strict so that we have minimal surface area for inconsistencies between >> >> multiple connectors. This is because the IDF is an agreed upon exchange >> >> format when transferring data from one connector to the other. That >> however >> >> shouldn't stop one connector (such as HDFS) to offer ways to save the >> >> resulting CSV differently. >> >> >> >> We had similar discussion about separator and quote characters in >> >> SQOOP-1522 that seems to be relevant to the NULL discussion here. >> >> >> >> Jarcec >> >> >> >> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> >> wrote: >> >> > >> >> > I think its two different things: >> >> > >> >> > 1. HDFS connector should give more control over the formatting of the >> >> > data in text files (nulls, escaping, etc) >> >> > 2. IDF should give NULLs in a format that is optimized for >> >> > MySQL/Postgres direct connectors (since thats one of the IDF design >> >> > goals). >> >> > >> >> > Gwen >> >> > >> >> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> >> >> wrote: >> >> >> Hey guys, >> >> >> >> >> >> Any thoughts on where configurable NULL values should be? Either the >> >> IDF or >> >> >> HDFS connector? >> >> >> >> >> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678 >> >> >> >> >> >> -Abe >> >> >> >> >>
