(Hijacking the thread a bit for a related point.) I have some misgivings about how we manage the IDF now.

We use a very specific CSV format in order to avoid extra processing in the MySQL/Postgres direct connectors. I think the intent is to allow running LOAD DATA without any processing.

If so, we need to research and document the exact formats required by MySQL and Postgres. Both DBs have pretty specific (and often funky) formatting requirements (if escaping is not used, NULL is the literal word NULL, otherwise \N...).

If a zero-processing load is not feasible, I'd reconsider the IDF and lean toward a more structured format (Avro?). If the connectors need to parse the CSV and modify it, we are not gaining anything here. At the very least, we should benchmark to validate that CSV+processing is still the fastest / least-CPU option.

Gwen

On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]> wrote:
> Indeed. I created SQOOP-1678 is intended to address #1. Let me re-define
> it...
>
> Also, for #2... There are a few ways of generating output. It seems NULL
> values range from "\N" to 0x0 to "NULL". I think keeping NULL makes sense.
>
> On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]>
> wrote:
>
>> I do share the same point of view as Gwen. The CSV format for UDF is very
>> strict so that we have minimal surface area for inconsistencies between
>> multiple connectors. This is because the IDF is an agreed upon exchange
>> format when transferring data from one connector to the other. That however
>> shouldn't stop one connector (such as HDFS) to offer ways to save the
>> resulting CSV differently.
>>
>> We had similar discussion about separator and quote characters in
>> SQOOP-1522 that seems to be relevant to the NULL discussion here.
>>
>> Jarcec
>>
>> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
>> >
>> > I think its two different things:
>> >
>> > 1. HDFS connector should give more control over the formatting of the
>> > data in text files (nulls, escaping, etc)
>> > 2. IDF should give NULLs in a format that is optimized for
>> > MySQL/Postgres direct connectors (since thats one of the IDF design
>> > goals).
>> >
>> > Gwen
>> >
>> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]>
>> wrote:
>> >> Hey guys,
>> >>
>> >> Any thoughts on where configurable NULL values should be? Either the
>> IDF or
>> >> HDFS connector?
>> >>
>> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678
>> >>
>> >> -Abe
>> >>
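
P.S. To make the NULL point concrete, here is a rough Python sketch of why one fixed CSV may not load into every target with zero processing. This is not Sqoop code: NULL_MARKERS, encode_field and encode_row are names I made up for illustration. The markers are the documented defaults as I understand them for MySQL LOAD DATA and Postgres COPY, and the sketch assumes values contain no separators, quotes, or backslashes, so it skips escaping entirely.

```python
# Hypothetical helper names; only the NULL markers reflect documented defaults.
NULL_MARKERS = {
    "mysql_escaped": r"\N",     # LOAD DATA with FIELDS ESCAPED BY '\\' (the default)
    "mysql_unescaped": "NULL",  # LOAD DATA with FIELDS ESCAPED BY ''
    "postgres_text": r"\N",     # COPY in text format, default NULL string
    "postgres_csv": "",         # COPY in CSV format, default NULL is an unquoted empty string
}


def encode_field(value, target):
    """Render one field, mapping None to the NULL marker for the given target."""
    if value is None:
        return NULL_MARKERS[target]
    # Assumption: no escaping or quoting is needed for the value itself.
    return str(value)


def encode_row(row, target, sep=","):
    """Join the encoded fields of a row with the given separator."""
    return sep.join(encode_field(v, target) for v in row)


if __name__ == "__main__":
    row = [1, None, "abc"]
    for target in NULL_MARKERS:
        # Postgres text format uses a tab delimiter by default.
        sep = "\t" if target == "postgres_text" else ","
        print(target, repr(encode_row(row, target, sep=sep)))
```

The same row needs a different NULL representation depending on the target and its escaping settings, so whichever one the IDF picks, at least some load paths would either need a matching option on the LOAD DATA / COPY side or a rewrite pass over the data.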
