My understanding is that MySQL and PostgreSQL can output to CSV in the suggested format.
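To make the NULL question concrete, here is a minimal Python sketch. It is not Sqoop code: it assumes a simplified CSV text format (comma separator, single-quoted strings, backslash escaping), and the helpers encode_field and encode_row are hypothetical names made up for illustration. The only point is that the NULL token can be a parameter, so the same row can be written with a bare NULL or with \N depending on what the target loader expects.

```python
def encode_field(value, null_token="NULL"):
    """Render one field as text; None becomes the unquoted null_token.

    Assumption: strings are single-quoted with backslash escaping, so the
    string 'NULL' stays distinguishable from a real null.
    """
    if value is None:
        return null_token
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)


def encode_row(values, null_token="NULL"):
    """Join the encoded fields of one row with commas."""
    return ",".join(encode_field(v, null_token) for v in values)


row = [1, None, "abc"]
print(encode_row(row))                    # null written as the bare word NULL
print(encode_row(row, null_token="\\N"))  # null written as \N
```

If the token has to differ per target database, then something still has to rewrite the text, which is exactly the processing cost in question.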
NOTE: The getTextData() and setTextData() APIs are effectively useless if a reduced-processing load is not possible.

On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote:
> (hijacking the thread a bit for a related point)
>
> I have some misgivings around how we manage the IDF now.
>
> We go with a pretty specific CSV in order to avoid extra processing
> for MySQL/Postgres direct connectors.
> I think the intent is to allow running LOAD DATA without any processing.
> Therefore we need to research and document the specific formats
> required by MySQL and Postgres. Both DBs have pretty specific (and
> often funky) formatting they need (If escaping is not used then NULL
> is null, otherwise \N...)
>
> If zero-processing load is not feasible, I'd re-consider the IDF and
> lean toward a more structured format (Avro?). If the connectors need
> to parse the CSV and modify it, we are not gaining anything here. Or
> at the very least benchmark to validate that CSV+processing is still
> the fastest / least CPU option.
>
> Gwen
>
>
> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]> wrote:
> > Indeed. I created SQOOP-1678, which is intended to address #1. Let me
> > re-define it...
> >
> > Also, for #2... There are a few ways of generating output. It seems NULL
> > values range from "\N" to 0x0 to "NULL". I think keeping NULL makes sense.
> >
> > On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]> wrote:
> >
> >> I do share the same point of view as Gwen. The CSV format for IDF is very
> >> strict so that we have minimal surface area for inconsistencies between
> >> multiple connectors. This is because the IDF is an agreed upon exchange
> >> format when transferring data from one connector to the other. That however
> >> shouldn't stop one connector (such as HDFS) from offering ways to save the
> >> resulting CSV differently.
> >>
> >> We had similar discussion about separator and quote characters in
> >> SQOOP-1522 that seems to be relevant to the NULL discussion here.
> >>
> >> Jarcec
> >>
> >> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
> >> >
> >> > I think it's two different things:
> >> >
> >> > 1. HDFS connector should give more control over the formatting of the
> >> > data in text files (nulls, escaping, etc)
> >> > 2. IDF should give NULLs in a format that is optimized for
> >> > MySQL/Postgres direct connectors (since that's one of the IDF design
> >> > goals).
> >> >
> >> > Gwen
> >> >
> >> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> wrote:
> >> >> Hey guys,
> >> >>
> >> >> Any thoughts on where configurable NULL values should be? Either the
> >> >> IDF or HDFS connector?
> >> >>
> >> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678
> >> >>
> >> >> -Abe
