Re: Configurable NULL in IDF or Connector?

Abraham Elmahrek Mon, 01 Dec 2014 11:53:43 -0800

My understanding is that MySQL and PostgreSQL can output to CSV in the
suggested format.


NOTE: The getTextData() and setTextData() APIs are effectively useless if a
reduced-processing load is not possible.
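
For illustration, the NULL conventions mentioned in this thread ("NULL" vs.
"\N") can be sketched as below. This is only a sketch under assumptions:
encode_field and encode_row are hypothetical helpers, not Sqoop IDF APIs, and
the single-quote string quoting is assumed here so that a literal 'NULL'
string stays distinguishable from a null value.

```python
# Hypothetical helpers for illustration only; not part of Sqoop's IDF.

def encode_field(value, null_token="NULL"):
    """Render one field as CSV text, using null_token for None."""
    if value is None:
        # Nulls are emitted unquoted so they cannot collide with strings.
        return null_token
    if isinstance(value, str):
        # Assumed convention: escape backslashes and quotes, then wrap in
        # single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)


def encode_row(row, null_token="NULL"):
    """Join the encoded fields of a row with commas."""
    return ",".join(encode_field(v, null_token) for v in row)


if __name__ == "__main__":
    print(encode_row([1, None, "NULL"]))         # 1,NULL,'NULL'
    print(encode_row([1, None], null_token=r"\N"))  # 1,\N
```

If the null token is a parameter like this, the open question in the thread
becomes who sets it: the IDF for every connector, or a single connector such
as HDFS when it writes its output.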

On Mon, Dec 1, 2014 at 11:42 AM, Gwen Shapira <[email protected]> wrote:

> (hijacking the thread a bit for a related point)
>
> I have some misgivings around how we manage the IDF now.
>
> We go with a pretty specific CSV in order to avoid extra-processing
> for MySQL/Postgres direct connectors.
> I think the intent is to allow running LOAD DATA without any processing.
> Therefore we need to research and document the specific formats
> required by MySQL and Postgres. Both DBs have pretty specific (and
> often funky) formatting they need (If escaping is not used then NULL
> is null, otherwise \N...)
>
> If zero-processing load is not feasible, I'd re-consider the IDF and
> lean toward a more structured format (Avro?).  If the connectors need
> to parse the CSV and modify it, we are not gaining anything here. Or
> at the very least benchmark to validate that CSV+processing is still
> the fastest / least CPU option.
>
> Gwen
>
>
> On Mon, Dec 1, 2014 at 11:26 AM, Abraham Elmahrek <[email protected]>
> wrote:
> > Indeed. I created SQOOP-1678, which is intended to address #1. Let me
> > re-define it...
> >
> > Also, for #2... There are a few ways of generating output. It seems NULL
> > values range from "\N" to 0x0 to "NULL". I think keeping NULL makes sense.
> >
> > On Mon, Dec 1, 2014 at 10:58 AM, Jarek Jarcec Cecho <[email protected]>
> > wrote:
> >
> >> I do share the same point of view as Gwen. The CSV format for the IDF is
> >> very strict so that we have minimal surface area for inconsistencies
> >> between multiple connectors. This is because the IDF is an agreed-upon
> >> exchange format when transferring data from one connector to the other.
> >> That, however, shouldn't stop a connector (such as HDFS) from offering
> >> ways to save the resulting CSV differently.
> >>
> >> We had a similar discussion about separator and quote characters in
> >> SQOOP-1522 that seems relevant to the NULL discussion here.
> >>
> >> Jarcec
> >>
> >> > On Dec 1, 2014, at 10:42 AM, Gwen Shapira <[email protected]> wrote:
> >> >
> >> > I think it's two different things:
> >> >
> >> > 1. HDFS connector should give more control over the formatting of the
> >> > data in text files (nulls, escaping, etc)
> >> > 2. IDF should give NULLs in a format that is optimized for
> >> > MySQL/Postgres direct connectors (since that's one of the IDF design
> >> > goals).
> >> >
> >> > Gwen
> >> >
> >> > On Mon, Dec 1, 2014 at 9:52 AM, Abraham Elmahrek <[email protected]> wrote:
> >> >> Hey guys,
> >> >>
> >> >> Any thoughts on where configurable NULL values should be? Either the
> >> >> IDF or HDFS connector?
> >> >>
> >> >> cf: https://issues.apache.org/jira/browse/SQOOP-1678
> >> >>
> >> >> -Abe
> >>
> >>
>
