> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java,
> > line 157
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line157>
> >
> > In the case where they do not give it a schema, I think that they
> > shouldn't have to actually specify that it is a map. It's implicit.
> > Furthermore, I think that instead of passing the Schema as an argument, if
> > they do an "as (schema)" THEN it should attempt to return a Schema in
> > accordance with what you provide.
> >
> > Open to argument, though.
>
> Taku Miyakawa wrote:
> > In the case where they do not give it a schema, I think that they
> shouldn't have to actually specify that it is a map.
>
> Yes ":map[]" is redundant. I will remove it.
>
> > if they do an "as (schema)" THEN it should attempt to return a Schema
> in accordance with what you provide.
>
> Does this code meet your point?
>
> <code>
> access = LOAD 'access.log' USING
> org.apache.pig.piggybank.storage.LTSVLoader() AS (host:chararray,
> ua:chararray);
> ua = FOREACH access GENERATE ua;
> dump ua;
> </code>
>
> It will be good, but I could not find a way to implement the function in
> that manner. LoadFunc seems to have no method which can get an input schema
> specified by users.
>
> Cheolsoo Park wrote:
> >> It will be good, but I could not find a way to implement the function
> in that manner. LoadFunc seems to have no method which can get an input
> schema specified by users.
>
> You don't need access to the schema given by the AS clause inside
> LoadFunc. You can just make LoadMetaData.getSchema() return null and load
> every field as bytearray in LoadFunc.getNext(). Then, Pig inserts a FOREACH
> operator after LOAD and does type conversion using the schema given by the AS
> clause.
>
> This is what happens when PigStorage is not given a JSON schema file, but
> the output schema is defined in the AS clause.
>
> Jonathan Coveney wrote:
> Cheolsoo is totally correct. That said, this should probably be
> documented better. Also, the fact that we rely on an implicit foreach is so
> gross... but this is how you do it!
I think the function is similar to JsonLoader, it also takes a schema as an
argument.
An LTSV file itself contains field names in each row. A column consists of a
name and a value (<name>:<value>). And users may want to select fields by
their names. That is why the loader function should know field names, and it
takes a schema as an argument.
Let me explain in detail.
In an LTSV file, orders of columns may differ for each row (although it is not
a common case). And a row may lack some columns. In this example, the second
row lacks "ua" column, and the third row has a different order of columns, but
it's OK.
{data}
host:pc.example.com req:GET /index.html ua:Opera/9.80
host:user.example.net req:GET /favicon.ico
ua:Mozilla/5.0 req:GET /news.html host:workstation.example.org
{/data}
To load "host" column and "ua" column into Pig fields, the loader function
should know their names, because it cannot depend on positions of columns.
{code}
access = LOAD 'access.log' USING LTSVLoader('host, ua');
DUMP access;
-- Output:
-- (pc.example.com, Opera/9.80)
-- (user.example.net,)
-- (workstation.example.org, Mozilla/5.0)
{/code}
> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java,
> > line 576
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line576>
> >
> > Why doesn't this support pushProjection as well?
It was my sloth. I will implement it.
> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java,
> > line 547
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line547>
> >
> > It is probably useful to LOG.debug (or LOG.warn) once for each label
> > that comes up that isn't used. Right now it would be silent, and people
> > might want to know.
LOG.debug for the first occurrence of a unused column seems good. I will
implement in that way.
- Taku
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9685/#review17239
-----------------------------------------------------------
On Feb. 28, 2013, 10:53 p.m., Taku Miyakawa wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9685/
> -----------------------------------------------------------
>
> (Updated Feb. 28, 2013, 10:53 p.m.)
>
>
> Review request for pig.
>
>
> Description
> -------
>
> This is a review board for https://issues.apache.org/jira/browse/PIG-3215
>
> The patch adds LTSVLoader function and its test class.
>
>
> This addresses bug PIG-3215.
> https://issues.apache.org/jira/browse/PIG-3215
>
>
> Diffs
> -----
>
>
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
> PRE-CREATION
>
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestLTSVLoader.java
> PRE-CREATION
>
> Diff: https://reviews.apache.org/r/9685/diff/
>
>
> Testing
> -------
>
> ant compile-test
> cd contrib/piggybank/java/
> ant -Dtestcase=TestLTSVLoader test
>
>
> Thanks,
>
> Taku Miyakawa
>
>