> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java,
> >  line 157
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line157>
> >
> >     In the case where they do not give it a schema, I think that they 
> > shouldn't have to actually specify that it is a map. It's implicit. 
> > Furthermore, I think that instead of passing the Schema as an argument, if 
> > they do an "as (schema)" THEN it should attempt to return a Schema in 
> > accordance with what you provide.
> >     
> >     Open to argument, though.
> 
> Taku Miyakawa wrote:
>     > In the case where they do not give it a schema, I think that they 
> shouldn't have to actually specify that it is a map.
>     
>     Yes ":map[]" is redundant. I will remove it.
>     
>     > if they do an "as (schema)" THEN it should attempt to return a Schema 
> in accordance with what you provide.
>     
>     Does this code meet your point?
>     
>     <code>
>     access = LOAD 'access.log' USING 
> org.apache.pig.piggybank.storage.LTSVLoader() AS (host:chararray, 
> ua:chararray);
>     ua = FOREACH access GENERATE ua;
>     dump ua;
>     </code>
>     
>     It will be good, but I could not find a way to implement the function in 
> that manner. LoadFunc seems to have no method which can get an input schema 
> specified by users.
> 
> Cheolsoo Park wrote:
>     >> It will be good, but I could not find a way to implement the function 
> in that manner. LoadFunc seems to have no method which can get an input 
> schema specified by users.
>     
>     You don't need access to the schema given by the AS clause inside 
> LoadFunc. You can just make LoadMetaData.getSchema() return null and load 
> every field as bytearray in LoadFunc.getNext(). Then, Pig inserts a FOREACH 
> operator after LOAD and does type conversion using the schema given by the AS 
> clause.
>     
>     This is what happens when PigStorage is not given a JSON schema file, but 
> the output schema is defined in the AS clause.
> 
> Jonathan Coveney wrote:
>     Cheolsoo is totally correct. That said, this should probably be 
> documented better. Also, the fact that we rely on an implicit foreach is so 
> gross... but this is how you do it!

I think the function is similar to JsonLoader, it also takes a schema as an 
argument.

An LTSV file itself contains field names in each row.  A column consists of a 
name and a value (<name>:<value>).  And users may want to select fields by 
their names.  That is why the loader function should know field names, and it 
takes a schema as an argument.

Let me explain in detail.

In an LTSV file, orders of columns may differ for each row (although it is not 
a common case).  And a row may lack some columns.  In this example, the second 
row lacks "ua" column, and the third row has a different order of columns, but 
it's OK.

{data}
host:pc.example.com     req:GET /index.html     ua:Opera/9.80
host:user.example.net   req:GET /favicon.ico
ua:Mozilla/5.0  req:GET /news.html      host:workstation.example.org
{/data}

To load "host" column and "ua" column into Pig fields, the loader function 
should know their names, because it cannot depend on positions of columns.

{code}
access = LOAD 'access.log' USING LTSVLoader('host, ua');
DUMP access;

-- Output:
-- (pc.example.com, Opera/9.80)
-- (user.example.net,)
-- (workstation.example.org, Mozilla/5.0)
{/code}


> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java,
> >  line 576
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line576>
> >
> >     Why doesn't this support pushProjection as well?

It was my sloth. I will implement it.


> On March 1, 2013, 2:54 p.m., Jonathan Coveney wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java,
> >  line 547
> > <https://reviews.apache.org/r/9685/diff/1/?file=263710#file263710line547>
> >
> >     It is probably useful to LOG.debug (or LOG.warn) once for each label 
> > that comes up that isn't used. Right now it would be silent, and people 
> > might want to know.

LOG.debug for the first occurrence of a unused column seems good. I will 
implement in that way.


- Taku


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9685/#review17239
-----------------------------------------------------------


On Feb. 28, 2013, 10:53 p.m., Taku Miyakawa wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9685/
> -----------------------------------------------------------
> 
> (Updated Feb. 28, 2013, 10:53 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> -------
> 
> This is a review board for  https://issues.apache.org/jira/browse/PIG-3215
> 
> The patch adds LTSVLoader function and its test class.
> 
> 
> This addresses bug PIG-3215.
>     https://issues.apache.org/jira/browse/PIG-3215
> 
> 
> Diffs
> -----
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
>  PRE-CREATION 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestLTSVLoader.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/9685/diff/
> 
> 
> Testing
> -------
> 
> ant compile-test
> cd contrib/piggybank/java/
> ant -Dtestcase=TestLTSVLoader test
> 
> 
> Thanks,
> 
> Taku Miyakawa
> 
>

Reply via email to