> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 205
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line205>
> >
> >     please clean up whitespace :)
Oops, sorry. I'll clean that up.


> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 202
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line202>
> >
> >     Do we care about the specifics of how this output is written?
> >
> >     Seems like it would be less code, and potentially better in the long run (if we are dealing with other kinds of splits) to just call toString() on the InputSplit. FileSplit already defines its own toString(), which prints out the path, the start offset, and the length.
>
> Ashutosh Chauhan wrote:
>     I agree with Dmitriy. If possible, we should avoid special-casing a particular type of InputSplit. Further, InputSplit provides getLocations() and getLength() APIs, which should be used instead of the FileSplit-specific API.

So it seems the options are to either:

1. Use the input split's toString() method.
2. Use just getLocations() and getLength(), which are part of the InputSplit API.

I'm leaning towards toString(), because it will contain useful information for the common case of FileSplit that getLocations() won't have, namely the file offset and the file name. If this is the common consensus, I'll submit a patch with that update. Let me know.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review452
-----------------------------------------------------------


On 2011-05-19 16:27:22, Adam Warrington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/547/
> -----------------------------------------------------------
> 
> (Updated 2011-05-19 16:27:22)
> 
> 
> Review request for pig.
> 
> 
> Summary
> -------
> 
> This is a patch for PIG-1702, which describes an issue where the task output logs for Pig streaming jobs contain null input-split information. The ability to query the input-split information through the JobConf went away with the new MR API. We must now obtain a reference to the underlying FileSplit and query it for that information.
> 
> 
> Diffs
> -----
> 
>   trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 1088692
> 
> Diff: https://reviews.apache.org/r/547/diff
> 
> 
> Testing
> -------
> 
> To test this, I wrote a very simple Python script to pass data through using Pig. After checking the task logs of the completed task, the stderr logs now contain valid input-split information. Below are the scripts and test data used.
> 
> ### PIG commands run ###
> DEFINE testpy `test.py` SHIP ('test.py');
> raw_records = LOAD '/test.txt2';
> T1 = STREAM raw_records THROUGH testpy;
> dump T1;
> 
> ### test.py ###
> #!/usr/bin/python
> import sys
> 
> cnt = 0
> for line in sys.stdin:
>     print line.strip() + " " + str(cnt)
>     cnt += 1
> 
> ### contents of /test.txt on hdfs ###
> one line
> two line
> three line
> four line
> 
> 
> Thanks,
> 
> Adam
> 
>
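
For reference, a minimal sketch of the two options being weighed above (the class and method names below are hypothetical and assume the new-API org.apache.hadoop.mapreduce.InputSplit; this is not the actual PIG-1702 patch):

### Sketch of the two logging options (hypothetical Java) ###
import java.io.IOException;
import java.io.Writer;
import java.util.Arrays;

import org.apache.hadoop.mapreduce.InputSplit;

public class InputSplitLogging {

    // Option 1: rely on the split's own toString(). FileSplit's toString()
    // already reports the path, the start offset, and the length.
    static void writeSplitViaToString(Writer out, InputSplit split)
            throws IOException {
        out.write("Input-split: " + split + "\n");
    }

    // Option 2: use only the generic InputSplit API (getLength()/getLocations()),
    // which avoids special-casing FileSplit but loses the path and offset.
    static void writeSplitViaGenericApi(Writer out, InputSplit split)
            throws IOException, InterruptedException {
        out.write("Input-split length: " + split.getLength() + "\n");
        out.write("Input-split locations: " + Arrays.toString(split.getLocations()) + "\n");
    }
}

The trade-off, as noted in the thread, is that option 2 never reports the file path or offset, which is exactly the information that was coming out null in the task logs.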
