[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Pradeep Kamath (JIRA) Mon, 06 Jul 2009 16:41:41 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727834#action_12727834
 ]


Pradeep Kamath commented on PIG-820:
------------------------------------

Review comments - two observations:

1. In PigStorage, the skip() implementation should do an extra skip(1) when 
the byte read at offset n-1 is not -1. That is, if the stream is not at EOF 
after skipping n-1 bytes, one more byte should be skipped so that n bytes are 
skipped in all.
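
A minimal sketch of the behaviour asked for in point 1, written against a 
plain java.io.InputStream rather than the actual PigStorage code. The class 
and method names (SkipSketch, skipBytes) are hypothetical. In this sketch the 
probing read() consumes the n-th byte itself, so it plays the role of the 
extra skip(1):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Pig: skips n bytes in total unless EOF is hit.
final class SkipSketch {

    /** Returns the number of bytes actually skipped (less than n only at EOF). */
    static long skipBytes(InputStream in, long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        long skipped = 0;
        // First skip n-1 bytes. InputStream.skip() may skip fewer bytes than
        // requested, so keep going until the count is reached or EOF is seen.
        while (skipped < n - 1) {
            long s = in.skip(n - 1 - skipped);
            if (s <= 0) {
                // No progress: probe with read() to tell EOF from a short skip.
                if (in.read() == -1) {
                    return skipped;
                }
                s = 1; // the probe consumed one byte
            }
            skipped += s;
        }
        // Probe for EOF. A successful read consumes the n-th byte, so that
        // n bytes are skipped in all.
        if (in.read() != -1) {
            skipped++;
        }
        return skipped;
    }
}
```

Calling SkipSketch.skipBytes(in, 4) on a stream with at least four bytes left 
leaves the stream positioned at the fifth byte. On a shorter stream it returns 
the number of bytes that were available.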

2. The comment for getSampledTuple() contains:
{code}
    /**
     * Get the next sampled tuple from the stream.
     * Those loaders which can appropriately return the next tuple after
     * skipping in the stream(e.g. BinStorage) can in turn call their getNext()
     * for implementing this method. Those who cannot (e.g. PigStorage) need to
     * provide their own implementation.
     * Samplers must call this method to get next tuple and should never
     * directly call underlying loader's getNext() method.
     * @return the next tuple after skipping or null if there are no more
     * tuples to be processed.
     */

{code}

The comment can be updated to be explicit about the context in which 
getSampledTuple() would be called, something along the lines of:
{noformat}
The getSampledTuple() method will be called after a call to skip(). Hence the 
loader implementation has to handle the case where the current read position 
in the stream is not at the beginning of a record, and correctly return the 
next tuple starting from the current read position. In particular, the 
implementation needs to handle the following cases:
1) The current read position for the input stream is at the beginning of the 
stream - in this case getSampledTuple() should return the tuple representing 
the first record in the stream
2) The current read position for the input stream is in the middle of a record 
- in this case getSampledTuple() should return the tuple representing the next 
record by reading forward in the stream
3) The current read position for the input stream is exactly at the beginning 
of a record - in this case getSampledTuple() should return the tuple 
representing the record at the current read position
4) The current read position for the input stream is beyond end of file - in 
this case getSampledTuple() should return null
{noformat}

To keep the comment from being very verbose, the implementation details 
(whether to delegate to getNext() or not) can be omitted.
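
To make the four cases concrete, here is a rough sketch for a 
newline-delimited text format. It is not the PigStorage implementation from 
the patch. LineSampleSketch is a hypothetical class, it returns each record as 
a raw String instead of a Pig Tuple, and it assumes skipping is done by 
reading so that the last consumed byte is always known:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical newline-delimited reader, used only to illustrate the cases.
final class LineSampleSketch {
    private final InputStream in;
    // Last byte consumed. Starts as '\n' so that the beginning of the stream
    // counts as a record boundary (case 1).
    private int last = '\n';

    LineSampleSketch(InputStream in) {
        this.in = in;
    }

    /** Skips up to n bytes by reading them, remembering the last byte seen. */
    long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n) {
            int b = in.read();
            if (b == -1) {
                break;
            }
            last = b;
            skipped++;
        }
        return skipped;
    }

    /** Returns the next complete record, or null if there is none left. */
    String getSampledTuple() throws IOException {
        // Case 2: in the middle of a record, so discard the rest of it.
        while (last != '\n') {
            int b = in.read();
            if (b == -1) {
                return null; // the partial record ran to EOF
            }
            last = b;
        }
        // Cases 1 and 3: positioned at the beginning of a record.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            last = b;
            if (b == '\n') {
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
            buf.write(b);
        }
        // Case 4 (nothing left), or a final record without a trailing newline.
        return buf.size() == 0
                ? null
                : new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Reading byte by byte keeps the sketch short. A real loader would presumably 
buffer its input and convert the record into a tuple.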


> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.4.0
>
>         Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
> pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch
>
>
> Currently a sampling job requires that data already be stored in 
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
> order by this
> has mostly been acceptable, because users tend to use order by at the end of 
> their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an 
> order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this 
> requirement to read the entire input and write it back out will not be 
> acceptable.
> Join is often the first operation of a script, and thus is much more likely 
> to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, 
> using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a 
> Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>     
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped.  The return semantics are
>      * exactly the same as {@link java.io.InputStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>     
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data 
> implemented the SamplableLoader interface.  If so, rather than create an 
> initial MR
> job to do the translation it would create the sampling job, having 
> RandomSampleLoader use the user specified loader.
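
The check described in the issue could look roughly like the sketch below. 
The interfaces are stripped-down stand-ins for Pig's types, and 
SamplingPlanSketch, Plan and choose() are hypothetical names used only for 
illustration. This is not the actual MRCompiler code:

```java
import java.io.IOException;

// Stand-in for Pig's LoadFunc, reduced to a marker so the sketch compiles.
interface LoadFunc {
}

// Mirrors the interface proposed in the issue description.
interface SamplableLoader extends LoadFunc {
    long skip(long n) throws IOException;

    long getPosition() throws IOException;
}

// Hypothetical decision helper: which kind of job to create for sampling.
final class SamplingPlanSketch {
    enum Plan {
        SAMPLE_DIRECTLY,       // sampling job reads through the user's loader
        TRANSLATE_THEN_SAMPLE  // extra MR job rewrites the data first
    }

    static Plan choose(LoadFunc loader) {
        // If the loader can skip and report its position, no translation
        // job is needed before sampling.
        return (loader instanceof SamplableLoader)
                ? Plan.SAMPLE_DIRECTLY
                : Plan.TRANSLATE_THEN_SAMPLE;
    }
}
```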

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
