[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722706#action_12722706
 ] 

Ashutosh Chauhan commented on PIG-820:
--------------------------------------

In the patch RandomSampleLoader is marked as serializable and loader field in 
it is marked as transient. Since loader is  initialized in constructor and is 
used later on findbugs is complaining : "This class contains a field that is 
updated at multiple places in the class, thus it seems to be part of the state 
of the class.However, since the field is marked as transient and not set in 
readObject or readResolve, it will contain the default value in any 
deserialized instance of the class. " However there is no need for 
RandomSampleLoader to implement Serializable anyway (and thus loader to be 
marked as transient) because loader is reconstructed from FunSpec later on. 
Because of this reason, both PigStorage and BinStorage also doesnt implement 
serializable. Will be submitting a new patch with the required changes.


> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -----------------------------------------------------------------------------------------
>
>                 Key: PIG-820
>                 URL: https://issues.apache.org/jira/browse/PIG-820
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.3.0, 0.4.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: pig-820.patch
>
>
> Currently a sampling job requires that data already be stored in 
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
> order by this
> has mostly been acceptable, because users tend to use order by at the end of 
> their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an 
> order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this 
> requirement to read the entire input and write it back out will not be 
> acceptable.
> Join is often the first operation of a script, and thus is much more likely 
> to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, 
> using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a 
> Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
>     
>     /**
>      * Skip ahead in the input stream.
>      * @param n number of bytes to skip
>      * @return number of bytes actually skipped.  The return semantics are
>      * exactly the same as {...@link java.io.InpuStream#skip(long)}
>      */
>     public long skip(long n) throws IOException;
>     
>     /**
>      * Get the current position in the stream.
>      * @return position in the stream.
>      */
>     public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data 
> implemented the SamplableLoader interface.  If so, rather than create an 
> initial MR
> job to do the translation it would create the sampling job, having 
> RandomSampleLoader use the user specified loader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to