Yes, that should work. I will use InputFormat.getNext from the SampleLoader
to skip the records.
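
A rough sketch of the skip-then-parse idea, to make the intent concrete. It
does not use the real Pig or Hadoop classes: the Iterator stands in for the
raw record reader, and CountingParser stands in for the parsing that
LoadFunc.getNext does. The names SkipSampleSketch, CountingParser,
skipThenParse and sample are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins only: none of these types are Pig or Hadoop APIs.
class SkipSampleSketch {

    /** Stand-in for the loader's parsing step; counts how often it runs. */
    static class CountingParser {
        int parsed = 0;

        String[] parse(String rawRecord) {
            parsed++;
            return rawRecord.split("\t");
        }
    }

    /**
     * Advances the raw reader past numSkip records without parsing them,
     * then parses and returns the next one. Returns null when input runs out.
     */
    static String[] skipThenParse(Iterator<String> rawReader,
                                  CountingParser parser, int numSkip) {
        for (int i = 0; i < numSkip; i++) {
            if (!rawReader.hasNext()) {
                return null;
            }
            rawReader.next(); // raw record is read and discarded, never parsed
        }
        return rawReader.hasNext() ? parser.parse(rawReader.next()) : null;
    }

    /** Collects every (numSkip + 1)-th record as a sample. */
    static List<String[]> sample(Iterator<String> rawReader,
                                 CountingParser parser, int numSkip) {
        List<String[]> samples = new ArrayList<>();
        String[] tuple;
        while ((tuple = skipThenParse(rawReader, parser, numSkip)) != null) {
            samples.add(tuple);
        }
        return samples;
    }

    public static void main(String[] args) {
        List<String> input =
            Arrays.asList("a\t1", "b\t2", "c\t3", "d\t4", "e\t5", "f\t6");
        CountingParser parser = new CountingParser();
        List<String[]> samples = sample(input.iterator(), parser, 2);
        System.out.println(samples.size() + " samples, "
            + parser.parsed + " records parsed");
    }
}
```

With six input records and numSkip = 2, only the third and sixth records go
through the parser; the other four are read and dropped. The real version
would advance the underlying reader in place of rawReader.next().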
On 11/3/09 6:39 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
> We definitely want to avoid parsing every tuple when sampling. But do
> we need to implement a special function for it? Pig will have access
> to the InputFormat instance, correct? Can it not call
> InputFormat.getNext the desired number of times (which will not parse
> the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
> On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote:
>> In the new implementation of SampleLoader subclasses (used by order-
>> skew-join ..) as part of the loader redesign, we are not only reading
>> all the input records but also parsing them as Pig tuples.
>> This is because the SampleLoaders are wrappers around the actual input
>> loaders specified in the query. We can make things much faster by having a
>> skipNext() function (or skipNext(int numSkip)) which will avoid parsing the
>> record into a Pig tuple.
>> LoadFunc could optionally implement this (easy to implement) function
>> (which will be part of an interface) for improving speed of queries such as