I am not sure what the processing is once the grouping is done, but each tuple has a size() method (its arity) which gives the number of fields in that tuple when you are inside a UDF.
So that can be used to aid the computation.
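For example, a UDF that rolls up the counts could use size() to skip over the variable trailing fields. This is only a sketch - the class name SumCounts and the field layout (count in position 1) are my assumptions, and the count is parsed from its string form since an unschema'd load gives bytearrays:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class SumCounts extends EvalFunc<Long> {
        @Override
        public Long exec(Tuple input) throws IOException {
            // input holds the bag produced by GROUP:
            // {(timestamp, count, field1, ..., fieldN)}
            DataBag bag = (DataBag) input.get(0);
            long total = 0;
            for (Tuple t : bag) {
                // t.size() gives the arity, so the trailing fields
                // (positions 2 .. t.size()-1) can be ignored or copied
                // without knowing N in advance.
                total += Long.parseLong(t.get(1).toString());
            }
            return total;
        }
    }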


If you are only interested in aggregating and storing the result, you don't really need to know the arity of a tuple, right? (That is, group by timestamp and store; PigStorage should continue to store the variable number of fields as was present in the input.)
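A minimal sketch of that flow, assuming round_hour is a user-supplied UDF and the count sits in $1 (this only emits the hour and total; carrying the trailing fields through would need a UDF along the lines above):

    -- load with no schema; each line keeps its variable number of fields
    A = LOAD 'myfile' USING PigStorage('\t');
    -- group on the hour-rounded timestamp in $0
    B = GROUP A BY round_hour($0);
    -- sum the counts per hour
    C = FOREACH B GENERATE group, SUM(A.$1);
    STORE C INTO 'rollup' USING PigStorage('\t');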



Regards,
Mridul

On Thursday 20 May 2010 05:39 AM, Bill Graham wrote:
Thanks Mridul, but how would I access the items in the numbered fields
3..N where I don't know what N is? Are you suggesting I pass A to a
custom UDF to convert to a tuple of [time, count, rest_of_line]?


On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan
<[email protected]> wrote:


    You can simply skip specifying a schema in the load - and access the
    fields either through a UDF or through positional indexes like $0.


    Like :

    A = load 'myfile' USING PigStorage();
    B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
    C = ...



    Regards,
    Mridul


    On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:

        Hi,

        Is there a way to read a collection (of unknown size) of
        tab-delimited
        values into a single data type (tuple?) during the LOAD phase?

        Here's specifically what I'm looking to do. I have a given input
        file format
        of tab-delimited fields like so:

        [timestamp] [count] [field1] [field2] [field2] .. [fieldN]

        I'm writing a pig job to take many small files and roll up the
        counts for a
        given time increment of a lesser granularity. For example, many
        files with
        timestamps rounded to 5 minute intervals will be rolled into a
        single file
        with 1 hour granularity.

        I'm able to do this by grouping on the timestamp (rounded down
        to the hour)
        and on each of the other fields, but only if I know the number
        of fields and
        list them all explicitly. I'd like to write this script so that
        it works on
        different input formats, some of which might have N fields where
        others have
        M. For a given job run, the number of fields in the input files
        passed would
        be fixed.

        So I'd like to be able to do something like this in pseudo code:

        LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
        ...
        GROUP BY round_hour(timestamp), rest_of_line
        [flatten group and sum counts]
        ...
        STORE round_hour(timestamp), totalCount, rest_of_line

        Where I know nothing about how many tokens are in rest_of_line.
        Any ideas
        besides subclassing PigStorage or writing a new FileInputLoadFunc?

        thanks,
        Bill



