Hmm, I am not sure if you can do this without a UDF - you want to replace fields of a tuple while leaving the rest of it intact. If you were simply adding to it, it would have been possible [a <new_field, *> style approach].
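
A minimal (untested) sketch of what such a UDF could look like, using the Tuple.size() method mentioned below to walk fields of unknown arity - the class name and argument convention here are just for illustration, not anything that ships with Pig:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical example: rebuilds a tuple, replacing the first two
// fields (e.g. timestamp and count) with the values passed in and
// copying the rest intact.
public class ReplaceHead extends EvalFunc<Tuple> {
    private static final TupleFactory tf = TupleFactory.getInstance();

    // input: (new_timestamp, new_count, original_tuple)
    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() != 3) {
            return null;
        }
        Tuple original = (Tuple) input.get(2);
        Tuple out = tf.newTuple();
        out.append(input.get(0));  // replacement timestamp
        out.append(input.get(1));  // replacement (summed) count
        for (int i = 2; i < original.size(); i++) {
            out.append(original.get(i));  // keep the rest as-is
        }
        return out;
    }
}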


Regards,
Mridul


On Thursday 20 May 2010 10:35 PM, Bill Graham wrote:
Correct, I don't need to know the arity of the tuple, and if I LOAD
without specifying the fields like you show, I should be able to
effectively STORE the same data. The problem, though, is that I need to
include both the tuple and the timestamp in the grouping (but not the
count), and then sum the counts.

As an example, this:

1271201400000   3      1770    162     5
1271201400000   4      2000    162     100
1271201700000   3      1770    162     5
1271201700000   4      2000    162     100

Would become this (where 1271199600000 is the hour that the two
timestamps both roll up to):

1271199600000   6      1770    162     5
1271199600000   8      2000    162     100

So in my case I'd like to be able to load timestamp, count, and tuple,
then group on timestamp and tuple, and output in the same format of
timestamp, count, tuple.

The easiest hack I've come up with for now is to dynamically insert the
field definitions in my script before I run it. So in the example above
I would insert 'f1, f2, f3' everywhere I need to reference the tuple.
Another run might insert 'f1, f2' for an input that only has 2 extra fields.
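
For the example above, the generated script ends up looking roughly
like this (untested sketch; round_hour stands in for whatever rollup
UDF gets used, and the paths and field names are just placeholders):

-- explicit-schema version for an input with three extra fields
A = LOAD 'input' USING PigStorage('\t')
    AS (ts:long, cnt:long, f1, f2, f3);
B = GROUP A BY (round_hour(ts), f1, f2, f3);
-- emit hour, summed count, then the grouped fields, matching the input layout
C = FOREACH B GENERATE group.$0, SUM(A.cnt), group.$1, group.$2, group.$3;
STORE C INTO 'output' USING PigStorage('\t');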


On Thu, May 20, 2010 at 12:39 AM, Mridul Muralidharan
<[email protected]> wrote:



    I am not sure what the processing is once the grouping is done, but
    each tuple has a size() method (for arity) which gives us the number
    of fields in that tuple [if used in a UDF].
    So that can be used to aid in computation.


    If you are interested in aggregating and simply storing it - you
    don't really need to know the arity of a tuple, right? (That is,
    group by timestamp, and store - PigStorage should continue to store
    the variable number of fields as were present in the input.)
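
    Roughly (untested sketch; round_hour being whatever rollup UDF you
    use):

    A = LOAD 'myfile' USING PigStorage();
    B = GROUP A BY round_hour($0);
    -- FLATTEN(A) emits the original tuples, whatever arity each had,
    -- and PigStorage writes out however many fields it is given
    C = FOREACH B GENERATE FLATTEN(A);
    STORE C INTO 'out' USING PigStorage();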



    Regards,
    Mridul


    On Thursday 20 May 2010 05:39 AM, Bill Graham wrote:

        Thanks Mridul, but how would I access the items in the numbered
        fields 3..N where I don't know what N is? Are you suggesting I
        pass A to a custom UDF to convert to a tuple of [time, count,
        rest_of_line]?


        On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan
        <[email protected]> wrote:


            You can simply skip specifying a schema in the load, and
            access the fields either through a UDF or through $0, etc.
            positional indexes.


            Like:

            -- no schema in the LOAD, so fields stay positional ($0, $1, ...)
            A = load 'myfile' USING PigStorage();
            -- round_hour is a UDF; $PARALLELISM is substituted at run time
            B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
            C = ...



            Regards,
            Mridul


            On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:

                Hi,

                Is there a way to read a collection (of unknown size)
                of tab-delimited values into a single data type (tuple?)
                during the LOAD phase?

                Here's specifically what I'm looking to do. I have a
                given input file format of tab-delimited fields like so:

                [timestamp] [count] [field1] [field2] [field3] .. [fieldN]

                I'm writing a Pig job to take many small files and roll
                up the counts into a time increment of lesser
                granularity. For example, many files with timestamps
                rounded to 5-minute intervals will be rolled into a
                single file with 1-hour granularity.

                I'm able to do this by grouping on the timestamp
                (rounded down to the hour) and each of the fields shown,
                if I know the number of fields and I list them all
                explicitly. I'd like to write this script, though, so
                that it would work on different input formats, some of
                which might have N fields where others have M. For a
                given job run, the number of fields in the input files
                passed would be fixed.

                So I'd like to be able to do something like this in
                pseudo code:

                LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
                ...
                GROUP BY round_hour(timestamp), rest_of_line
                [flatten group and sum counts]
                ...
                STORE round_hour(timestamp), totalCount, rest_of_line

                Where I know nothing about how many tokens are in
                rest_of_line. Any ideas besides subclassing PigStorage
                or writing a new FileInputLoadFunc?

                thanks,
                Bill





