Correct, I don't need to know the arity of the tuple, and if I LOAD without specifying the fields as you show, I should be able to effectively STORE the same data. The problem, though, is that I need to include both the tuple and the timestamp in the grouping (but not the count), then sum the counts.
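[Editor's note: an alternative to splicing field lists into the script is range projection, where `$2 ..` means "field 2 through the last field" without naming them. This is a hedged sketch: range projection landed in a later Pig release (0.9) than was current for this thread, and ROUND_HOUR is a hypothetical UDF that floors a millisecond timestamp to the hour.]

-- Hedged sketch: assumes a Pig version with range projection ($2 ..)
-- and a hypothetical ROUND_HOUR UDF (floor a ms timestamp to the hour).
A = LOAD 'input' USING PigStorage('\t');
B = FOREACH A GENERATE ROUND_HOUR($0) AS hour, (long)$1 AS cnt, $2 .. ;
C = GROUP B BY (hour, $2 ..);    -- group on the hour plus all trailing fields
D = FOREACH C GENERATE FLATTEN(group), SUM(B.cnt) AS total;
STORE D INTO 'output' USING PigStorage('\t');

Note that the summed count lands in the last column rather than the second; reordering the output when the arity is unknown may still need a small UDF.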
As an example, this:

1271201400000 3 1770 162 5
1271201400000 4 2000 162 100
1271201700000 3 1770 162 5
1271201700000 4 2000 162 100

Would become this (where 1271199600000 is the hour that the two timestamps both roll up to):

1271199600000 6 1770 162 5
1271199600000 8 2000 162 100

So in my case I'd like to be able to load timestamps, count and tuple, then group on timestamp and tuple, and output in the same format of timestamp, count, tuple.

The easiest hack I've come up with for now is to dynamically insert the field definitions in my script before I run it. So in the example above I would insert 'f1, f2, f3' everywhere I need to reference the tuple. Another run might insert 'f1, f2' for an input that only has 2 extra fields.

On Thu, May 20, 2010 at 12:39 AM, Mridul Muralidharan <[email protected]> wrote:

> I am not sure what the processing is once the grouping is done, but each
> tuple has a size() (for arity) method which gives us the number of fields
> in that tuple [if using it in a UDF]. So that can be used to aid in
> computation.
>
> If you are interested in aggregating and simply storing it - you don't
> really need to know the arity of a tuple, right? (That is, group by
> timestamp, and store - PigStorage should continue to store the variable
> number of fields as was present in the input.)
>
> Regards,
> Mridul
>
> On Thursday 20 May 2010 05:39 AM, Bill Graham wrote:
>
>> Thanks Mridul, but how would I access the items in the numbered fields
>> 3..N where I don't know what N is? Are you suggesting I pass A to a
>> custom UDF to convert to a tuple of [time, count, rest_of_line]?
>>
>> On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan
>> <[email protected]> wrote:
>>
>> You can simply skip specifying schema in the load - and access the
>> fields either through the UDF or through $0, etc. positional indexes.
>>
>> Like:
>>
>> A = load 'myfile' USING PigStorage();
>> B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
>> C = ...
>>
>> Regards,
>> Mridul
>>
>> On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:
>>
>> Hi,
>>
>> Is there a way to read a collection (of unknown size) of tab-delimited
>> values into a single data type (tuple?) during the LOAD phase?
>>
>> Here's specifically what I'm looking to do. I have a given input file
>> format of tab-delimited fields like so:
>>
>> [timestamp] [count] [field1] [field2] [field3] .. [fieldN]
>>
>> I'm writing a pig job to take many small files and roll up the counts
>> for a given time increment of a lesser granularity. For example, many
>> files with timestamps rounded to 5 minute intervals will be rolled into
>> a single file with 1 hour granularity.
>>
>> I'm able to do this by grouping on the timestamp (rounded down to the
>> hour) and each of the fields shown, if I know the number of fields and
>> I list them all explicitly. I'd like to write this script, though, so
>> that it works on different input formats, some of which might have N
>> fields while others have M. For a given job run, the number of fields
>> in the input files passed would be fixed.
>>
>> So I'd like to be able to do something like this in pseudo code:
>>
>> LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
>> ...
>> GROUP BY round_hour(timestamp), rest_of_line
>> [flatten group and sum counts]
>> ...
>> STORE round_hour(timestamp), totalCount, rest_of_line
>>
>> Where I know nothing about how many tokens are in rest_of_line. Any
>> ideas besides subclassing PigStorage or writing a new FileInputLoadFunc?
>>
>> thanks,
>> Bill
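
[Editor's note: the original question - packing the unknown trailing fields into a single tuple-typed field - can be approximated on newer Pig releases with the TOTUPLE builtin plus range projection. Both are post-0.7 features, so treat this as a hedged sketch rather than something available when this thread was written; ROUND_HOUR is again a hypothetical hour-flooring UDF.]

A = LOAD 'input' USING PigStorage('\t');
-- Pack the variable trailing fields into one tuple-typed field.
B = FOREACH A GENERATE (long)$0 AS ts, (long)$1 AS cnt, TOTUPLE($2 ..) AS rest;
C = GROUP B BY (ROUND_HOUR(ts), rest);
-- FLATTEN(group.$1) expands the packed tuple back into separate columns,
-- giving output in the original column order: hour, total, fields...
D = FOREACH C GENERATE group.$0 AS hour, SUM(B.cnt) AS total, FLATTEN(group.$1);
STORE D INTO 'output' USING PigStorage('\t');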
