Thanks Mridul, but how would I access the items in the numbered fields 3..N
where I don't know what N is? Are you suggesting I pass A to a custom UDF to
convert to a tuple of [time, count, rest_of_line]?
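
For reference, the conversion asked about above can be sketched outside Pig. This is only an illustration in Python of the logic such a UDF or loader would need, not Pig code and not anything Mridul proposed: split each line on the first two tabs only, so fields 3..N stay together as one opaque string whatever N is. The names split_record and roll_up are hypothetical, and the sketch assumes the timestamp is in epoch seconds, which the thread does not say.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def round_hour(timestamp):
    """Round a timestamp down to the start of its hour.

    Assumption: timestamps are integer epoch seconds.
    """
    return timestamp - (timestamp % SECONDS_PER_HOUR)


def split_record(line):
    """Split a line into (timestamp, count, rest_of_line).

    Only the first two tabs are used as separators, so rest_of_line
    keeps fields 3..N as a single tab-delimited string. A line with
    fewer than three fields raises ValueError.
    """
    timestamp, count, rest_of_line = line.rstrip("\n").split("\t", 2)
    return int(timestamp), int(count), rest_of_line


def roll_up(lines):
    """Sum counts per (hour, rest_of_line) and return output lines."""
    totals = defaultdict(int)
    for line in lines:
        timestamp, count, rest_of_line = split_record(line)
        totals[(round_hour(timestamp), rest_of_line)] += count
    return [
        "%d\t%d\t%s" % (hour, total, rest)
        for (hour, rest), total in sorted(totals.items())
    ]
```

Because rest_of_line is never split further, the same code handles inputs with N fields or M fields without knowing either number. The group key is the pair (rounded hour, rest_of_line), which matches the GROUP BY in the original pseudo code.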


On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan
<[email protected]> wrote:

>
> You can simply skip specifying schema in the load - and access the fields
> either through the udf or through $0, etc positional indexes.
>
>
> Like :
>
> A = load 'myfile' USING PigStorage();
> B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
> C = ...
>
>
>
> Regards,
> Mridul
>
>
> On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:
>
>> Hi,
>>
>> Is there a way to read a collection (of unknown size) of tab-delimited
>> values into a single data type (tuple?) during the LOAD phase?
>>
>> Here's specifically what I'm looking to do. I have a given input file format
>> of tab-delimited fields like so:
>>
>> [timestamp] [count] [field1] [field2] [field3] .. [fieldN]
>>
>> I'm writing a Pig job to take many small files and roll up the counts for a
>> given time increment of a coarser granularity. For example, many files with
>> timestamps rounded to 5 minute intervals will be rolled into a single file
>> with 1 hour granularity.
>>
>> I'm able to do this by grouping on the timestamp (rounded down to the hour)
>> and each of the fields shown if I know the number of fields and I list them
>> all explicitly. I'd like to write this script, though, so that it works on
>> different input formats, some of which might have N fields where others have
>> M. For a given job run, the number of fields in the input files passed would
>> be fixed.
>>
>> So I'd like to be able to do something like this in pseudo code:
>>
>> LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
>> ...
>> GROUP BY round_hour(timestamp), rest_of_line
>> [flatten group and sum counts]
>> ...
>> STORE round_hour(timestamp), totalCount, rest_of_line
>>
>> Where I know nothing about how many tokens are in rest_of_line. Any ideas
>> besides subclassing PigStorage or writing a new FileInputLoadFunc?
>>
>> thanks,
>> Bill
>>
>
>
