You can simply skip specifying a schema in the load and access the fields either through the UDF or through positional references such as $0, $1, and so on.


For example:

A = load 'myfile' USING PigStorage();
B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
C = ...
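
To make the logic concrete outside of Pig, here is a rough plain-Python sketch of the same roll-up: fields are read by position only, and everything after the count is kept as one opaque key, so the number of trailing fields does not need to be known. This is only an illustration, not Pig code. The names round_hour and roll_up are hypothetical, and it assumes the timestamp is in epoch seconds, which the original mail does not state.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def round_hour(ts):
    """Round a timestamp down to the hour (assumes epoch seconds)."""
    return ts - ts % SECONDS_PER_HOUR


def roll_up(lines):
    """Sum counts per (hour, remaining fields) without a fixed schema."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Positional access only: $0 is the timestamp, $1 the count,
        # and everything from $2 onward is treated as one opaque key.
        ts, count, rest = int(fields[0]), int(fields[1]), tuple(fields[2:])
        totals[(round_hour(ts), rest)] += count
    # Emit rows as (hour, total_count, field1, ..., fieldN).
    return [(hour, total) + rest
            for (hour, rest), total in sorted(totals.items())]
```

Because the trailing fields are kept as a tuple, the same function works whether the input has N or M extra columns, as long as the count is consistent within one run.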



Regards,
Mridul

On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:
Hi,

Is there a way to read a collection (of unknown size) of tab-delimited
values into a single data type (tuple?) during the LOAD phase?

Here's specifically what I'm looking to do. I have a given input file format
of tab-delimited fields like so:

[timestamp] [count] [field1] [field2] [field3] .. [fieldN]

I'm writing a pig job to take many small files and roll up the counts for a
given time increment of a lesser granularity. For example, many files with
timestamps rounded to 5 minute intervals will be rolled into a single file
with 1 hour granularity.

I'm able to do this by grouping on the timestamp (rounded down to the hour)
and each of the fields shown if I know the number of fields and I list them
all explicitly. However, I'd like to write this script so that it works on
different input formats, some of which might have N fields while others have
M. For a given job run, the number of fields in the input files passed would
be fixed.

So I'd like to be able to do something like this in pseudo code:

LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
...
GROUP BY round_hour(timestamp), rest_of_line
[flatten group and sum counts]
...
STORE round_hour(timestamp), totalCount, rest_of_line

Where I know nothing about how many tokens are in rest_of_line. Any ideas
besides subclassing PigStorage or writing a new FileInputLoadFunc?

thanks,
Bill
