At the moment the answer is "preprocessor", I believe: collapse the trailing fields into a single column before the LOAD.
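For example, a minimal preprocessing sketch (the `collapse` function and the `REST_SEP` separator are my own names, not anything in Pig): join fields 3..N with a character that never appears in the data, so that `PigStorage('\t')` sees exactly three columns and `rest_of_line` can be grouped on as one opaque field.

```python
import sys

# Assumed separator for the collapsed tail fields; any character
# guaranteed absent from the data would work (Ctrl-A is a common pick).
REST_SEP = '\x01'

def collapse(line, sep='\t'):
    """Turn 'ts<TAB>count<TAB>f1<TAB>...<TAB>fN' into
    'ts<TAB>count<TAB>f1<SEP>...<SEP>fN' (three tab-delimited columns)."""
    fields = line.rstrip('\n').split(sep)
    timestamp, count, rest = fields[0], fields[1], fields[2:]
    return sep.join([timestamp, count, REST_SEP.join(rest)])

if __name__ == '__main__':
    # Run as a filter over the raw input files before handing them to Pig.
    for line in sys.stdin:
        print(collapse(line))
```

The Pig script can then LOAD the three columns, GROUP on the rounded timestamp and the collapsed field, SUM the counts, and STORE the result; the collapsed field passes through untouched regardless of how many tokens it holds.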
On Wed, May 19, 2010 at 3:37 PM, Bill Graham <[email protected]> wrote:
> Hi,
>
> Is there a way to read a collection (of unknown size) of tab-delimited
> values into a single data type (tuple?) during the LOAD phase?
>
> Here's specifically what I'm looking to do. I have a given input file
> format of tab-delimited fields like so:
>
> [timestamp] [count] [field1] [field2] [field3] .. [fieldN]
>
> I'm writing a Pig job to take many small files and roll up the counts
> for a given time increment of a lesser granularity. For example, many
> files with timestamps rounded to 5-minute intervals will be rolled into
> a single file with 1-hour granularity.
>
> I'm able to do this by grouping on the timestamp (rounded down to the
> hour) and each of the fields shown, if I know the number of fields and
> I list them all explicitly. I'd like to write this script, though, so
> that it would work on different input formats, some of which might have
> N fields, where others have M. For a given job run, the number of
> fields in the input files passed would be fixed.
>
> So I'd like to be able to do something like this in pseudo code:
>
> LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
> ...
> GROUP BY round_hour(timestamp), rest_of_line
> [flatten group and sum counts]
> ...
> STORE round_hour(timestamp), totalCount, rest_of_line
>
> Where I know nothing about how many tokens are in rest_of_line. Any
> ideas besides subclassing PigStorage or writing a new FileInputLoadFunc?
>
> thanks,
> Bill
