At the moment, I believe the answer is "Preprocessor".
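
One way to read that, as a rough sketch rather than a tested recipe: since the number of fields is fixed for a given job run, the script text can be generated before Pig sees it. The Python below is only an illustration of that idea. The function name build_rollup_script, the relation names, the field1..fieldN column names, the chararray types and the $INPUT/$OUTPUT parameters are all assumptions made up for this example, and round_hour is the hypothetical UDF from the pseudocode quoted below (it would still have to be written and registered).

```python
# Sketch only: emits Pig Latin for an input with a known number of trailing
# fields. Nothing here is taken from Pig itself except basic Pig Latin syntax.

def build_rollup_script(n_fields):
    """Return Pig Latin text that sums counts per hour and per field combination."""
    if n_fields < 1:
        raise ValueError("need at least one trailing field")

    # Hypothetical column names: field1 .. fieldN.
    names = ["field%d" % i for i in range(1, n_fields + 1)]

    # 'cnt' stands in for the [count] column; all other fields are read as text.
    schema = ", ".join(
        ["timestamp:long", "cnt:long"] + [n + ":chararray" for n in names]
    )
    cols = ", ".join(names)
    group_cols = ", ".join("group." + n for n in names)

    lines = [
        # $INPUT and $OUTPUT are left for Pig's own parameter substitution.
        "raw = LOAD '$INPUT' USING PigStorage('\\t') AS (%s);" % schema,
        # round_hour is the hypothetical UDF from the original pseudocode.
        "hourly = FOREACH raw GENERATE round_hour(timestamp) AS hour, cnt, %s;" % cols,
        "grouped = GROUP hourly BY (hour, %s);" % cols,
        "rolled = FOREACH grouped GENERATE group.hour, SUM(hourly.cnt), %s;" % group_cols,
        "STORE rolled INTO '$OUTPUT' USING PigStorage('\\t');",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result of build_rollup_script(N) to a file and running it with pig -param INPUT=... -param OUTPUT=... would then keep the field count out of the hand-written script; only N changes between input formats.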

On Wed, May 19, 2010 at 3:37 PM, Bill Graham <[email protected]> wrote:

> Hi,
>
> Is there a way to read a collection (of unknown size) of tab-delimited
> values into a single data type (tuple?) during the LOAD phase?
>
> Here's specifically what I'm looking to do. I have a given input file format
> of tab-delimited fields like so:
>
> [timestamp] [count] [field1] [field2] [field3] .. [fieldN]
>
> I'm writing a pig job to take many small files and roll up the counts for a
> given time increment of a lesser granularity. For example, many files with
> timestamps rounded to 5 minute intervals will be rolled into a single file
> with 1 hour granularity.
>
> I'm able to do this by grouping on the timestamp (rounded down to the hour)
> and each of the fields shown, if I know the number of fields and list them
> all explicitly. I'd like to write the script, though, so that it works on
> different input formats, some of which might have N fields while others have
> M. For a given job run, the number of fields in the input files passed would
> be fixed.
>
> So I'd like to be able to do something like this in pseudo code:
>
> LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
> ...
> GROUP BY round_hour(timestamp), rest_of_line
> [flatten group and sum counts]
> ...
> STORE round_hour(timestamp), totalCount, rest_of_line
>
> Where I know nothing about how many tokens are in rest_of_line. Any ideas
> besides subclassing PigStorage or writing a new FileInputLoadFunc?
>
> thanks,
> Bill
>
