Thanks Mridul, but how would I access the numbered fields 3..N when I don't know what N is? Are you suggesting I pass A to a custom UDF that converts each row to a tuple of [time, count, rest_of_line]?
On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan <[email protected]> wrote:

>
> You can simply skip specifying schema in the load - and access the fields
> either through the udf or through $0, etc positional indexes.
>
>
> Like :
>
> A = load 'myfile' USING PigStorage();
> B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
> C = ...
>
>
>
> Regards,
> Mridul
>
>
> On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:
>
>> Hi,
>>
>> Is there a way to read a collection (of unknown size) of tab-delimited
>> values into a single data type (tuple?) during the LOAD phase?
>>
>> Here's specifically what I'm looking to do. I have a given input file
>> format of tab-delimited fields like so:
>>
>> [timestamp] [count] [field1] [field2] [field3] .. [fieldN]
>>
>> I'm writing a pig job to take many small files and roll up the counts for
>> a given time increment of a lesser granularity. For example, many files
>> with timestamps rounded to 5 minute intervals will be rolled into a single
>> file with 1 hour granularity.
>>
>> I'm able to do this by grouping on the timestamp (rounded down to the
>> hour) and each of the fields shown if I know the number of fields and I
>> list them all explicitly. I'd like to write this script though that would
>> work on different input formats, some which might have N fields, where
>> others have M. For a given job run, the number of fields in the input
>> files passed would be fixed.
>>
>> So I'd like to be able to do something like this in pseudo code:
>>
>> LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
>> ...
>> GROUP BY round_hour(timestamp), rest_of_line
>> [flatten group and sum counts]
>> ...
>> STORE round_hour(timestamp), totalCount, rest_of_line
>>
>> Where I know nothing about how many tokens are in rest_of_line. Any ideas
>> besides subclassing PigStorage or writing a new FileInputLoadFunc?
>>
>> thanks,
>> Bill
>>
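To make the idea in this thread concrete, here is a minimal sketch in plain Python (not Pig, and not using any Pig API) of the logic such a UDF and the subsequent GROUP/SUM would need: split each line on the first two tabs only, so everything after the count stays together as one opaque rest_of_line string regardless of how many fields it holds. The function names split_row, round_hour and roll_up are hypothetical, and the sketch assumes timestamps are integer epoch seconds, which the thread does not state.

```python
from collections import defaultdict


def split_row(line, sep="\t"):
    """Split a line into (timestamp, count, rest_of_line).

    Only the first two separators are used, so rest_of_line keeps
    fields 3..N intact without needing to know N.
    """
    parts = line.rstrip("\n").split(sep, 2)
    if len(parts) < 2:
        raise ValueError("expected at least a timestamp and a count")
    rest_of_line = parts[2] if len(parts) == 3 else ""
    return parts[0], int(parts[1]), rest_of_line


def round_hour(epoch_seconds):
    """Round a timestamp down to the hour (ASSUMPTION: epoch seconds)."""
    return epoch_seconds - (epoch_seconds % 3600)


def roll_up(lines):
    """Sum counts per (rounded hour, rest_of_line) key."""
    totals = defaultdict(int)
    for line in lines:
        timestamp, count, rest_of_line = split_row(line)
        totals[(round_hour(int(timestamp)), rest_of_line)] += count
    return dict(totals)
```

Because rest_of_line is treated as a single string, the same code works whether a given job's input has N fields or M; the grouping key compares the remaining fields as one value.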
