I am not sure what processing you do once the grouping is done, but
each tuple has a size() method (its arity) that returns the number of
fields in that tuple, if you are working inside a UDF.
You can use that to aid the computation.
If you are only interested in aggregating and storing the result, you
don't really need to know the arity of a tuple, right? (That is, group by
timestamp and store; PigStorage should continue to store the variable
number of fields that were present in the input.)
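As a rough sketch of how the arity could be used, here is the splitting
logic in plain Python rather than the Pig Java API. A list stands in for
the tuple, len() plays the role of size(), and split_record is a made-up
name for illustration only:

```python
def split_record(fields):
    """Split one record into (timestamp, count, rest).

    `fields` is a plain list standing in for a Pig tuple; this is an
    illustration of the idea, not Pig's actual UDF interface.
    """
    arity = len(fields)  # analogous to calling size() on the tuple
    if arity < 2:
        raise ValueError("expected at least timestamp and count, got %d field(s)" % arity)
    timestamp, count = fields[0], fields[1]
    # Everything from position 2 up to the arity is the variable-length tail.
    rest = tuple(fields[2:arity])
    return timestamp, count, rest
```

The tail comes back as a tuple so it can be used directly as part of a
grouping key, whatever its length turns out to be.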
Regards,
Mridul
On Thursday 20 May 2010 05:39 AM, Bill Graham wrote:
Thanks Mridul, but how would I access the items in the numbered fields
3..N where I don't know what N is? Are you suggesting I pass A to a
custom UDF to convert to a tuple of [time, count, rest_of_line]?
On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan wrote:
You can simply skip specifying a schema in the load and access the
fields either through a UDF or through positional references such as
$0, $1, and so on.
For example:
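-- No AS clause, so fields are addressed by position ($0, $1, ...).
A = load 'myfile' USING PigStorage();
-- round_hour is a user-supplied UDF; $PARALLELISM is a script parameter.
B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
C = ...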
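To make the grouping step concrete, here is a small sketch in plain
Python (not Pig) of what grouping schema-less rows on a rounded first
field amounts to. It assumes the timestamp in position 0 is in epoch
seconds, which is only a guess about the input; round_hour and
group_by_hour here are illustrative stand-ins, not the real UDF:

```python
SECONDS_PER_HOUR = 3600


def round_hour(epoch_seconds):
    """Round a Unix timestamp (assumed to be in seconds) down to its hour."""
    return epoch_seconds - (epoch_seconds % SECONDS_PER_HOUR)


def group_by_hour(rows):
    """Group rows of any length by the rounded value of their first field."""
    groups = {}
    for row in rows:
        key = round_hour(int(row[0]))  # row[0] corresponds to $0
        groups.setdefault(key, []).append(row)
    return groups
```

Rows keep all of their fields inside each group, so nothing needs to
know how many fields there are at this stage.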
Regards,
Mridul
On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:
Hi,
Is there a way to read a collection (of unknown size) of tab-delimited
values into a single data type (tuple?) during the LOAD phase?
Here's specifically what I'm looking to do. I have a given input file
format of tab-delimited fields like so:
[timestamp] [count] [field1] [field2] [field3] .. [fieldN]
I'm writing a Pig job to take many small files and roll up the counts
for a given time increment of coarser granularity. For example, many
files with timestamps rounded to 5 minute intervals will be rolled into
a single file with 1 hour granularity.
I'm able to do this by grouping on the timestamp (rounded down to the
hour) and each of the fields shown, if I know the number of fields and
list them all explicitly. I'd like to write this script, though, so that
it works on different input formats, some of which might have N fields
where others have M. For a given job run, the number of fields in the
input files passed would be fixed.
So I'd like to be able to do something like this in pseudo code:
LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
...
GROUP BY round_hour(timestamp), rest_of_line
[flatten group and sum counts]
...
STORE round_hour(timestamp), totalCount, rest_of_line
Where I know nothing about how many tokens are in rest_of_line. Any
ideas besides subclassing PigStorage or writing a new FileInputLoadFunc?
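For reference, the result the pseudocode is after can be sketched
outside Pig in plain Python. This is only an illustration of the intended
output, not a Pig solution; it assumes timestamps are epoch seconds and
counts are integers, and the name rollup is made up:

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def rollup(lines):
    """Sum counts per (hour, rest_of_line) over tab-delimited input lines."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: field 0 is epoch seconds, field 1 is an integer count.
        hour = int(fields[0]) // SECONDS_PER_HOUR * SECONDS_PER_HOUR
        rest = tuple(fields[2:])  # however many fields remain
        totals[(hour, rest)] += int(fields[1])
    # Emit one tab-delimited line per key, ordered by hour and then by rest.
    return [
        "\t".join([str(hour), str(total)] + list(rest))
        for (hour, rest), total in sorted(totals.items())
    ]
```

The number of trailing fields never appears in the code, which is the
property wanted from the Pig script.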
thanks,
Bill