Alan Gates
Tue, 02 Feb 2010 16:25:24 -0800
Answers inlined:

On Feb 2, 2010, at 3:15 AM, Guy Jeffery wrote:

Hi,

Hope this gets to the right list... I'm fairly new to Pig, been playing around with it for a couple of days. Essentially I'm doing a bit of work to evaluate Pig and its ability to simplify the use of Hadoop - basically to allow users without a massive Java background to run Hadoop jobs.

There's a couple of issues I've got - which are probably very simple, and even more probably documented somewhere, but I can't find it.

First, I'm using Pig 0.5.0 and Hadoop 0.20.

1. Dynamically assigning a variable - I can use

    %declare my_count '0' as int;

But if I want to set this dynamically? e.g.

    A = GROUP srtdx ALL;
    B = FOREACH A GENERATE COUNT(srtdx);

I want to set this value B (which is a long) to a variable - how do I do it, or is it not possible? None of the following seem to work.

    %declare my_count B as int
    %declare my_count 'B' as int;
    %declare my_count `B` as int;
    %declare my_count ` FOREACH A GENERATE COUNT(srtdx)` as int;
Pig Latin is a dataflow language, not a traditional procedural programming language. It does not support variable declaration. %declare does not declare a variable; it is part of the pre-processor (somewhat analogous to #define in C), which substitutes text into the script before the script is parsed. The aliases on the left side of Pig Latin statements (A and B in your example) are relations (that is, collections of records), not scalar values. So B is a relation holding one record, not a long that can be handed to %declare.
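To make the distinction concrete, here is a minimal sketch. The file name, the schema, the parameter name min_c2, and the use of CROSS to carry the count along are illustrative assumptions, not something taken from the original script. It shows %declare used for plain text substitution, and one possible way to use a count while it remains a relation.

```pig
-- %declare is text substitution performed before the script is parsed.
-- min_c2 is a hypothetical parameter name.
%declare min_c2 10;

srtdx = LOAD 'my_file' AS (c1:chararray, c2:int);
big   = FILTER srtdx BY c2 > $min_c2;   -- after substitution: c2 > 10

-- The count is a relation containing a single record, not a scalar.
A = GROUP srtdx ALL;
B = FOREACH A GENERATE COUNT(srtdx) AS total;

-- One possible workaround: attach that single record to every row with
-- CROSS, then refer to the count by position like any other field.
C = CROSS srtdx, B;
D = FOREACH C GENERATE $0, $1, $2;      -- c1, c2, total
```

Because B has exactly one record, the CROSS yields one output row per row of srtdx, each carrying the total as an extra field.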
2. Is it possible to alter the datatype of an element in a tuple? e.g.

    A = LOAD 'my_file' as (c1:chararray, c2:int);
    B = FOREACH A GENERATE c1*2;

throws an error.
Cast the value, so

    A = load 'my_file' as (c1:chararray, c2:int);
    B = foreach A generate (int)c1 * 2;

should work. (We only added casts from chararray to int recently, so this particular cast may not be in the release you're using.)
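If the cast is not available in your release, another possibility is to declare the column as int when loading, so no cast is needed later. The sketch below shows both forms side by side. It assumes c1 always holds an integer in the input file, and the aliases A2 and B2 and the field name doubled are made up for illustration.

```pig
-- Option 1: load as chararray and cast where the number is needed.
A = LOAD 'my_file' AS (c1:chararray, c2:int);
B = FOREACH A GENERATE (int)c1 * 2 AS doubled;

-- Option 2 (assumes c1 always holds an integer): declare the column
-- as int in the load schema, so arithmetic works without a cast.
A2 = LOAD 'my_file' AS (c1:int, c2:int);
B2 = FOREACH A2 GENERATE c1 * 2 AS doubled;
```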
3. Picking a specific row from a bag - is there a SQL-like 'rownum' operator? If I have a bag of 50 elements can I do something like...?

    C = FILTER B BY (rownum <= 10);
No. Since Pig executes in parallel, with task partitioning done at runtime, it is not generally possible to assign row numbers. You can build a UDF that generates unique row ids, but they will not be ordered across different map tasks.
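If what you need is "the first 10 rows under some ordering" rather than true row numbers, one possible sketch is ORDER followed by LIMIT. This assumes your release supports LIMIT; the file name, the schema, and the choice of c2 as the sort key are also assumptions made for illustration.

```pig
-- Hypothetical input; c2 is an assumed sort key.
B = LOAD 'my_file' AS (c1:chararray, c2:int);

-- Sort first, then keep the first 10 records of the sorted relation.
S = ORDER B BY c2 DESC;
C = LIMIT S 10;
```

Without the ORDER step, LIMIT still returns 10 records, but which 10 is not defined.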
Alan.