Even if you're using Amazon Elastic MapReduce you can specify additional named parameters when running scripts. You can put variable placeholders in your script and then pass values for them on the console, or specify defaults. You can also run your scripts in interactive mode so you have complete control over execution. And you can always hardcode the paths when all else fails.
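For example, a minimal sketch (the script name, paths, and parameter names here are just placeholders, not something from this thread):

-- combine.pig: $INPUT1, $INPUT2 and $OUTPUT are filled in at launch time
%default OUTPUT 'output/path'
google = LOAD '$INPUT1' AS (f1:chararray, f2:chararray);
baidu  = LOAD '$INPUT2' AS (f1:chararray, f2:chararray);
STORE google INTO '$OUTPUT/google_all';
STORE baidu  INTO '$OUTPUT/baidu_all';

and then run it with something like:

pig -param INPUT1=input/path/google_2010_01_21.csv -param INPUT2=input/path/baidu_2010_01_01.csv combine.pig

(on Elastic MapReduce the same -param arguments can go in the extra script arguments on the console, as far as I know).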
On Mar 3, 2010, at 8:45 PM, Jumping <[email protected]> wrote:
I am using MapReduce on Amazon, and there is another problem: how do I use two "$INPUT" parameters in a Pig script?
Best Regards,
Jumping Qu
On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman <[email protected]> wrote:
Just curious,
What solution did you use?
On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:
Thanks all of you guys.
Best Regards,
Jumping Qu
On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]> wrote:
In this case, why wouldn't you simply use globbing in your load statements? Something like:

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);
STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';
On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:
Actually I was using another loader, and I just tried with PigStorage (Pig 0.6) and it seems to work too.

If your input file has two columns, this will have the expected schema and data:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
A: {f1: chararray,f2: chararray,filename: chararray}

If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column will be null.

So in practice the loader loads the data "independently" and then "casts" it to the schema you provided. That said, I am not saying that it is a very clean solution.
Thanks,
Romain
2010/3/2 Mridul Muralidharan <[email protected]>
I am not sure if this will work as you expect. Depending on which implementation of PigStorage you end up using, it might exhibit different behavior.

If I am not wrong, currently, for example, if you specify something like:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);

your code will end up generating a tuple of 4 fields - the fileName always being 'null' and the actual filename you inserted through MyLoader ending up being the 4th field (and so not 'seen' by Pig - not sure what happens if you do a join, etc. with this tuple, though! Essentially the runtime is not consistent with the script schema).

Note - this is implementation-specific behavior, which could probably have been fixed by the implementation-specific hack "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is the last field expected]. As expected, it is brittle code.

From a while back, I remember facing issues with Pig's implicit conversion to/from bytearray, the implicit project which was introduced, insertion of nulls to extend to the schema specified (the above behavior), etc. So you would become dependent on the implementation changes.

I don't think BinStorage and PigStorage have been written with inheritance in mind ...
Regards,
Mridul
On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
Hi,
In Pig 0.6 you can extend the PigStorage and grab the name of the file with something like this:
private String fileName; // file currently being read, captured in bindTo()

@Override
public void bindTo(String fileName, BufferedPositionedInputStream is,
                   long offset, long end) throws IOException {
    super.bindTo(fileName, is, offset, end);
    // In your case, match with a regexp and keep only the group with the
    // name (e.g. google, baidu)
    this.fileName = fileName;
}

@Override
public Tuple getNext() throws IOException {
    Tuple next = super.getNext();
    if (next != null) {
        next.append(fileName);
    }
    return next;
}
Then you can group on the name and split on it.
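For instance, a rough sketch of that last step, assuming the loader keeps just the site name (google or baidu) in fileName; the paths and alias names below are only placeholders:

A = LOAD 'input/path/*' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
SPLIT A INTO google_rows IF fileName == 'google', baidu_rows IF fileName == 'baidu';
STORE google_rows INTO 'output/path/google_all';
STORE baidu_rows INTO 'output/path/baidu_all';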
Thanks,
Romain
On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
Hi,

Can Pig recognize the names of the files it is importing? If so, how? I want to combine them according to the filename.

Example:

google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
....

Sort and combine by name, then output two files, google_all.csv and baidu_all.csv, in a pig script.
Best Regards,
Jumping Qu
------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under budget.)
--
Zaki Rahaman