Even if you're using Amazon Elastic MapReduce, you can specify additional named parameters when running scripts. You can put variable placeholders in your script and then pass their values on the console, or specify defaults. You can also run your scripts in interactive mode so you have complete control over execution. And you can always hardcode the values when all else fails.

On Mar 3, 2010, at 8:45 PM, Jumping <[email protected]> wrote:

I am using MapReduce on Amazon, and there is another problem: how do I
use two "$INPUT" parameters in a Pig script?

Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)


On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman <[email protected]> wrote:

Just curious,

What solution did you use?


On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:

Thanks all of you guys.


Best Regards,
Jumping Qu


On Thu, Mar 4, 2010 at 3:12 AM, Zaki Rahaman <[email protected]> wrote:

In this case, why wouldn't you simply use globbing in your load
statements? Something like:

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);

STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';
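
To preview which of the example file names each pattern would pick up, here is a minimal sketch in plain Java. It uses the java.nio glob matcher as a stand-in, not the Hadoop matcher that Pig relies on, so treat it as an approximation that should agree for simple `*` patterns like these. The class and method names (GlobPreview, select) are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: previews which file names a simple glob would select.
// Assumption: java.nio glob semantics stand in for Hadoop's path globbing.
final class GlobPreview {

    /** Returns the candidate names that match the given glob, in input order. */
    static List<String> select(String glob, List<String> candidates) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : candidates) {
            // Each candidate is a bare file name, so '*' can span the whole name.
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "google_2009_12_21.csv", "google_2010_01_21.csv",
            "baidu_2009_11_22.csv", "baidu_2010_01_01.csv");
        System.out.println(select("*google*", files));
        System.out.println(select("*baidu*", files));
    }
}
```

Each pattern selects only the files of one source, which is why two LOAD statements are enough here and no custom loader is needed.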

On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:


Actually I was using another loader and I just tried with PigStorage
(Pig 0.6) and it seems to work too.

If your input file has two columns this will have the expected schema
and data:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);

A: {f1: chararray,f2: chararray,filename: chararray}

If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
will be null.

So in practice the loader loads the data "independently" and then
"casts" it to the schema you provided. That said, I am not claiming
that it is a very clean solution.

Thanks,

Romain

2010/3/2 Mridul Muralidharan <[email protected]>


I am not sure if this will work as you expect.
Depending on which implementation of PigStorage you end up using, it
might exhibit different behavior.

If I am not wrong, currently, for example, if you specify something
like:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);


your code will end up generating a tuple of 4 fields: the fileName
field always being 'null', and the actual filename you inserted through
MyLoader ending up as the 4th field and so not 'seen' by Pig. I am not
sure what happens if you do a join, etc. with this tuple, though!
Essentially, the runtime is not consistent with the script schema.


Note - this is an implementation-specific behavior, which could
probably have been fixed by an implementation-specific hack,
"tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
the last field expected].

As expected, it is brittle code.
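
To make the two tuple shapes concrete, here is a small plain-Java model of the behavior described above. It is only a sketch: it uses java.util.List in place of Pig's Tuple, and the null padding is my reading of the description in this mail, not code taken from Pig. All names in it (TupleShapeModel, padToSchema, appendFileName, setLastField) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model only: a List<Object> stands in for a Pig Tuple.
// Assumption: rows are padded with nulls up to the declared schema size.
final class TupleShapeModel {

    /** Pads a copy of the row with nulls until it has schemaSize fields. */
    static List<Object> padToSchema(List<Object> row, int schemaSize) {
        List<Object> padded = new ArrayList<>(row);
        while (padded.size() < schemaSize) {
            padded.add(null);
        }
        return padded;
    }

    /** Models next.append(fileName): the name lands after the padding. */
    static List<Object> appendFileName(List<Object> row, String fileName) {
        List<Object> result = new ArrayList<>(row);
        result.add(fileName);
        return result;
    }

    /** Models the set-last-field hack: the name overwrites the final slot. */
    static List<Object> setLastField(List<Object> row, String fileName) {
        List<Object> result = new ArrayList<>(row);
        result.set(result.size() - 1, fileName); // requires a non-empty row
        return result;
    }

    public static void main(String[] args) {
        // Two data columns, three fields declared in the AS clause.
        List<Object> padded = padToSchema(Arrays.<Object>asList("v1", "v2"), 3);
        System.out.println(appendFileName(padded, "google_2009_12_21.csv"));
        System.out.println(setLastField(padded, "google_2009_12_21.csv"));
    }
}
```

In this model the appended version has four fields with a null in the third position, while the set-last version has three fields with the file name where the schema expects it.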


From a while back, I remember facing issues with Pig's implicit
conversion to/from bytearray, its implicit project which was
introduced, insertion of nulls to extend to the schema specified (the
above behavior), etc.
So you would become dependent on the implementation changes.


I don't think BinStorage and PigStorage have been written with
inheritance in mind ...


Regards,
Mridul





On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:

Hi,

In Pig 0.6 you can extend the PigStorage and grab the name of the file
with something like this:

@Override
public void bindTo(String fileName, BufferedPositionedInputStream is,
    long offset, long end)
    throws IOException {
  super.bindTo(fileName, is, offset, end);

  // In your case match with a regexp and get the group with the name
  // only (e.g. google, baidu)
  this.fileName = fileName;
}

@Override
public Tuple getNext() throws IOException {
  Tuple next = super.getNext();

  if (next != null) {
    next.append(fileName);
  }

  return next;
}

Then you can group on the name and split on it.
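
For the regexp step mentioned in the comment, a minimal self-contained sketch could look like the following. It assumes the files follow the <source>_<yyyy>_<mm>_<dd>.csv pattern from the original question; the class and method names (SourceNameExtractor, sourceOf) are hypothetical and not part of Pig.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, independent of Pig: pulls the source prefix
// (e.g. "google") out of a path such as "input/path/google_2009_12_21.csv".
final class SourceNameExtractor {

    // Assumed naming convention: <source>_<yyyy>_<mm>_<dd>.csv at the end of the path.
    private static final Pattern FILE_NAME =
        Pattern.compile("([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    /** Returns the source prefix, or null when the path does not match. */
    static String sourceOf(String path) {
        Matcher m = FILE_NAME.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(sourceOf("input/path/google_2009_12_21.csv"));
        System.out.println(sourceOf("input/path/baidu_2010_02_03.csv"));
    }
}
```

In bindTo you would then store sourceOf(fileName) instead of the raw path, so that every tuple carries just the source name to group on.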

Thanks,

Romain

On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:


Hi,
Can Pig recognize the names of the files it is importing? If so, how? I
want to combine them according to filename.

Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv, baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
....

Sort and combine by name, then output two files, google_all.csv and
baidu_all.csv, in a Pig script.


Best Regards,
Jumping Qu







--
Zaki Rahaman

