Even if you're using Amazon Elastic MapReduce you can specify additional named parameters when running scripts. You can put variable placeholders in your script and then pass values for them on the console, or specify defaults. You can also run your scripts in interactive mode so you have complete control over execution. And you can always hardcode the paths when all else fails.
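For example, a minimal sketch (the script name, paths, and parameter names here are just placeholders, not something from this thread):

-- combine.pig: $INPUT1, $INPUT2 and $OUTPUT are filled in at launch time
%default OUTPUT 'output/path'
google = LOAD '$INPUT1' AS (f1:chararray, f2:chararray);
baidu  = LOAD '$INPUT2' AS (f1:chararray, f2:chararray);
STORE google INTO '$OUTPUT/google_all';
STORE baidu  INTO '$OUTPUT/baidu_all';

and then run it with something like:

pig -param INPUT1=input/path/google_2010_01_21.csv -param INPUT2=input/path/baidu_2010_01_01.csv combine.pig

(on Elastic MapReduce the same -param arguments can go in the extra script arguments on the console, as far as I know).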
On Mar 3, 2010, at 8:45 PM, Jumping <[email protected]> wrote:
I am using MapReduce on Amazon, and there is another problem: how do I use two "$INPUT" parameters in a Pig script?
Best Regards,
Jumping Qu
On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman <[email protected]> wrote:
Just curious,
What solution did you use?
On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:
Thanks all of you guys.
Best Regards,
Jumping Qu
On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]> wrote:
In this case, why wouldn't you simply use globbing in your load statements? Something like:

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);
STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';
On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:
Actually I was using another loader, and I just tried with PigStorage (Pig 0.6) and it seems to work too.

If your input file has two columns, this will have the expected schema and data:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
A: {f1: chararray,f2: chararray,filename: chararray}

If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column will be null.

So in practice the loader loads the data "independently" and then "casts" it to the schema you provided. That said, I am not saying that it is a very clean solution.
Thanks,
Romain
2010/3/2 Mridul Muralidharan <[email protected]>
I am not sure if this will work as you expect. Depending on which implementation of PigStorage you end up using, it might exhibit different behavior.

If I am not wrong, currently, for example, if you specify something like:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);

your code will end up generating a tuple of 4 fields - the fileName always being 'null' and the actual filename you inserted through MyLoader ending up being the 4th field (and so not 'seen' by Pig - not sure what happens if you do a join, etc. with this tuple, though! Essentially the runtime is not consistent with the script schema).

Note - this is implementation-specific behavior, which could probably have been fixed by the implementation-specific hack "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is the last field expected]. As expected, it is brittle code.

From a while back, I remember facing issues with Pig's implicit conversion to/from bytearray, the implicit project which was introduced, insertion of nulls to extend to the schema specified (the above behavior), etc. So you would become dependent on the implementation changes.

I don't think BinStorage and PigStorage have been written with inheritance in mind ...
Regards,
Mridul
On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
Hi,
In Pig 0.6 you can extend the PigStorage and grab the name of the file with something like this:
private String fileName; // file currently being read, captured in bindTo()

@Override
public void bindTo(String fileName, BufferedPositionedInputStream is,
                   long offset, long end) throws IOException {
    super.bindTo(fileName, is, offset, end);
    // In your case, match with a regexp and keep only the group with the
    // name (e.g. google, baidu)
    this.fileName = fileName;
}

@Override
public Tuple getNext() throws IOException {
    Tuple next = super.getNext();
    if (next != null) {
        next.append(fileName);
    }
    return next;
}
Then you can group on the name and split on it.
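For instance, a rough sketch of that last step, assuming the loader keeps just the site name (google or baidu) in fileName; the paths and alias names below are only placeholders:

A = LOAD 'input/path/*' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
SPLIT A INTO google_rows IF fileName == 'google', baidu_rows IF fileName == 'baidu';
STORE google_rows INTO 'output/path/google_all';
STORE baidu_rows INTO 'output/path/baidu_all';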
Thanks,
Romain
On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
Hi,

Can Pig recognize the names of the files it is importing? If so, how? I want to combine them according to the filename.

Example:

google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
....

Sort and combine by name, then output two files, google_all.csv and baidu_all.csv, in a pig script.
Best Regards,
Jumping Qu
------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under budget.)
--
Zaki Rahaman