Re: Could identify file name？

Zaki Rahaman Wed, 03 Mar 2010 17:29:14 -0800

Just curious,

What solution did you use?



On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:

Thanks all of you guys.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!

(ADV:Perl -- It's like Java, only it lets you deliver on time and under budget.)

On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman<[email protected]> wrote:

In this case, why wouldn't you simply use globbing in your load statements?

Something like

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);

Store baidu INTO 'output/path/baidu_all';
Store google INTO 'output/path/google_all';
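
To preview which files each glob would pick up before running the script, a small standalone check can help. This is only a sketch: it uses Java's own java.nio glob matcher rather than Hadoop's, so it assumes simple patterns such as *baidu* behave the same for flat file names, and the class name GlobPreview is made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Pig): lists the file names a glob would match. */
final class GlobPreview {
    private GlobPreview() {}

    /** Returns the names matching the glob, in input order. Uses java.nio glob syntax. */
    static List<String> matching(String glob, List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String name : fileNames) {
            // Each name is a single path component, so '*' can span the whole name.
            if (matcher.matches(Paths.get(name))) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // File names taken from the example in the original question.
        List<String> names = Arrays.asList(
            "google_2009_12_21.csv", "baidu_2009_11_22.csv",
            "google_2010_01_21.csv", "baidu_2010_01_01.csv");
        System.out.println("baidu:  " + matching("*baidu*", names));
        System.out.println("google: " + matching("*google*", names));
    }
}
```

If a file name contains neither word, it matches neither pattern and is silently skipped, which is worth checking for on real input.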

On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:

Actually I was using another loader, and I just tried with PigStorage (Pig
0.6) and it seems to work too.
If your input file has two columns, this will give the expected schema and
data:

A = load 'file' USING MyLoader() AS (f1:chararray,
f2:chararray, fileName:chararray);

A: {f1: chararray,f2: chararray,filename: chararray}
If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
will be null.
So in practice the loader loads the data "independently" and then "casts" it
to the schema you provided. That said, I am not claiming this is a very clean
solution.

Thanks,

Romain

2010/3/2 Mridul Muralidharan <[email protected]>


I am not sure if this will work as you expect.

Depending on which implementation of PigStorage you end up using, it
might exhibit different behavior.

If I am not wrong, currently, for example, if you specify something like


A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
fileName:chararray);


your code will end up generating a tuple of 4 fields - the fileName
always being 'null' and the actual filename you inserted through
MyLoader ending up being the 4th field (and so not 'seen' by Pig - not
sure what happens if you do a join, etc. with this tuple though!
Essentially, the runtime is not consistent with the script schema).


Note - this is implementation-specific behavior, which could probably
have been fixed by the implementation-specific hack
"tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
the last field expected].

As expected, it is brittle code.


From a while back, I remember facing issues with Pig's implicit
conversion to/from bytearray, its implicit project which was introduced,
insertion of nulls to extend to the schema specified (the above behavior),
etc.
So you would become dependent on the impl changes.


I don't think BinStorage and PigStorage have been written with
inheritance in mind ...


Regards,
Mridul





On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:

Hi,

In Pig 0.6 you can extend the PigStorage and grab the name of the file with
something like this:

  @Override
  public void bindTo(String fileName, BufferedPositionedInputStream is,
      long offset, long end)
      throws IOException {
    super.bindTo(fileName, is, offset, end);

    // In your case, match with a regexp and get the group with the
    // name only (e.g. google, baidu)
    this.fileName = fileName;
  }

  @Override
  public Tuple getNext() throws IOException {
    Tuple next = super.getNext();

    if (next != null) {
      next.append(fileName);
    }

    return next;
  }

Then you can group on the name and split on it.
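
The "match with a regexp" step mentioned in the bindTo comment could look roughly like the sketch below. It is independent of Pig and only assumes the <source>_<yyyy>_<mm>_<dd>.csv naming seen in the original question; the class name SourceNames and the method extract are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Pig): pulls the source name out of a file path. */
final class SourceNames {
    // Assumed naming convention from the thread: <source>_<yyyy>_<mm>_<dd>.csv
    private static final Pattern SOURCE =
        Pattern.compile("([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    private SourceNames() {}

    /** Returns the source name (e.g. "google"), or null if the path does not fit the convention. */
    static String extract(String path) {
        Matcher m = SOURCE.matcher(path);
        // find() rather than matches(), so leading directories in the path are ignored.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("input/path/google_2009_12_21.csv"));
        System.out.println(extract("input/path/baidu_2010_02_03.csv"));
    }
}
```

In the loader, the result of extract(fileName) would be stored in place of the raw file name, so every tuple carries just "google" or "baidu". Returning null for unexpected names is one possible choice; failing fast with an exception may be preferable if stray files should not be loaded at all.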

Thanks,

Romain

On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:

Hi,
Can Pig recognize the names of the files it is importing? If so, how? I want
to combine them according to file name.

Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
....


Sort and combine by name, then output two files: google_all.csv,
baidu_all.csv in a Pig script.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under budget.)




--
Zaki Rahaman
