Just curious,

What solution did you use?

On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:

Thanks, all of you.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)


On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]> wrote:

In this case, why wouldn't you simply use globbing in your load statements?
Something like:

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);

STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';

On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:

Actually, I was using another loader. I just tried with PigStorage (Pig
0.6), and it seems to work too.

If your input file has two columns this will have the expected schema and
data:

A = load 'file' USING MyLoader() AS (f1:chararray,
f2:chararray, fileName:chararray);

A: {f1: chararray,f2: chararray,filename: chararray}

If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
will be null.

So in practice the loader loads the data "independently" and then "casts"
it to the schema you provided. That said, I am not claiming this is a very
clean solution.

Thanks,

Romain

2010/3/2 Mridul Muralidharan <[email protected]>


I am not sure if this will work as you expect.
Depending on which implementation of PigStorage you end up using, it
might exhibit different behavior.

If I am not wrong, currently, for example, if you specify something
like:

A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
fileName:chararray);


your code will end up generating a tuple of 4 fields: the fileName field
is always 'null', and the actual filename you inserted through MyLoader
ends up as the 4th field, so it is not 'seen' by Pig. I am not sure what
happens if you do a join, etc. with this tuple, though. Essentially, the
runtime is not consistent with the script schema.


Note: this is implementation-specific behavior, which could probably
have been fixed by an implementation-specific hack,
"tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
the last field expected].

As expected, it is brittle code.


From a while back, I remember facing issues with Pig's implicit
conversion to/from bytearray, the implicit project that was introduced,
the insertion of nulls to extend a tuple to the specified schema (the
behavior above), etc.
So you would become dependent on implementation changes.


I don't think BinStorage and PigStorage were written with
inheritance in mind ...


Regards,
Mridul





On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
Hi,

In Pig 0.6 you can extend PigStorage and grab the name of the file with
something like this:

  @Override
  public void bindTo(String fileName, BufferedPositionedInputStream is,
      long offset, long end) throws IOException {
    super.bindTo(fileName, is, offset, end);

    // In your case, match with a regexp and keep only the group holding
    // the name (e.g. google, baidu).
    this.fileName = fileName;
  }

  @Override
  public Tuple getNext() throws IOException {
    Tuple next = super.getNext();

    if (next != null) {
      next.append(fileName);
    }

    return next;
  }

Then you can group on the name and split on it.
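
For the regexp step mentioned in the bindTo comment, a minimal sketch could look like the following. It is only an illustration: the class name FileNameSource, the method sourceOf, and the assumed naming convention <source>_<yyyy>_<mm>_<dd>.csv are hypothetical and not part of Pig.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: derives a grouping key such as
// "google" or "baidu" from a path like "input/path/google_2009_12_21.csv".
final class FileNameSource {
    // Assumed naming convention: <source>_<yyyy>_<mm>_<dd>.csv
    private static final Pattern NAME =
        Pattern.compile("([^/_]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    private FileNameSource() {}

    // Returns the source prefix, or null when the name does not match
    // the assumed convention.
    static String sourceOf(String fileName) {
        if (fileName == null) {
            return null;
        }
        Matcher m = NAME.matcher(fileName);
        return m.find() ? m.group(1) : null;
    }
}
```

Inside bindTo you would then store FileNameSource.sourceOf(fileName) instead of the raw path, so that every file from the same source carries the same key.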

Thanks,

Romain

On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:

Hi,
Can Pig recognize the names of the files it is loading? If so, how?
I want to combine the files according to their file names.

Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
....

Sort and combine them by name, then output two files, google_all.csv and
baidu_all.csv, from a Pig script.


Best Regards,
Jumping Qu








--
Zaki Rahaman
