Just curious,
What solution did you use?
On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:
Thanks all of you guys.
Best Regards,
Jumping Qu
------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)
On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]> wrote:
In this case, why wouldn't you simply use globbing in your LOAD
statements? Something like:
baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);
STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';
On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:
Actually I was using another loader, and I just tried with PigStorage
(Pig 0.6) and it seems to work too.
If your input file has two columns, this gives the expected schema and
data:
A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
A: {f1: chararray,f2: chararray,filename: chararray}
If you do "tuple.set(tuple.getLength() - 1, fileName)" instead, your
third column will be null.
So in practice the loader loads the data "independently" and Pig then
"casts" it to the schema you provided. That said, I am not claiming that
it is a very clean solution.
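To make the append-versus-set difference concrete, here is a small plain-Java sketch. It does not call any Pig API: a java.util.List is a hypothetical stand-in for a tuple, and the names TupleSketch, appendName, setLast and fitToSchema are made up for illustration. It assumes, as described above, that the loader returns only the two columns found in the file and that the row is afterwards padded with nulls to the width of the AS schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TupleSketch {
    private TupleSketch() {}

    // Stand-in for next.append(fileName): adds one extra column at the end.
    static List<Object> appendName(List<?> row, String fileName) {
        List<Object> out = new ArrayList<Object>(row);
        out.add(fileName);
        return out;
    }

    // Stand-in for tuple.set(size - 1, fileName): overwrites the last existing column.
    static List<Object> setLast(List<?> row, String fileName) {
        List<Object> out = new ArrayList<Object>(row);
        out.set(out.size() - 1, fileName);
        return out;
    }

    // Assumed "cast to schema" step: pad short rows with nulls up to the schema width.
    static List<Object> fitToSchema(List<?> row, int schemaWidth) {
        List<Object> out = new ArrayList<Object>(row);
        while (out.size() < schemaWidth) {
            out.add(null);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> loaded = Arrays.asList("a", "b"); // two columns read from the file
        System.out.println(fitToSchema(appendName(loaded, "google"), 3));
        System.out.println(fitToSchema(setLast(loaded, "google"), 3));
    }
}
```

Under these assumptions, appending keeps both data columns and puts the file name third, while setting the last index overwrites f2 and leaves the third column to be filled with null.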
Thanks,
Romain
2010/3/2 Mridul Muralidharan <[email protected]>
I am not sure this will work as you expect.
Depending on which implementation of PigStorage you end up using, it
might exhibit different behavior.
If I am not wrong, currently, for example, if you specify something
like:
A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
your code will end up generating a tuple of 4 fields: fileName will
always be null, and the actual filename you inserted through MyLoader
will end up as the 4th field (and so will not be 'seen' by Pig). I am
not sure what happens if you do a join, etc. with this tuple, though!
Essentially, the runtime is not consistent with the script schema.
Note: this is implementation-specific behavior, which could probably
have been fixed by the implementation-specific hack
"tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
the last field expected].
As expected, it is brittle code.
From a while back, I remember facing issues with Pig's implicit
conversion to/from bytearray, its implicit project (which was
introduced), the insertion of nulls to extend to the schema specified
(the above behavior), etc.
So you would become dependent on implementation changes.
I don't think BinStorage and PigStorage have been written with
inheritance in mind ...
Regards,
Mridul
On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
Hi,
In Pig 0.6 you can extend PigStorage and grab the name of the file with
something like this:
// Field on your PigStorage subclass that remembers the current input file.
private String fileName;

@Override
public void bindTo(String fileName, BufferedPositionedInputStream is,
        long offset, long end) throws IOException {
    super.bindTo(fileName, is, offset, end);
    // In your case, match with a regexp and keep only the group
    // with the name (e.g. google, baidu).
    this.fileName = fileName;
}

@Override
public Tuple getNext() throws IOException {
    Tuple next = super.getNext();
    if (next != null) {
        // Append the file name as an extra last column.
        next.append(fileName);
    }
    return next;
}
Then you can group on the name and split on it.
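The regexp step mentioned in the bindTo comment could look roughly like the plain-Java sketch below. It is only an illustration: the class name SourceName, the method fromFileName, and the assumed name layout <source>_YYYY_MM_DD.csv (taken from the example file names in the original question) are assumptions and not part of Pig.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SourceName {
    // Assumed layout: <source>_YYYY_MM_DD.csv, e.g. google_2009_12_21.csv
    private static final Pattern NAME =
            Pattern.compile("^([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    private SourceName() {}

    // Returns the leading source name (e.g. "google"), or empty if the name does not match.
    static Optional<String> fromFileName(String fileName) {
        // Drop any directory part so that full paths work too.
        String base = fileName.substring(fileName.lastIndexOf('/') + 1);
        Matcher m = NAME.matcher(base);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromFileName("google_2009_12_21.csv"));
        System.out.println(fromFileName("input/path/baidu_2010_02_03.csv"));
        System.out.println(fromFileName("notes.txt"));
    }
}
```

In the loader, bindTo could then store the extracted name (falling back to the raw file name when nothing matches) rather than the full path, so that the appended column holds just google or baidu.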
Thanks,
Romain
On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
Hi,
Can Pig recognize the names of the files it is importing? If so, how?
I want to combine the files according to their file names.
Example:
google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
....
Sort and combine by name, then output two files, google_all.csv and
baidu_all.csv, in a Pig script.
Best Regards,
Jumping Qu
--
Zaki Rahaman