Re: Could identify file name？

Romain Rigaux Wed, 03 Mar 2010 10:22:43 -0800

Actually I was using another loader and I just tried with PigStorage (Pig
0.6) and it seems to work too.


If your input file has two columns this will have the expected schema and
data:

A = load 'file' USING MyLoader() AS (f1:chararray,
f2:chararray, fileName:chararray);

A: {f1: chararray,f2: chararray,fileName: chararray}

If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
will be null.

So in practice the loader loads the data "independently", and the result is
then "cast" to the schema you provided. That said, I am not claiming that this
is a very clean solution.
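
To make the two behaviours discussed in this thread concrete, here is a toy
model in plain Java. It is only a sketch and not Pig's code: the class
SchemaFitSketch and both helpers are hypothetical, rows are plain lists, and
"casting to the schema" is simplified to padding with nulls or dropping the
extra trailing fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SchemaFitSketch {
    /** Toy stand-in for schema fitting: pad with nulls or drop extra fields. */
    static List<Object> fitToSchema(List<Object> row, int schemaSize) {
        int keep = Math.min(row.size(), schemaSize);
        List<Object> out = new ArrayList<>(row.subList(0, keep));
        while (out.size() < schemaSize) {
            out.add(null);
        }
        return out;
    }

    /** Hypothetical loader step: append the file name to the parent's row. */
    static List<Object> appendFileName(List<Object> parsed, String fileName) {
        List<Object> out = new ArrayList<>(parsed);
        out.add(fileName);
        return out;
    }

    public static void main(String[] args) {
        List<Object> raw = Arrays.<Object>asList("a", "b");

        // Case 1: the parent loader returns only the two parsed columns,
        // so the appended name lands in the third schema slot.
        System.out.println(fitToSchema(appendFileName(raw, "google"), 3));
        // prints [a, b, google]

        // Case 2: the parent loader has already padded the row to the
        // schema width, so the appended name becomes a fourth field that
        // the three-column schema never exposes.
        List<Object> padded = fitToSchema(raw, 3);
        System.out.println(fitToSchema(appendFileName(padded, "google"), 3));
        // prints [a, b, null]
    }
}
```

Case 1 matches what I see with Pig 0.6; case 2 is the situation Mridul
describes below. Which one you get depends on what the parent loader returns,
which is why the approach is fragile.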

Thanks,

Romain

2010/3/2 Mridul Muralidharan

>
> I am not sure if this will work as you expect.
> Depending on which implementation of PigStorage you end up using, it
> might exhibit different behavior.
>
> If I am not wrong, currently, for example, if you specify something like :
>
> A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
> fileName:chararray);
>
>
> your code will end up generating a tuple of 4 fields: the fileName field
> is always 'null', and the actual filename you inserted through MyLoader
> ends up as the 4th field (and so is not 'seen' by Pig; I am not sure what
> happens if you do a join, etc. with this tuple, though. Essentially, the
> runtime is not consistent with the script schema).
>
>
> Note: this is implementation-specific behavior, which could probably
> have been fixed by an implementation-specific hack such as
> "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
> the last field expected].
>
> As expected, it is brittle code.
>
>
> From a while back, I remember facing issues with Pig's implicit
> conversion to/from bytearray, its implicit project, which was introduced,
> the insertion of nulls to extend a tuple to the specified schema (the
> above behavior), etc.
> So you would become dependent on implementation changes.
>
>
> I don't think BinStorage and PigStorage were written with
> inheritance in mind ...
>
>
> Regards,
> Mridul
>
>
>
>
>
> On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
> > Hi,
> >
> > In Pig 0.6 you can extend PigStorage and grab the name of the file with
> > something like this:
> >
> >    @Override
> >    public void bindTo(String fileName, BufferedPositionedInputStream is,
> >        long offset, long end)
> >        throws IOException {
> >      super.bindTo(fileName, is, offset, end);
> >
> >      // In your case, match with a regexp and get the group with the
> >      // name only (e.g. google, baidu).
> >      this.fileName = fileName;
> >    }
> >
> >    @Override
> >    public Tuple getNext() throws IOException {
> >      Tuple next = super.getNext();
> >
> >      if (next != null) {
> >        next.append(fileName);
> >      }
> >
> >      return next;
> >    }
> >
> > Then you can group on the name and split on it.
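
The regexp step mentioned in the bindTo comment above could look like the
sketch below. It assumes the file names follow the
<source>_<yyyy>_<mm>_<dd>.csv pattern from the original question; the class
SourceNameExtractor and the method sourceOf are made up for illustration and
are not part of Pig.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SourceNameExtractor {
    // Assumed naming convention: <source>_<yyyy>_<mm>_<dd>.csv, optionally
    // preceded by a directory path. The source part must not contain '/'
    // or '_'.
    private static final Pattern FILE_NAME =
        Pattern.compile("([^/_]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    /** Returns the source name (e.g. "google"), or null if nothing matches. */
    static String sourceOf(String path) {
        Matcher m = FILE_NAME.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(sourceOf("/data/in/google_2009_12_21.csv")); // prints google
        System.out.println(sourceOf("baidu_2010_02_03.csv"));           // prints baidu
        System.out.println(sourceOf("notes.txt"));                      // prints null
    }
}
```

Inside the loader, the assignment in bindTo would then presumably become
something like "this.fileName = SourceNameExtractor.sourceOf(fileName)". A
source name that itself contains an underscore would need a different pattern.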
> >
> > Thanks,
> >
> > Romain
> >
> > On Mon, Mar 1, 2010 at 3:09 AM, Jumping wrote:
> >
> >> Hi,
> >> Can Pig recognize the names of the files it is importing? If so, how? I
> >> want to combine them according to filename.
> >>
> >> Example:
> >> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> >> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
> >>
> >> Sort and combine by name, then output two files, google_all.csv and
> >> baidu_all.csv, from a Pig script.
> >>
> >>
> >> Best Regards,
> >> Jumping Qu
> >>
> >> ------
> >> Don't tell me how many enemies we have, but where they are!
> >> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
> >> budget.)
> >>
>
>
