Re: Could identify file name？

Jumping Wed, 03 Mar 2010 17:45:46 -0800

I am using MapReduce on Amazon, and I have another problem: how to use
two "$INPUT" parameters in a Pig script.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)


On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman <[email protected]> wrote:

> Just curious,
>
> What solution did you use?
>
> Sent from my iPhone
>
>
> On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:
>
>  Thanks all of you guys.
>>
>>
>> Best Regards,
>> Jumping Qu
>>
>> ------
>> Don't tell me how many enemies we have, but where they are!
>> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
>> budget.)
>>
>>
>> On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]>
>> wrote:
>>
>>> In this case, why wouldn't you simply use globbing in your load
>>> statements? Something like
>>>
>>> baidu = LOAD 'input/path/*baidu*' AS (schema);
>>> google = LOAD 'input/path/*google*' AS (schema);
>>>
>>> Store baidu INTO 'output/path/baidu_all';
>>> Store google INTO 'output/path/google_all';
>>>
>>> On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]>
>>> wrote:
>>>
>>>> Actually I was using another loader and I just tried with PigStorage
>>>> (Pig 0.6) and it seems to work too.
>>>>
>>>> If your input file has two columns this will have the expected schema
>>>> and data:
>>>>
>>>> A = load 'file' USING MyLoader() AS (f1:chararray,
>>>> f2:chararray, fileName:chararray);
>>>>
>>>> A: {f1: chararray,f2: chararray,filename: chararray}
>>>>
>>>> If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
>>>> will be null.
>>>>
>>>> So in practice the loader loads the data "independently" and then
>>>> "casts" it to the schema you provided. That said, I am not claiming
>>>> that it is a very clean solution.
>>>>
>>>> Thanks,
>>>>
>>>> Romain
>>>>
>>>> 2010/3/2 Mridul Muralidharan <[email protected]>
>>>>
>>>>
>>>>> I am not sure if this will work as you expect.
>>>>> Depending on which implementation of PigStorage you end up using, it
>>>>> might exhibit different behavior.
>>>>>
>>>>> If I am not wrong, currently, for example, if you specify something
>>>>> like:
>>>>>
>>>>> A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
>>>>> fileName:chararray);
>>>>>
>>>>>
>>>>> your code will end up generating a tuple of 4 fields - the fileName
>>>>> always being 'null' and the actual filename you inserted through
>>>>> MyLoader ending up being the 4th field (and so not 'seen' by Pig; I am
>>>>> not sure what happens if you do a join, etc. with this tuple, though).
>>>>> Essentially, the runtime is not consistent with the script schema.
>>>>>
>>>>>
>>>>> Note - this is implementation-specific behavior, which could probably
>>>>> have been fixed by an implementation-specific hack such as
>>>>> "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
>>>>> the last field expected].
>>>>>
>>>>> As expected, it is brittle code.
>>>>>
>>>>>
>>>>> From a while back, I remember facing issues with Pig's implicit
>>>>> conversion to/from bytearray, its implicit project which was
>>>>> introduced, insertion of nulls to extend to the schema specified (the
>>>>> above behavior), etc.
>>>>> So you would become dependent on the implementation changes.
>>>>>
>>>>>
>>>>> I don't think BinStorage and PigStorage have been written with
>>>>> inheritance in mind ...
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In Pig 0.6 you can extend the PigStorage and grab the name of the
>>>>>> file with something like this:
>>>>>>
>>>>>>  @Override
>>>>>>  public void bindTo(String fileName, BufferedPositionedInputStream is,
>>>>>>      long offset, long end) throws IOException {
>>>>>>    super.bindTo(fileName, is, offset, end);
>>>>>>
>>>>>>    // In your case match with a regexp and get the group with the
>>>>>>    // name only (e.g. google, baidu)
>>>>>>    this.fileName = fileName;
>>>>>>  }
>>>>>>
>>>>>>  @Override
>>>>>>  public Tuple getNext() throws IOException {
>>>>>>    Tuple next = super.getNext();
>>>>>>
>>>>>>    if (next != null) {
>>>>>>      next.append(fileName);
>>>>>>    }
>>>>>>
>>>>>>    return next;
>>>>>>  }
>>>>>>
>>>>>> Then you can group on the name and split on it.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Romain
>>>>>>
>>>>>> On Mon, Mar 1, 2010 at 3:09 AM, Jumping<[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Can Pig recognize the names of the files it is loading? If so, how?
>>>>>>> I want to combine them according to file name.
>>>>>>>
>>>>>>> Example:
>>>>>>> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
>>>>>>> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
>>>>>>> ....
>>>>>>>
>>>>>>> Sort and combine by name, then output two files: google_all.csv and
>>>>>>> baidu_all.csv, in a Pig script.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Jumping Qu
>>>>>>>
>>>>>>> ------
>>>>>>> Don't tell me how many enemies we have, but where they are!
>>>>>>> (ADV:Perl -- It's like Java, only it lets you deliver on time and
>>>>>>> under budget.)
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Zaki Rahaman
>>>
>>>
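
Romain's comment in the loader above suggests matching the file name with
a regexp and keeping only the name group (e.g. google, baidu). A minimal
sketch of that step might look like the following. It assumes Java 8 or
later and file names shaped like the examples in the original question
(name_yyyy_mm_dd.csv); the class name SourceName, the method fromPath, and
the pattern itself are hypothetical and not part of Pig.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SourceName {
    // Hypothetical pattern: <name>_<yyyy>_<mm>_<dd>.csv, modeled on the
    // example file names in the thread (e.g. google_2009_12_21.csv).
    private static final Pattern FILE_PATTERN =
        Pattern.compile("^([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    /** Returns the leading name, or empty if the file name does not match. */
    public static Optional<String> fromPath(String path) {
        // Keep only the part after the last '/', so full paths work too.
        String base = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = FILE_PATTERN.matcher(base);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromPath("input/path/google_2009_12_21.csv").orElse("unknown"));
        System.out.println(fromPath("baidu_2010_02_03.csv").orElse("unknown"));
    }
}
```

In a loader along the lines Romain describes, bindTo could store the
result of fromPath(fileName) instead of the raw file name, so that the
appended column holds only the short name to group on.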
