Re: Could identify file name？

Romain Rigaux Wed, 03 Mar 2010 22:48:17 -0800

Or you can just call the script twice with:

$INPUT= 'input/path/*baidu*'
$OUTPUT='output/path/baidu_all'


then

$INPUT= 'input/path/*google*'
$OUTPUT='output/path/google_all'

Thanks,

Romain

On Wed, Mar 3, 2010 at 5:58 PM, Zaki Rahaman <[email protected]> wrote:

> Even if you're using Amazon Elastic MapReduce, you can specify additional
> named parameters when running scripts. You can put variable placeholders
> in your script and then pass values on the console, or specify defaults. Or
> you can always run your scripts in interactive mode so you have complete
> control over execution. And you can always hardcode when all else fails.
>
> Sent from my iPhone
>
>
> On Mar 3, 2010, at 8:45 PM, Jumping <[email protected]> wrote:
>
>> I am using MapReduce on Amazon, and there is another problem: how to
>> use two "$INPUT" parameters in a Pig script.
>>
>> Best Regards,
>> Jumping Qu
>>
>> ------
>> Don't tell me how many enemies we have, but where they are!
>> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
>> budget.)
>>
>>
>> On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman <[email protected]>
>> wrote:
>>
>>> Just curious,
>>>
>>> What solution did you use?
>>>
>>> Sent from my iPhone
>>>
>>>
>>> On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:
>>>
>>>> Thanks, all of you guys.
>>>>
>>>> Best Regards,
>>>> Jumping Qu
>>>>
>>>>
>>>> On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]>
>>>> wrote:
>>>>
>>>>> In this case, why wouldn't you simply use globbing in your load
>>>>> statements? Something like
>>>>>
>>>>> baidu = LOAD 'input/path/*baidu*' AS (schema);
>>>>> google = LOAD 'input/path/*google*' AS (schema);
>>>>>
>>>>> Store baidu INTO 'output/path/baidu_all';
>>>>> Store google INTO 'output/path/google_all';
>>>>>
>>>>> On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:
>>>>>
>>>>>> Actually, I was using another loader, and I just tried with PigStorage
>>>>>> (Pig 0.6) and it seems to work too.
>>>>>>
>>>>>> If your input file has two columns, this will have the expected schema
>>>>>> and data:
>>>>>>
>>>>>> A = load 'file' USING MyLoader() AS (f1:chararray,
>>>>>> f2:chararray, fileName:chararray);
>>>>>>
>>>>>> A: {f1: chararray,f2: chararray,filename: chararray}
>>>>>>
>>>>>> If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
>>>>>> will be null.
>>>>>>
>>>>>> So in practice the loader loads the data "independently" and then "casts"
>>>>>> it to the schema you provided. That said, I am not claiming that it is a
>>>>>> very clean solution.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Romain
>>>>>>
>>>>>> 2010/3/2 Mridul Muralidharan <[email protected]>
>>>>>>
>>>>>>
>>>>>>> I am not sure if this will work as you expect.
>>>>>>> Depending on which implementation of PigStorage you end up using, it
>>>>>>> might exhibit different behavior.
>>>>>>>
>>>>>>> If I am not wrong, currently, for example, if you specify something like:
>>>>>>>
>>>>>>> A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
>>>>>>> fileName:chararray);
>>>>>>>
>>>>>>>
>>>>>>> your code will end up generating a tuple of 4 fields, with fileName
>>>>>>> always being 'null' and the actual file name you inserted through
>>>>>>> MyLoader ending up as the 4th field (and so not 'seen' by Pig; I am not
>>>>>>> sure what happens if you do a join, etc. with this tuple, though!
>>>>>>> Essentially, the runtime is not consistent with the script schema).
>>>>>>>
>>>>>>>
>>>>>>> Note: this is implementation-specific behavior, which could probably
>>>>>>> have been fixed by the implementation-specific hack
>>>>>>> "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
>>>>>>> the last field expected].
>>>>>>>
>>>>>>> As expected, it is brittle code.
>>>>>>>
>>>>>>>
>>>>>>> From a while back, I remember facing issues with Pig's implicit
>>>>>>> conversion to/from bytearray, its implicit project which was introduced,
>>>>>>> insertion of nulls to extend to the schema specified (the above
>>>>>>> behavior), etc.
>>>>>>> So you would become dependent on the implementation changes.
>>>>>>>
>>>>>>>
>>>>>>> I don't think BinStorage and PigStorage were written with
>>>>>>> inheritance in mind ...
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> In Pig 0.6 you can extend the PigStorage and grab the name of the file
>>>>>>>> with something like this:
>>>>>>>>
>>>>>>>> @Override
>>>>>>>> public void bindTo(String fileName, BufferedPositionedInputStream is,
>>>>>>>>     long offset, long end) throws IOException {
>>>>>>>>   super.bindTo(fileName, is, offset, end);
>>>>>>>>
>>>>>>>>   // In your case match with a regexp and get the group with the
>>>>>>>>   // name only (e.g. google, baidu)
>>>>>>>>   this.fileName = fileName;
>>>>>>>> }
>>>>>>>>
>>>>>>>> @Override
>>>>>>>> public Tuple getNext() throws IOException {
>>>>>>>>  Tuple next = super.getNext();
>>>>>>>>
>>>>>>>>  if (next != null) {
>>>>>>>>    next.append(fileName);
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  return next;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then you can group on the name and split on it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Romain
>>>>>>>>
>>>>>>>> On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Could Pig recognize the names of the files it is importing? If so, how?
>>>>>>>>> I want to combine them according to file name.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
>>>>>>>>> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
>>>>>>>>>
>>>>>>>>> Sort and combine by name, then output two files: google_all.csv,
>>>>>>>>> baidu_all.csv in a Pig script.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Jumping Qu
>>>>>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>>>
>>>>>
>>>>>
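
The bindTo() override quoted above leaves the regexp step as a comment ("match with a regexp and get the group with the name only"). A minimal sketch of that step in plain Java, assuming the <source>_<yyyy>_<mm>_<dd>.csv naming seen in the example file names. The class SourceName and the method sourceOf are hypothetical names for illustration and are not part of Pig:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SourceName {
    // Assumed naming convention, taken from the example file names in the
    // thread: <source>_<yyyy>_<mm>_<dd>.csv, e.g. google_2009_12_21.csv
    private static final Pattern NAME =
        Pattern.compile("([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    /** Returns the leading source name, or null if the path does not match. */
    static String sourceOf(String path) {
        Matcher m = NAME.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample paths built from the file names mentioned in the thread.
        System.out.println(sourceOf("input/path/google_2009_12_21.csv"));
        System.out.println(sourceOf("input/path/baidu_2010_02_03.csv"));
    }
}
```

Inside a custom loader, the value returned by sourceOf(fileName) could be stored in the field instead of the full file name, so that the appended column holds only "google" or "baidu". Returning null for non-matching names is a design choice of this sketch; a real loader might prefer to fail loudly.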
