Is the globbing feature making it into the AvroStorage rewrite?

Russell Jurney twitter.com/rjurney


On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[email protected]> wrote:

> To answer myself again, I compiled Pig 0.11 and Piggybank, and it's working 
> very well now, globbing seems to be fully supported!
>
> Bart Verwilst wrote on 26.11.2012 15:33:
>> To answer myself, could this be part of the solution? :
>>
>> https://issues.apache.org/jira/browse/PIG-2492
>>
>> Guess I'll have to wait for 0.11 then?
>>
>> Bart Verwilst wrote on 26.11.2012 14:19:
>>> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/*' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>>> Schema for avro unknown.
>>>
>>> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
>>>
>>> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>>> (id: long,value: chararray,pkey: chararray)}}
>>>
>>> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
>>> Found 1 items
>>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13 /test/2012-11-25.avro
>>>
>>> Cheolsoo Park wrote on 26.11.2012 10:45:
>>>> Hi,
>>>>
>>>>>> Invalid field projection. Projected field [tracetype] does not exist.
>>>>
>>>> The error indicates that "tracetype" doesn't exist in the Pig schema of
>>>> the relation "avro". AvroStorage automatically converts the Avro schema
>>>> to a Pig schema during the load. Although you have "tracetype" in your
>>>> Avro schema, "tracetype" doesn't exist in the generated Pig schema for
>>>> whatever reason.
>>>>
>>>> Can you please try to "describe avro"? You can replace group and dump
>>>> commands with describe in your Pig script. This will show you what the Pig
>>>> schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find
>>>> out why it doesn't. It could be because the schema of .avro files is not
>>>> the same or because there is a bug in AvroStorage, etc.
>>>>
>>>>>> Maybe globbing with [] doesnt work, but wildcard works?
>>>>
>>>> You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
>>>> path globbing doesn't support '[ ]'. But the above error (Projected field
>>>> [tracetype] does not exist) is not because of this. URISyntaxException is
>>>> what you will get because of '[ ]'.
>>>>
>>>> Thanks,
>>>> Cheolsoo
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[email protected]> wrote:
>>>>
>>>>> Just tried this:
>>>>>
>>>>>
>>>>> ----------------------------------------------------
>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>
>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>
>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>>>>
>>>>> groups = group avro by tracetype;
>>>>>
>>>>> dump groups;
>>>>> ----------------------------------------------------
>>>>>
>>>>> gave me:
>>>>>
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> Pig Stack Trace
>>>>> ---------------
>>>>> ERROR 1025:
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias groups
>>>>>        at org.apache.pig.PigServer.openIterator(PigServer.java:862)
>>>>>        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
>>>>>        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
>>>>>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
>>>>>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
>>>>>        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>        at org.apache.pig.Main.run(Main.java:555)
>>>>>        at org.apache.pig.Main.main(Main.java:111)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>> Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias groups
>>>>>        at org.apache.pig.PigServer.storeEx(PigServer.java:961)
>>>>>        at org.apache.pig.PigServer.store(PigServer.java:924)
>>>>>        at org.apache.pig.PigServer.openIterator(PigServer.java:837)
>>>>>        ... 12 more
>>>>> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 1025:
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>        at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:183)
>>>>>        at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:166)
>>>>>        at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
>>>>>        at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:207)
>>>>>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>>>>>        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>>>>>        at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:101)
>>>>>        at org.apache.pig.newplan.logical.relational.LOCogroup.accept(LOCogroup.java:235)
>>>>>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>>>>>        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>>>>>        at org.apache.pig.PigServer$Graph.compile(PigServer.java:1621)
>>>>>        at org.apache.pig.PigServer$Graph.compile(PigServer.java:1616)
>>>>>        at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1339)
>>>>>        at org.apache.pig.PigServer.storeEx(PigServer.java:956)
>>>>>        ... 14 more
>>>>> ================================================================================
>>>>>
>>>>>
>>>>> Maybe globbing with [] doesnt work, but wildcard works? No idea why i get
>>>>> the error above though..
>>>>>
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Bart
>>>>>
>>>>> Cheolsoo Park wrote on 25.11.2012 15:33:
>>>>>
>>>>>> Hi Bart,
>>>>>>
>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
>>>>>> gives me:
>>>>>> Schema for avro unknown.
>>>>>>
>>>>>> This should work. The error that you're getting is not from AvroStorage
>>>>>> but from PigServer.
>>>>>>
>>>>>> grep -r "Schema for .* unknown" *
>>>>>> src/org/apache/pig/PigServer.java:
>>>>>> System.out.println("Schema for " + alias + " unknown.");
>>>>>> ...
>>>>>>
>>>>>> It looks like you have an error in your Pig script. Can you please
>>>>>> provide your Pig script and the schema of your avro files that reproduce
>>>>>> the error?
>>>>>>
>>>>>> Thanks,
>>>>>> Cheolsoo
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 25, 2012 at 1:02 AM, Bart Verwilst <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've tried loading a csv with PigStorage(), getting this:
>>>>>>>
>>>>>>>
>>>>>>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',');
>>>>>>> describe txt;
>>>>>>>
>>>>>>> Schema for txt unknown.
>>>>>>>
>>>>>>> Maybe this is because of it being a csv, so a schema is hard to figure
>>>>>>> out..
>>>>>>>
>>>>>>> Any other suggestions? Our whole hadoop setup is built around being able
>>>>>>> to selectively load avro files to run our jobs on, if this doesn't work
>>>>>>> then we're pretty much screwed.. :)
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Bart
>>>>>>>
>>>>>>> Russell Jurney wrote on 24.11.2012 20:23:
>>>>>>>
>>>>>>>> I suspect the problem is AvroStorage, not globbing. Try this with
>>>>>>>> PigStorage.
>>>>>>>>
>>>>>>>> Russell Jurney twitter.com/rjurney
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Thanks for your suggestion!
>>>>>>>>> I switched my avro variable to avro = load '$INPUT' USING AvroStorage();
>>>>>>>>>
>>>>>>>>> However I get the same results this way:
>>>>>>>>>
>>>>>>>>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
>>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>>>>>>> <snip>
>>>>>>>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>>>>>>>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>>>>>>>>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>>>>>>>>> (id: long,value: chararray,pkey: chararray)}}
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> $ pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" avro-test.pig
>>>>>>>>>
>>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>>>>>>> <snip>
>>>>>>>>> 2012-11-24 14:11:17,309 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
>>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> $ pig -p INPUT='/data/2012/trace_ejb3/2012-01-0[12].avro' avro-test.pig
>>>>>>>>>
>>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>>>>>>> <snip>
>>>>>>>>> 2012-11-24 14:12:05,085 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
>>>>>>>>> Details at logfile: /var/lib/hadoop-hdfs/pig_1353762722742.log
>>>>>>>>>
>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
>>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Deepak Tiwari wrote on 24.11.2012 00:41:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I don't have a system to test it on right now, but I have been passing it
>>>>>>>>>> using the -p parameter and it works.
>>>>>>>>>>
>>>>>>>>>> Change the line to accept parameters, like: avro = load '$INPUT' USING AvroStorage();
>>>>>>>>>>
>>>>>>>>>> bin/pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" <scriptName>
>>>>>>>>>>
>>>>>>>>>> I think if you don't use double quotes, the expansion is done by the OS.
>>>>>>>>>>
>>>>>>>>>> Please let us know if it doesn't work...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 23, 2012 at 12:45 PM, Bart Verwilst <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I have the following files on HDFS:
>>>>>>>>>>>
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup   22989179 2012-11-22 11:17 /data/2012/trace_ejb3/2012-01-01.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  240551819 2012-11-22 14:27 /data/2012/trace_ejb3/2012-01-02.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  324464635 2012-11-22 18:28 /data/2012/trace_ejb3/2012-01-03.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  345526418 2012-11-22 21:30 /data/2012/trace_ejb3/2012-01-04.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  351322916 2012-11-23 00:28 /data/2012/trace_ejb3/2012-01-05.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  325953043 2012-11-23 04:32 /data/2012/trace_ejb3/2012-01-06.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  107019156 2012-11-23 05:58 /data/2012/trace_ejb3/2012-01-07.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup   46392850 2012-11-23 06:37 /data/2012/trace_ejb3/2012-01-08.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  361970930 2012-11-23 10:06 /data/2012/trace_ejb3/2012-01-09.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  398462505 2012-11-23 13:44 /data/2012/trace_ejb3/2012-01-10.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  400785976 2012-11-23 17:17 /data/2012/trace_ejb3/2012-01-11.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  400027565 2012-11-23 20:43 /data/2012/trace_ejb3/2012-01-12.avro
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Using Pig 0.10.0-cdh4.1.2, I try to load those files and describe them.
>>>>>>>>>>>
>>>>>>>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>>>>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>>>>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>>>>>>>
>>>>>>>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-01.avro' USING AvroStorage();
>>>>>>>>>>>
>>>>>>>>>>> describe avro;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This works, same with 2012-01-02.avro.
>>>>>>>>>>>
>>>>>>>>>>> However, as soon as I want to include multiple files, no dice.
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-{01,02}.avro' USING AvroStorage();
>>>>>>>>>>> gives me:
>>>>>>>>>>> 2012-11-23 21:41:07,475 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null
>>>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
>>>>>>>>>>> index 30: /data/2012/trace_ejb3/2012-01-{01,02}.avro
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
>>>>>>>>>>> gives me:
>>>>>>>>>>> Schema for avro unknown.
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0[12].avro' USING AvroStorage();
>>>>>>>>>>> also gives me:
>>>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
>>>>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What am I doing wrong here? According to
>>>>>>>>>>> <http://hadoop.apache.org/docs/r0.21.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29>,
>>>>>>>>>>> this should all be acceptable input?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>>
>>>>>>>>>>> Bart
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
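
Note on the URISyntaxException reported in this thread: the "Illegal character in path at index N" messages match what java.net.URI produces when it is handed a string containing '[' or '{'. The sketch below is only an illustration of that parsing behaviour, not AvroStorage code. It assumes (this is not confirmed anywhere in the thread) that the load path is passed to java.net.URI unescaped; the class name GlobUriCheck and the helper illegalIndex are hypothetical names made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Pig or Piggybank.
final class GlobUriCheck {

    // Returns the index of the first character java.net.URI rejects,
    // or -1 if the string parses as a URI without error.
    static int illegalIndex(String path) {
        try {
            new URI(path);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // The three load paths tried in the thread.
        String[] paths = {
            "/data/2012/trace_ejb3/2012-01-*.avro",
            "/data/2012/trace_ejb3/2012-01-0[12].avro",
            "/data/2012/trace_ejb3/2012-01-{01,02}.avro"
        };
        for (String p : paths) {
            System.out.println(illegalIndex(p) + "\t" + p);
        }
    }
}
```

With these inputs the '*' path parses cleanly, while the '[' and '{' paths are rejected at the same indices (31 and 30) quoted in the error messages above. That is consistent with the wildcard load failing in a different way (schema unknown) than the bracket and brace loads.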
