Is the globbing feature making it into the AvroStorage rewrite?

Russell Jurney twitter.com/rjurney
On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[email protected]> wrote:

> To answer myself again, I compiled Pig 0.11 and Piggybank, and it's working
> very well now; globbing seems to be fully supported!
>
> Bart Verwilst schreef op 26.11.2012 15:33:
>> To answer myself, could this be part of the solution?
>>
>> https://issues.apache.org/jira/browse/PIG-2492
>>
>> Guess I'll have to wait for 0.11 then?
>>
>> Bart Verwilst schreef op 26.11.2012 14:19:
>>> 14:16:08 centos6-hadoop-hishiru ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/*' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:09 centos6-hadoop-hishiru ~ $ pig avro-test.pig
>>> Schema for avro unknown.
>>>
>>> 14:16:17 centos6-hadoop-hishiru ~ $ vim avro-test.pig
>>>
>>> 14:16:25 centos6-hadoop-hishiru ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:30 centos6-hadoop-hishiru ~ $ pig avro-test.pig
>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,
>>> heading: int,terminalid: int,customerid: chararray,mileage: int,
>>> creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>>> (id: long,value: chararray,pkey: chararray)}}
>>>
>>> 14:16:55 centos6-hadoop-hishiru ~ $ hadoop fs -ls /test/
>>> Found 1 items
>>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13 /test/2012-11-25.avro
>>>
>>> Cheolsoo Park schreef op 26.11.2012 10:45:
>>>> Hi,
>>>>
>>>>>> Invalid field projection. Projected field [tracetype] does not exist.
>>>>
>>>> The error indicates that "tracetype" doesn't exist in the Pig schema of
>>>> the relation "avro". What AvroStorage does is automatically convert the
>>>> Avro schema to a Pig schema during the load. Although you have "tracetype"
>>>> in your Avro schema, it doesn't exist in the generated Pig schema for
>>>> whatever reason.
>>>>
>>>> Can you please try "describe avro"? You can replace the group and dump
>>>> commands with describe in your Pig script. This will show you what the Pig
>>>> schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find
>>>> out why: it could be that the schemas of the .avro files are not all the
>>>> same, or that there is a bug in AvroStorage, etc.
>>>>
>>>>>> Maybe globbing with [] doesn't work, but wildcard works?
>>>>
>>>> You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
>>>> path globbing doesn't support '[ ]'. But the above error (Projected field
>>>> [tracetype] does not exist) is not because of this. URISyntaxException is
>>>> what you will get because of '[ ]'.
>>>>
>>>> Thanks,
>>>> Cheolsoo
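For anyone who lands on this thread later, here is a minimal sketch of the wildcard-based load that Bart reports working after moving to Pig 0.11 and the rebuilt piggybank, assembled from the scripts quoted above. The jar versions, HDFS paths, and field names are the ones from this thread and will differ per installation.

    REGISTER 'hdfs:///lib/avro-1.7.2.jar';
    REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
    REGISTER 'hdfs:///lib/piggybank.jar';

    DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

    -- '*' is handled by Hadoop path globbing, so every matching .avro file is loaded.
    avro = LOAD '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();

    -- Check that the Avro-to-Pig schema conversion produced the fields you plan
    -- to reference (e.g. tracetype) before grouping on them.
    DESCRIBE avro;

    groups = GROUP avro BY tracetype;
    DUMP groups;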
>>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[email protected]> wrote:
>>>>
>>>>> Just tried this:
>>>>>
>>>>> ----------------------------------------------------
>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>
>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>
>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>>>>
>>>>> groups = group avro by tracetype;
>>>>>
>>>>> dump groups;
>>>>> ----------------------------------------------------
>>>>>
>>>>> gave me:
>>>>>
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> Pig Stack Trace
>>>>> ---------------
>>>>> ERROR 1025:
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias groups
>>>>>         at org.apache.pig.PigServer.openIterator(PigServer.java:862)
>>>>>         at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
>>>>>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
>>>>>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
>>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>> Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias groups
>>>>>         at org.apache.pig.PigServer.storeEx(PigServer.java:961)
>>>>>         at org.apache.pig.PigServer.store(PigServer.java:924)
>>>>>         at org.apache.pig.PigServer.openIterator(PigServer.java:837)
>>>>>         ... 12 more
>>>>> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 1025:
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>         at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:183)
>>>>>         at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:166)
>>>>>         at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
>>>>>         at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:207)
>>>>>         at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>>>>>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>>>>>         at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:101)
>>>>>         at org.apache.pig.newplan.logical.relational.LOCogroup.accept(LOCogroup.java:235)
>>>>>         at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>>>>>         at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>>>>>         at org.apache.pig.PigServer$Graph.compile(PigServer.java:1621)
>>>>>         at org.apache.pig.PigServer$Graph.compile(PigServer.java:1616)
>>>>>         at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1339)
>>>>>         at org.apache.pig.PigServer.storeEx(PigServer.java:956)
>>>>>         ... 14 more
>>>>> ==================================================================================
>>>>>
>>>>> Maybe globbing with [] doesn't work, but wildcard works? No idea why I get
>>>>> the error above though...
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Bart
>>>>>
>>>>> Cheolsoo Park schreef op 25.11.2012 15:33:
>>>>>
>>>>>> Hi Bart,
>>>>>>
>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
>>>>>> gives me:
>>>>>> Schema for avro unknown.
>>>>>>
>>>>>> This should work. The error that you're getting is not from AvroStorage
>>>>>> but from PigServer:
>>>>>>
>>>>>> grep -r "Schema for .* unknown" *
>>>>>> src/org/apache/pig/PigServer.java:
>>>>>> System.out.println("Schema for " + alias + " unknown.");
>>>>>> ...
>>>>>>
>>>>>> It looks like you have an error in your Pig script. Can you please
>>>>>> provide your Pig script and the schema of your avro files that reproduce
>>>>>> the error?
>>>>>>
>>>>>> Thanks,
>>>>>> Cheolsoo
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 25, 2012 at 1:02 AM, Bart Verwilst <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've tried loading a csv with PigStorage(), getting this:
>>>>>>>
>>>>>>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',');
>>>>>>> describe txt;
>>>>>>>
>>>>>>> Schema for txt unknown.
>>>>>>>
>>>>>>> Maybe this is because it's a csv, so a schema is hard to figure out...
>>>>>>>
>>>>>>> Any other suggestions? Our whole hadoop setup is built around being able
>>>>>>> to selectively load avro files to run our jobs on, so if this doesn't
>>>>>>> work then we're pretty much screwed.. :)
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Bart
>>>>>>>
>>>>>>> Russell Jurney schreef op 24.11.2012 20:23:
>>>>>>>
>>>>>>>> I suspect the problem is AvroStorage, not globbing. Try this with
>>>>>>>> PigStorage.
>>>>>>>>
>>>>>>>> Russell Jurney twitter.com/rjurney
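A side note for anyone retracing Russell's PigStorage suggestion: by default PigStorage does not infer a schema from a CSV, so "Schema for txt unknown" is the expected output unless the LOAD declares one with an AS clause. A minimal sketch follows; the column names and types are hypothetical and would need to match the actual export, and the part-m-* pattern simply globs over the part files mentioned above.

    -- The AS clause supplies the schema PigStorage cannot infer; columns here are made up.
    txt = LOAD '/import.mysql/trace_ejb3_2011/part-m-*' USING PigStorage(',')
          AS (id:long, tracetype:int, customerid:chararray);
    DESCRIBE txt;                      -- now prints the declared schema
    groups = GROUP txt BY tracetype;   -- if this compiles and runs, globbing itself is not the culprit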
>>>>>>>> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Thanks for your suggestion!
>>>>>>>>> I switched my avro variable to: avro = load '$INPUT' USING AvroStorage();
>>>>>>>>>
>>>>>>>>> However I get the same results this way:
>>>>>>>>>
>>>>>>>>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
>>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>>>>>>> <snip>
>>>>>>>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,
>>>>>>>>> heading: int,terminalid: int,customerid: chararray,mileage: int,
>>>>>>>>> creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id:
>>>>>>>>> long,value: chararray,pkey: chararray)}}
>>>>>>>>>
>>>>>>>>> $ pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" avro-test.pig
>>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>>>>>>> <snip>
>>>>>>>>> 2012-11-24 14:11:17,309 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>>>>>> - ERROR 2999: Unexpected internal error. null
>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
>>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>>>>>>>>
>>>>>>>>> $ pig -p INPUT='/data/2012/trace_ejb3/2012-01-0[12].avro' avro-test.pig
>>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>>>>>>> <snip>
>>>>>>>>> 2012-11-24 14:12:05,085 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>>>>>> - ERROR 2999: Unexpected internal error. null
>>>>>>>>> Details at logfile: /var/lib/hadoop-hdfs/pig_1353762722742.log
>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
>>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>>>>>>>>
>>>>>>>>> Deepak Tiwari schreef op 24.11.2012 00:41:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I don't have a system to test it right now, but I have been passing
>>>>>>>>>> the path in with the -p parameter and it works.
>>>>>>>>>>
>>>>>>>>>> Change the line to accept a parameter, like: avro = load '$INPUT' USING AvroStorage();
>>>>>>>>>>
>>>>>>>>>> bin/pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" <scriptName>
>>>>>>>>>>
>>>>>>>>>> I think if you don't give double quotes then the expansion is done by
>>>>>>>>>> the OS.
>>>>>>>>>>
>>>>>>>>>> Please let us know if it doesn't work...
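To make the quoting point concrete, a sketch of the parameter-substitution approach Deepak describes, using the script and paths from this thread. Single quotes keep the local shell from expanding the glob, so Pig receives the pattern verbatim:

    # avro-test.pig is assumed to contain:
    #   avro = LOAD '$INPUT' USING AvroStorage();
    #   DESCRIBE avro;
    pig -p INPUT='/data/2012/trace_ejb3/2012-01-0*.avro' avro-test.pig

As Bart's output above shows, quoting alone does not rescue the '[12]' pattern on this AvroStorage version; both quoted forms still end in URISyntaxException, which points at the loader's path handling rather than shell expansion.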
>>>>>>>>>> On Fri, Nov 23, 2012 at 12:45 PM, Bart Verwilst <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I have the following files on HDFS:
>>>>>>>>>>>
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup   22989179 2012-11-22 11:17 /data/2012/trace_ejb3/2012-01-01.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  240551819 2012-11-22 14:27 /data/2012/trace_ejb3/2012-01-02.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  324464635 2012-11-22 18:28 /data/2012/trace_ejb3/2012-01-03.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  345526418 2012-11-22 21:30 /data/2012/trace_ejb3/2012-01-04.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  351322916 2012-11-23 00:28 /data/2012/trace_ejb3/2012-01-05.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  325953043 2012-11-23 04:32 /data/2012/trace_ejb3/2012-01-06.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  107019156 2012-11-23 05:58 /data/2012/trace_ejb3/2012-01-07.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup   46392850 2012-11-23 06:37 /data/2012/trace_ejb3/2012-01-08.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  361970930 2012-11-23 10:06 /data/2012/trace_ejb3/2012-01-09.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  398462505 2012-11-23 13:44 /data/2012/trace_ejb3/2012-01-10.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  400785976 2012-11-23 17:17 /data/2012/trace_ejb3/2012-01-11.avro
>>>>>>>>>>> -rw-r--r--   3 hdfs supergroup  400027565 2012-11-23 20:43 /data/2012/trace_ejb3/2012-01-12.avro
>>>>>>>>>>>
>>>>>>>>>>> Using Pig 0.10.0-cdh4.1.2, I try to load those files and describe them:
>>>>>>>>>>>
>>>>>>>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>>>>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>>>>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>>>>>>>
>>>>>>>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-01.avro' USING AvroStorage();
>>>>>>>>>>>
>>>>>>>>>>> describe avro;
>>>>>>>>>>>
>>>>>>>>>>> This works, same with 2012-01-02.avro.
>>>>>>>>>>>
>>>>>>>>>>> However, as soon as I want to include multiple files, no dice.
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-{01,02}.avro' USING AvroStorage();
>>>>>>>>>>> gives me:
>>>>>>>>>>> 2012-11-23 21:41:07,475 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>>>>>>>>> ERROR 2999: Unexpected internal error. null
>>>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at index
>>>>>>>>>>> 30: /data/2012/trace_ejb3/2012-01-{01,02}.avro
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
>>>>>>>>>>> gives me:
>>>>>>>>>>> Schema for avro unknown.
>>>>>>>>>>>
>>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0[12].avro' USING AvroStorage();
>>>>>>>>>>> also gives me:
>>>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at index
>>>>>>>>>>> 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>>>>>>>>>>
>>>>>>>>>>> What am I doing wrong here? According to
>>>>>>>>>>> http://hadoop.apache.org/docs/r0.21.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>>>>>>>>>>> this should all be acceptable input?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>>
>>>>>>>>>>> Bart
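One additional check that can narrow this kind of problem down: the same glob can be tried directly against HDFS, independently of Pig, to separate "the pattern is not valid for FileSystem.globStatus" from "the loader mishandles the path string". A sketch using the paths from this thread:

    # Quote the pattern so the local shell does not expand it first.
    hadoop fs -ls '/data/2012/trace_ejb3/2012-01-0[12].avro'
    hadoop fs -ls '/data/2012/trace_ejb3/2012-01-{01,02}.avro'

If these list the expected files, the glob is fine at the HDFS level and the Pig-side error comes from how the loader handles the path, which is what the later messages in this thread (and the Pig 0.11 AvroStorage work) point to.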
