Yes, it is. Joe has unit test cases for path globbing in his patch: https://reviews.apache.org/r/8104/diff/#index_header
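
A side note on the URISyntaxException from earlier in this thread: the index it reports (31 for the [12] path, 30 for the {01,02} path) is exactly what java.net.URI gives for those strings, while '*' is a legal URI character and parses fine. The sketch below only demonstrates that JDK behaviour. That the old AvroStorage hands the load location to java.net.URI is my assumption from the exception type, and the names GlobUriCheck and illegalIndex are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of Pig or Piggybank.
final class GlobUriCheck {

    // Returns the index of the first character java.net.URI rejects,
    // or -1 if the string parses as a URI.
    static int illegalIndex(String location) {
        try {
            new URI(location);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // The three load locations tried in this thread.
        String[] locations = {
            "/data/2012/trace_ejb3/2012-01-*.avro",
            "/data/2012/trace_ejb3/2012-01-0[12].avro",
            "/data/2012/trace_ejb3/2012-01-{01,02}.avro",
        };
        for (String location : locations) {
            System.out.println(illegalIndex(location) + "  " + location);
        }
    }
}
```

If that assumption holds, shell quoting cannot help on 0.10, because the brackets are rejected before any glob matching takes place.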

On Mon, Nov 26, 2012 at 8:23 AM, Russell Jurney <[email protected]> wrote:

> Is the globbing feature making it into the AvroStorage rewrite?
>
> Russell Jurney twitter.com/rjurney
>
>
> On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[email protected]> wrote:
>
> > To answer myself again, I compiled Pig 0.11 and Piggybank, and it's
> > working very well now; globbing seems to be fully supported!
> >
> > Bart Verwilst wrote on 26.11.2012 15:33:
> >> To answer myself, could this be part of the solution?
> >>
> >> https://issues.apache.org/jira/browse/PIG-2492
> >>
> >> Guess I'll have to wait for 0.11 then?
> >>
> >> Bart Verwilst wrote on 26.11.2012 14:19:
> >>> 14:16:08 centos6-hadoop-hishiru ~ $ cat avro-test.pig
> >>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>
> >>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>> avro = load '/test/*' USING AvroStorage();
> >>> describe avro;
> >>>
> >>> 14:16:09 centos6-hadoop-hishiru ~ $ pig avro-test.pig
> >>> Schema for avro unknown.
> >>>
> >>> 14:16:17 centos6-hadoop-hishiru ~ $ vim avro-test.pig
> >>>
> >>> 14:16:25 centos6-hadoop-hishiru ~ $ cat avro-test.pig
> >>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>
> >>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
> >>> describe avro;
> >>>
> >>> 14:16:30 centos6-hadoop-hishiru ~ $ pig avro-test.pig
> >>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
> >>> int,heading: int,terminalid: int,customerid: chararray,mileage:
> >>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
> >>> (id: long,value: chararray,pkey: chararray)}}
> >>>
> >>> 14:16:55 centos6-hadoop-hishiru ~ $ hadoop fs -ls /test/
> >>> Found 1 items
> >>> -rw-r--r-- 3 hdfs supergroup 63140500 2012-11-26 14:13 /test/2012-11-25.avro
> >>>
> >>> Cheolsoo Park wrote on 26.11.2012 10:45:
> >>>> Hi,
> >>>>
> >>>>>> Invalid field projection. Projected field [tracetype] does not exist.
> >>>>
> >>>> The error indicates that "tracetype" doesn't exist in the Pig schema
> >>>> of the relation "avro". What AvroStorage does is automatically convert
> >>>> the Avro schema to a Pig schema during the load. Although you have
> >>>> "tracetype" in your Avro schema, "tracetype" doesn't exist in the
> >>>> generated Pig schema for whatever reason.
> >>>>
> >>>> Can you please try "describe avro"? You can replace the group and dump
> >>>> commands with describe in your Pig script. This will show you what the
> >>>> Pig schema of "avro" is. If "tracetype" indeed doesn't exist, you have
> >>>> to find out why it doesn't. It could be because the schema of the .avro
> >>>> files is not the same or because there is a bug in AvroStorage, etc.
> >>>>
> >>>>>> Maybe globbing with [] doesnt work, but wildcard works?
> >>>>
> >>>> You're right.
> >>>> AvroStorage internally uses Hadoop path globbing, and Hadoop path
> >>>> globbing doesn't support '[ ]'. But the above error (Projected field
> >>>> [tracetype] does not exist) is not because of this. URISyntaxException
> >>>> is what you will get because of '[ ]'.
> >>>>
> >>>> Thanks,
> >>>> Cheolsoo
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[email protected]> wrote:
> >>>>
> >>>>> Just tried this:
> >>>>>
> >>>>>
> >>>>> ----------------------------------------------------
> >>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>>>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>>>
> >>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>>>>
> >>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
> >>>>>
> >>>>> groups = group avro by tracetype;
> >>>>>
> >>>>> dump groups;
> >>>>> ----------------------------------------------------
> >>>>>
> >>>>> gave me:
> >>>>>
> >>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
> >>>>> Projected field [tracetype] does not exist.
> >>>>>
> >>>>> Pig Stack Trace
> >>>>> ---------------
> >>>>> ERROR 1025:
> >>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
> >>>>> Projected field [tracetype] does not exist.
> >>>>>
> >>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias groups
> >>>>>     at org.apache.pig.PigServer.openIterator(PigServer.java:862)
> >>>>>     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
> >>>>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
> >>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
> >>>>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> >>>>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> >>>>>     at org.apache.pig.Main.run(Main.java:555)
> >>>>>     at org.apache.pig.Main.main(Main.java:111)
> >>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> >>>>> Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias groups
> >>>>>     at org.apache.pig.PigServer.storeEx(PigServer.java:961)
> >>>>>     at org.apache.pig.PigServer.store(PigServer.java:924)
> >>>>>     at org.apache.pig.PigServer.openIterator(PigServer.java:837)
> >>>>>     ... 12 more
> >>>>> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 1025:
> >>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
> >>>>> Projected field [tracetype] does not exist.
> >>>>>     at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:183)
> >>>>>     at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:166)
> >>>>>     at org.apache.pig.newplan.logical.visitor.ColumnAliasConversionVisitor$1.visit(ColumnAliasConversionVisitor.java:53)
> >>>>>     at org.apache.pig.newplan.logical.expression.ProjectExpression.accept(ProjectExpression.java:207)
> >>>>>     at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
> >>>>>     at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
> >>>>>     at org.apache.pig.newplan.logical.optimizer.AllExpressionVisitor.visit(AllExpressionVisitor.java:101)
> >>>>>     at org.apache.pig.newplan.logical.relational.LOCogroup.accept(LOCogroup.java:235)
> >>>>>     at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
> >>>>>     at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
> >>>>>     at org.apache.pig.PigServer$Graph.compile(PigServer.java:1621)
> >>>>>     at org.apache.pig.PigServer$Graph.compile(PigServer.java:1616)
> >>>>>     at org.apache.pig.PigServer$Graph.access$200(PigServer.java:1339)
> >>>>>     at org.apache.pig.PigServer.storeEx(PigServer.java:956)
> >>>>>     ... 14 more
> >>>>> ================================================================================
> >>>>>
> >>>>>
> >>>>> Maybe globbing with [] doesn't work, but wildcard works? No idea why
> >>>>> I get the error above though..
> >>>>>
> >>>>>
> >>>>> Kind regards,
> >>>>>
> >>>>> Bart
> >>>>>
> >>>>> Cheolsoo Park wrote on 25.11.2012 15:33:
> >>>>>
> >>>>>> Hi Bart,
> >>>>>>
> >>>>>> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
> >>>>>> gives me:
> >>>>>> Schema for avro unknown.
> >>>>>>
> >>>>>> This should work. The error that you're getting is not from
> >>>>>> AvroStorage but from PigServer.
> >>>>>>
> >>>>>> grep -r "Schema for .* unknown" *
> >>>>>> src/org/apache/pig/PigServer.java:
> >>>>>> System.out.println("Schema for " + alias + " unknown.");
> >>>>>> ...
> >>>>>>
> >>>>>> It looks like you have an error in your Pig script. Can you please
> >>>>>> provide your Pig script and the schema of your avro files that
> >>>>>> reproduce the error?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Cheolsoo
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Nov 25, 2012 at 1:02 AM, Bart Verwilst <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I've tried loading a csv with PigStorage(), getting this:
> >>>>>>>
> >>>>>>>
> >>>>>>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',');
> >>>>>>> describe txt;
> >>>>>>>
> >>>>>>> Schema for txt unknown.
> >>>>>>>
> >>>>>>> Maybe this is because of it being a csv, so a schema is hard to
> >>>>>>> figure out..
> >>>>>>>
> >>>>>>> Any other suggestions? Our whole hadoop setup is built around being
> >>>>>>> able to selectively load avro files to run our jobs on; if this
> >>>>>>> doesn't work then we're pretty much screwed.. :)
> >>>>>>>
> >>>>>>> Thanks in advance!
> >>>>>>>
> >>>>>>> Bart
> >>>>>>>
> >>>>>>> Russell Jurney wrote on 24.11.2012 20:23:
> >>>>>>>
> >>>>>>>> I suspect the problem is AvroStorage, not globbing. Try this with
> >>>>>>>> PigStorage.
> >>>>>>>>
> >>>>>>>> Russell Jurney twitter.com/rjurney
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> Thanks for your suggestion!
> >>>>>>>>> I switched my avro variable to avro = load '$INPUT' USING AvroStorage();
> >>>>>>>>>
> >>>>>>>>> However I get the same results this way:
> >>>>>>>>>
> >>>>>>>>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
> >>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
> >>>>>>>>> <snip>
> >>>>>>>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
> >>>>>>>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
> >>>>>>>>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
> >>>>>>>>> (id: long,value: chararray,pkey: chararray)}}
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> $ pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" avro-test.pig
> >>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
> >>>>>>>>> <snip>
> >>>>>>>>> 2012-11-24 14:11:17,309 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >>>>>>>>> - ERROR 2999: Unexpected internal error. null
> >>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
> >>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> $ pig -p INPUT='/data/2012/trace_ejb3/2012-01-0[12].avro' avro-test.pig
> >>>>>>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
> >>>>>>>>> <snip>
> >>>>>>>>> 2012-11-24 14:12:05,085 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >>>>>>>>> - ERROR 2999: Unexpected internal error. null
> >>>>>>>>> Details at logfile: /var/lib/hadoop-hdfs/pig_1353762722742.log
> >>>>>>>>>
> >>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
> >>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Deepak Tiwari wrote on 24.11.2012 00:41:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I don't have a system to test it on right now, but I have been
> >>>>>>>>>> passing it with the -p parameter and it works.
> >>>>>>>>>>
> >>>>>>>>>> Change the line to accept parameters, like
> >>>>>>>>>> avro = load '$INPUT' USING AvroStorage();
> >>>>>>>>>>
> >>>>>>>>>> bin/pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" <scriptName>
> >>>>>>>>>>
> >>>>>>>>>> I think if you don't give double quotes then the expansion is done
> >>>>>>>>>> by the OS.
> >>>>>>>>>>
> >>>>>>>>>> Please let us know if it doesn't work...
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Nov 23, 2012 at 12:45 PM, Bart Verwilst <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello,
> >>>>>>>>>>>
> >>>>>>>>>>> I have the following files on HDFS:
> >>>>>>>>>>>
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup  22989179 2012-11-22 11:17 /data/2012/trace_ejb3/2012-01-01.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 240551819 2012-11-22 14:27 /data/2012/trace_ejb3/2012-01-02.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 324464635 2012-11-22 18:28 /data/2012/trace_ejb3/2012-01-03.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 345526418 2012-11-22 21:30 /data/2012/trace_ejb3/2012-01-04.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 351322916 2012-11-23 00:28 /data/2012/trace_ejb3/2012-01-05.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 325953043 2012-11-23 04:32 /data/2012/trace_ejb3/2012-01-06.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 107019156 2012-11-23 05:58 /data/2012/trace_ejb3/2012-01-07.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup  46392850 2012-11-23 06:37 /data/2012/trace_ejb3/2012-01-08.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 361970930 2012-11-23 10:06 /data/2012/trace_ejb3/2012-01-09.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 398462505 2012-11-23 13:44 /data/2012/trace_ejb3/2012-01-10.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 400785976 2012-11-23 17:17 /data/2012/trace_ejb3/2012-01-11.avro
> >>>>>>>>>>> -rw-r--r-- 3 hdfs supergroup 400027565 2012-11-23 20:43 /data/2012/trace_ejb3/2012-01-12.avro
> >>>>>>>>>>>
> >>>>>>>>>>> Using Pig 0.10.0-cdh4.1.2, I try to load those files and
> >>>>>>>>>>> describe them.
> >>>>>>>>>>>
> >>>>>>>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>>>>>>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>>>>>>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>>>>>>>>>
> >>>>>>>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>>>>>>>>>>
> >>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-01.avro' USING AvroStorage();
> >>>>>>>>>>>
> >>>>>>>>>>> describe avro;
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> This works, same with 2012-01-02.avro.
> >>>>>>>>>>>
> >>>>>>>>>>> However, as soon as I want to include multiple files, no dice.
> >>>>>>>>>>>
> >>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-{01,02}.avro' USING AvroStorage();
> >>>>>>>>>>> gives me:
> >>>>>>>>>>> 2012-11-23 21:41:07,475 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >>>>>>>>>>> - ERROR 2999: Unexpected internal error. null
> >>>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
> >>>>>>>>>>> index 30: /data/2012/trace_ejb3/2012-01-{01,02}.avro
> >>>>>>>>>>>
> >>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
> >>>>>>>>>>> gives me:
> >>>>>>>>>>> Schema for avro unknown.
> >>>>>>>>>>>
> >>>>>>>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0[12].avro' USING AvroStorage();
> >>>>>>>>>>> also gives me:
> >>>>>>>>>>> Caused by: java.net.URISyntaxException: Illegal character in path at
> >>>>>>>>>>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> What am I doing wrong here?
> >>>>>>>>>>> According to
> >>>>>>>>>>> http://hadoop.apache.org/docs/r0.21.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29,
> >>>>>>>>>>> this should all be acceptable input?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks in advance!
> >>>>>>>>>>>
> >>>>>>>>>>> Kind regards,
> >>>>>>>>>>>
> >>>>>>>>>>> Bart
