Erik Paulson
Mon, 31 Mar 2008 15:11:00 -0700
On Mon, Mar 24, 2008 at 03:20:02PM -0700, Benjamin Reed wrote:
> PigStorage uses regex for splitting as defined in:
>
> http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#sum
>
> It looks like you might need to specify PigStorage('[|]').
>
> And yes, pig does process directories just like hadoop.

Sorry to keep asking beginning questions, but what is the syntax to get pig
to load directories?

grunt> ls /scratch/epaulson/small/test
file:/scratch/epaulson/small/test/foo<r 1> 19
file:/scratch/epaulson/small/test/zot<r 1> 21
file:/scratch/epaulson/small/test/bar<r 1> 19
grunt> cat /scratch/epaulson/small/test/foo
first|second|third
grunt> cat /scratch/epaulson/small/test/bar
fourth|fifth|sixth
grunt> cat /scratch/epaulson/small/test/zot
seventh|eighth|ninth
grunt> dircontents = load '/scratch/epaulson/small/test/' using PigStorage('[|]');
grunt> dump dircontents;
2008-03-31 14:24:26,251 [main] ERROR org.apache.pig.tools.grunt.GruntParser - Unable to open iterator for alias: dircontents

Thanks!

-Erik

>
> ben
>
> On Monday 24 March 2008 15:07:39 Erik Paulson wrote:
> > Hello all -
> >
> > I'm trying to load data that is separated by '|' characters, using the
> > PigStorage layer (using today's SVN)
> >
> > From following the code in Tuple, I think I'm doing this right, but maybe
> > something in the parser is eating my character separators?
> >
> > grunt> cat /tmp/pipeseperated
> > first|second|third
> > grunt> cat /tmp/commaseperated
> > first,second,third
> > grunt> pipedata = load '/tmp/pipeseperated' using PigStorage('\\|');
> > grunt> commadata = load '/tmp/commaseperated' using PigStorage(',');
> > grunt> dump pipedata
> > (, f, i, r, s, t, |, s, e, c, o, n, d, |, t, h, i, r, d, )
> > grunt> dump commadata;
> > (first, second, third)
> > grunt> trytwo = load '/tmp/pipeseperated' using PigStorage('|');
> > grunt> dump trytwo
> > (, f, i, r, s, t, |, s, e, c, o, n, d, |, t, h, i, r, d, )
> >
> > And a second question: in Hadoop, it's customary to give a path to a
> > directory containing all of the input files - is the same thing doable in
> > Pig?
> >
> > Thanks!
> >
> > -Erik
>
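
For anyone reading this thread later, the delimiter behaviour discussed above can be seen with plain java.util.regex, outside of Pig. The following is only a sketch of the regex semantics Ben points to; it does not reproduce PigStorage's own code, and the class and method names (DelimiterSplitDemo, splitFields) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it is not part of Pig.
final class DelimiterSplitDemo {

    // Split one line, treating the delimiter as a java.util.regex pattern.
    static List<String> splitFields(String line, String delimiterRegex) {
        return Arrays.asList(Pattern.compile(delimiterRegex).split(line));
    }

    public static void main(String[] args) {
        String line = "first|second|third";

        // A bare | is regex alternation between two empty alternatives, so it
        // matches the empty string between characters rather than the literal
        // pipe. The line falls apart into many short fields instead of three.
        System.out.println(splitFields(line, "|"));

        // Inside a character class, | is a literal character.
        System.out.println(splitFields(line, "[|]"));  // [first, second, third]

        // Escaped pipe: the regex \| written as the Java string literal "\\|".
        System.out.println(splitFields(line, "\\|"));  // [first, second, third]
    }
}
```

The character class form '[|]' has the practical advantage that it does not depend on how any layer in front of the regex engine treats backslashes.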