Vishal,
The way we handle all development is by creating tickets on the Pig JIRA and attaching patches that address those tickets; one of the committers then reviews the patch and either provides feedback or commits it to Pig.
More info here: http://wiki.apache.org/pig/HowToContribute

-D

On Mon, May 24, 2010 at 1:50 PM, Vishal Santoshi <[email protected]> wrote:

I will spruce it up; there are a few changes to make the abstraction better, since limiting the columns for performance is currently done in the concrete impls, which is hardly a good option for reusability.

I am sure, though, that I do not have submit rights to the Pig github.

On Mon, May 24, 2010 at 4:21 PM, Dmitriy Ryaboy <[email protected]> wrote:

Vishal,
Now I get it. Looks really good, actually. It would be great if you polished this up and submitted it to piggybank.

-D

On Mon, May 24, 2010 at 1:14 PM, Vishal Santoshi <[email protected]> wrote:

Using the SequenceFileInputFormat (or an extension of it),

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new SequenceFileInputFormat<Writable, Writable>();
    }

and

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

is all that one would need, I think. This is Pig 0.7.0, though.
I have only tried patterns like a/*/xyz*, and they have worked for me.

On Mon, May 24, 2010 at 4:05 PM, Edward Capriolo <[email protected]> wrote:

Sounds great. I do not need all the things you do, but I do need a sequence file loader that will take globs or directories. The current loader only lets you load a single file.

On Mon, May 24, 2010 at 4:03 PM, Vishal Santoshi <[email protected]> wrote:

All said and done, does this smell like a hack, or is it acceptable for my use case, where I am only interested in making my Sequential Files and their contents use Pig to its fullest?

On Mon, May 24, 2010 at 3:59 PM, Vishal Santoshi <[email protected]> wrote:

Sorry, Dmitriy.

Let me explain our issue more lucidly. Most of our MR jobs use raw Hadoop (Java impl) and create SequentialFiles with varying custom Writables.
PigStorage is limited to text format, and the SequentialFile loader implementation in piggybank seems limited, in the sense that it

* does not provide for custom formats (like a TextPair or a Score that may use basic Writables such as Text, DoubleWritable, etc.)
* does not provide for type/name mapping (the "AS" clause)
* does not provide for limiting the inputs you may be interested in.

I want to use a Loader to provide for something like this:

    LOAD 'input' USING SequenceFileLoader AS (f1:chararray, f2:chararray, f3:long, f4:chararray, f5:chararray, f6:chararray, f7:double);

Now this is well and good and easy to write if we have some standard (Text, NullWritable) Sequential File, with the Text having comma-separated columns (almost a PigStorage, but feeding off a Sequential File).

In cases, though, where we have a Sequential File of (CustomWritableKey, CustomWritableValue) from which we would still like to extract the raw types and aggregate on, the above fails, as the chararray, int, etc. are limited to known types (and I may be wrong here).
What I therefore tried was to reduce the CustomWritables to their raw types, using an injectable Converter. This converter takes the CustomWritable (key and value of a SequentialFile) and returns an ArrayList<Object> that is the CustomWritables reduced to their base types; the returned list is then used to create the Tuple that getNext() returns.

I think this code is more likely to tell the tale better:

http://pastebin.com/QEwMztjU

On Mon, May 24, 2010 at 3:32 PM, Dmitriy Ryaboy <[email protected]> wrote:

Vishal,
I am not sure what your question is. Could you describe your goals and challenges before pasting in the implementation? It looks like the bottom part of your email, with all the comments, got malformatted, which may be the source of my confusion.

Also, various services like pastebin and gist work better for code sharing, as they can take care of highlighting and things of that nature, which is handy for reviews.

Thanks,
-Dmitriy

On Mon, May 24, 2010 at 9:41 AM, Vishal Santoshi <[email protected]> wrote:

I have this working, so I am seeking validation and corrections.
We have SequentialFiles with various CustomWritables in Hadoop, and we want to be able to work with them from within Pig.

I have taken PigStorage and the piggybank SequenceFileLoader as a template and added pluggable converters that are fed through the SequenceFileLoader (which has a default). The below is part of the Java file.

    public class SequenceFileLoader extends FileInputLoadFunc implements LoadPushDown {

        public SequenceFileLoader() {
            converter = new TextConverter();
        }

        @SuppressWarnings("unchecked")
        public SequenceFileLoader(String customWritableToTupleBaseConverter) throws FrontendException {
            try {
                converter = (CustomWritableToTupleBaseConverter) Class
                        .forName(customWritableToTupleBaseConverter).newInstance();
            } catch (Exception e) {
                throw new FrontendException(e);
            }
        }

        @SuppressWarnings("unchecked")
        @Override
        public Tuple getNext() throws IOException {
            if (!mRequiredColumnsInitialized) {
                if (signature != null) {
                    Properties p = UDFContext.getUDFContext().getUDFProperties(this.getClass());
                    mRequiredColumns = (boolean[]) ObjectSerializer.deserialize(p.getProperty(signature));
                }
                mRequiredColumnsInitialized = true;
            }
            boolean next = false;
            try {
                next = reader.nextKeyValue();
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
            if (!next) return null;

            key = reader.getCurrentKey();
            value = reader.getCurrentValue();
            converter.populateTupleList(key, value, mRequiredColumns, mProtoTuple);
            Tuple t = mTupleFactory.newTuple(mProtoTuple);
            mProtoTuple.clear();
            return t;
        }
    }

and

    public abstract class CustomWritableToTupleBaseConverter<K extends Writable, V extends Writable> {

        public abstract void populateTupleList(K time, V value, boolean[] mRequiredColumns,
                ArrayList<Object> mProtoTuple) throws IOException;
    }

Features:

* Allows for a default format (TextConverter)
  ** (Text, NullWritable)
  *** The Text is treated as a comma (",") separated text array
  **** Consider a Text with values 1, 2, 3
  **** grunt> DEFINE SequenceFileLoader com.medialets.hadoop.pig.SequenceFileLoader()
  **** grunt> A = LOAD 'input' USING SequenceFileLoader
  **** grunt> B = FOREACH A GENERATE $3
  **** grunt> 3
* Allows for custom formats (example: TimeWritableTestLongConverter)
  ** It is up to the custom converter to provide the SequenceFileLoader with the Writables,
  *** via "public abstract void populateTupleList(K time, V value, boolean[] mRequiredColumns, ArrayList<Object> mProtoTuple) throws IOException;" in the base class CustomWritableToTupleBaseConverter
  *** The custom converter has to convert its key/value (as specified by the SequenceFile) into a list of Pig-recognizable DataTypes
  **** grunt> DEFINE SequenceFileLoader a.b.c.SequenceFileLoader('a.b.b.SomeConverter');
  **** grunt> A = LOAD 'input' USING SequenceFileLoader AS (f1:chararray, f2:chararray, f3:long, f4:chararray, f5:chararray, f6:chararray, f7:double);
  **** grunt> B = FILTER A BY f7 + 1 > .5;
  ** Note that Pig has to be told the type of each column for it to do the right conversion. In the above example, if f7 is not defined as double, Pig will try to cast it into an int, since we are adding 1 to the value.
  ** Note that the custom converter is an argument defined in the DEFINE call.
* Allows for limiting the number of columns in the input
  ** grunt> A = LOAD 'input' USING SequenceFileLoader AS (f1:chararray, f2:chararray, f3:long, f4:chararray, f5:chararray, f6:chararray, f7:double);

Any issues anyone sees in this approach?

I have chosen the path of least resistance, so any guidance will be appreciated.
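The injectable-converter idea at the heart of this thread can be sketched with plain JDK types, with no Hadoop or Pig dependencies. Everything below is an illustrative stand-in: PairConverter plays the role of CustomWritableToTupleBaseConverter, and String/Long stand in for the custom Writables; none of these names come from Pig or piggybank.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CustomWritableToTupleBaseConverter: reduce a
// key/value pair to a flat list of base types (the proto-tuple).
abstract class PairConverter<K, V> {
    public abstract void populate(K key, V value, List<Object> proto);
}

// A concrete converter for a (String, Long) pair.
class StringLongConverter extends PairConverter<String, Long> {
    @Override
    public void populate(String key, Long value, List<Object> proto) {
        proto.add(key);
        proto.add(value);
    }
}

public class ConverterDemo {
    // Instantiate a converter reflectively from its class name, the way the
    // loader's one-argument constructor would receive it from the DEFINE call.
    @SuppressWarnings("unchecked")
    public static <K, V> PairConverter<K, V> load(String className) throws Exception {
        return (PairConverter<K, V>) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static List<Object> demo() throws Exception {
        PairConverter<String, Long> c = load("StringLongConverter");
        List<Object> proto = new ArrayList<>();
        c.populate("ts", 42L, proto);
        return proto; // ["ts", 42L]
    }
}
```

In the real loader the cast failure surfaces as a FrontendException; here any reflection error simply propagates, which is enough to show the pattern.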

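The column-limiting feature discussed above hands the converter a boolean mask of required columns (the boolean[] the loader deserializes from UDFContext via Pig's LoadPushDown mechanism). A converter honoring such a mask might look like the following stdlib-only sketch; the class and method names are hypothetical, and a comma-separated String stands in for the Text value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: apply a required-columns mask while building the
// proto-tuple. A null mask means "keep every column"; otherwise only the
// indices marked true reach the output list.
public class PruningConverter {
    public static List<Object> populate(String csvValue, boolean[] requiredColumns) {
        String[] fields = csvValue.split(",");
        List<Object> proto = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            boolean wanted = requiredColumns == null
                    || (i < requiredColumns.length && requiredColumns[i]);
            if (wanted) {
                proto.add(fields[i].trim());
            }
        }
        return proto;
    }
}
```

For example, populate("a, b, c", new boolean[]{true, false, true}) keeps only the first and third fields, which is the pruning behavior the mRequiredColumns array enables in getNext().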