This is a good point and I don't want it to fall off the radar. Hoping someone can answer the RequiredFieldList question.
-D

On Thu, Jun 10, 2010 at 2:56 PM, Scott Carey <[email protected]> wrote:

> I wish there was better documentation on that too.
>
> Looking at the PigStorage code, it serializes an array of Booleans via UDFContext to the backend.
>
> It would be significantly better if Pig serialized the requested fields for us, provided that pushProjection returned a code indicating that the projection would be supported.
>
> Forcing users to do that serialization themselves is bug-prone, especially in the presence of nested schemas.
>
> The documentation is also poor when it comes to describing what the RequiredFieldList even is.
>
> It has a name and an index field. The code itself seems to allow for either of these to be filled. What do they mean?
>
> Say the schema returned by the loader is:
>   (id: int, name: chararray, department: chararray)
> and the RequiredFieldList is [ ("department", 1), ("id", 0) ].
>
> What does that mean?
> * The name is the field name requested, and the index is the location it should occupy in the result -- so return (id: int, department: chararray)?
> * The index is the index in the source schema, and the name is for renaming -- so return (department: chararray, id: int), where the data in department is actually that from the original's name field?
> * The position in the RequiredFieldList array is the requested destination, the name is optional (if the schema had one), and the index is the location in the original schema -- so the RequiredFieldList above is actually impossible, since "department" is always index 2.
>
> I think it is the last one, but the first might be it too. Either way, the javadoc and other documentation describe neither what these values mean nor what their possible ranges might be.
>
> On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:
>
>> I'm trying to figure out how exactly to implement the LoadPushDown interface in my LoadFunc implementation. I need to take the list of column aliases passed to LoadPushDown.pushProjection(RequiredFieldList) and make it available in the getTuple function. I'm kind of new to this, so forgive me if this is obvious. From my readings of the mailing list it appears that the pushProjection function is called in the front-end whereas the getTuple function is called in the back-end. How does a LoadFunc pass information from the front-end to the back-end instances?
>>
>> regards, Andrew
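For what it's worth, the round trip Andrew is asking about -- and the pattern Scott describes PigStorage using -- looks roughly like the sketch below. The class name, property key, and tab-separated input are invented for illustration, and it assumes RequiredField.getIndex() is the position in the source schema (Scott's last interpretation); treat it as a sketch against the 0.7 API, not a reference implementation. getNext() is the 0.7 LoadFunc method Andrew's mail calls getTuple.

    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.LoadPushDown;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.util.UDFContext;

    public class MyProjectingLoader extends LoadFunc implements LoadPushDown {

        private static final String PROJECTION_KEY = "projected.column.indexes";

        private String signature;       // identifies this loader instance across front/back end
        private RecordReader reader;
        private Set<Integer> projected; // null means no projection was pushed

        @Override
        public void setUDFContextSignature(String signature) {
            this.signature = signature; // called on both the front-end and the back-end
        }

        // The UDFContext properties bag is the only loader state that travels
        // from the front-end JVM to the back-end tasks.
        private Properties props() {
            return UDFContext.getUDFContext()
                             .getUDFProperties(getClass(), new String[] { signature });
        }

        @Override
        public List<OperatorSet> getFeatures() {
            return Collections.singletonList(OperatorSet.PROJECTION);
        }

        // Front-end (as far as PigStorage's usage suggests): record which
        // column indexes were requested.
        @Override
        public RequiredFieldResponse pushProjection(RequiredFieldList requiredFields) {
            StringBuilder csv = new StringBuilder();
            for (RequiredField f : requiredFields.getFields()) {
                if (csv.length() > 0) csv.append(',');
                csv.append(f.getIndex()); // assumed: index into the source schema
            }
            props().setProperty(PROJECTION_KEY, csv.toString());
            return new RequiredFieldResponse(true);
        }

        // Back-end: rebuild the projection from the serialized property.
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            this.reader = reader;
            String csv = props().getProperty(PROJECTION_KEY);
            if (csv != null) {
                projected = new HashSet<Integer>();
                for (String idx : csv.split(",")) {
                    projected.add(Integer.valueOf(idx));
                }
            }
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;
                }
                String[] cols = reader.getCurrentValue().toString().split("\t");
                Tuple t = TupleFactory.getInstance().newTuple();
                for (int i = 0; i < cols.length; i++) {
                    if (projected == null || projected.contains(i)) {
                        t.append(cols[i]);
                    }
                }
                return t;
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public InputFormat getInputFormat() {
            return new TextInputFormat();
        }
    }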
>> On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <[email protected]> wrote:
>>
>>> A similar need is being expressed by the zebra folks here: https://issues.apache.org/jira/browse/PIG-1337. You might want to comment/vote on it, as it is scheduled for the 0.8 release.
>>>
>>> Loading data in prepareToRead() is fine. As a workaround, I think it should be OK to read the data directly from HDFS in each of the mappers, provided you aren't doing any costly namespace operations like 'listStatus' that can stress the namesystem when thousands of tasks execute them concurrently.
>>>
>>> Regards
>>> -...@nkur
>>>
>>> On 6/2/10 10:36 PM, "Scott Carey" <[email protected]> wrote:
>>>
>>> On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:
>>>
>>>> Scott,
>>>> You can set hadoop properties at the time of running your pig script with the -D option, so
>>>>
>>>>   pig -Dhadoop.property.name=something myscript
>>>>
>>>> essentially sets the property in the job configuration.
>>>>
>>> So no programmatic configuration of hadoop properties is allowed (where it is easier to control), but it is allowable to set them at the script level? I guess I can do that, but it complicates things.
>>> Also, this is a very poor way to do it. My script has 600 lines of Pig and ~45 M/R jobs. Only three of the jobs need the distributed cache, not all 45.
>>>
>>>> Speaking specifically of utilizing the distributed cache feature, you can just set the filename in the LoadFunc constructor and then load the data into memory in the getNext() method if it is not already loaded.
>>>>
>>> That is what the original idea was.
>>>
>>>> Here is the pig command to set up the distributed cache:
>>>>
>>>>   pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
>>>>       -Dmapred.create.symlink=yes \
>>>>       script.pig
>>>>
>>>> The name after the # needs to be passed to the UDF constructor so that the file is available in the mapper/reducer's working dir on the compute node.
>>>>
>>> If that property is set, then the constructor only needs file-name (the symlink), right? Right now I'm trying to set those properties using the DistributedCache static interfaces, which means I need to have access to the full path.
>>>
>>>> Implement something like a loadData() method that loads the data only once and invoke it from the getNext() method. The script will work even in local mode if the file distributed via the distributed cache resides in the CWD from which the script is invoked.
>>>>
>>> I'm loading the data in prepareToRead(), which seems most appropriate. Do you see any problem with that?
>>>
>>>> Hope that's helpful.
>>>
>>> I think the command-line property hack is insufficient. I am left with a choice of having a couple of jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up the distributed cache. Job setup time is already 1/4 of my processing time.
>>> Is there a feature request for Load/Store access to Hadoop job configuration properties?
>>>
>>> Ideally, this would be a method on LoadFunc that is passed a modifiable Configuration object on the front-end, or a callback where a user can optionally provide a Configuration object with the few properties to alter, which Pig would apply to the real thing before it configures its own properties.
>>>
>>> Thanks for the info Ankur,
>>>
>>> -Scott
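Putting Ankur's pieces together, here is a minimal sketch of the back-end side: the file shipped via mapred.cache.files shows up as a symlink in the task's working directory, so the loader just opens it by its short name and caches the parsed result. The helper class, file layout, and names are made up; loading from prepareToRead() or lazily from getNext() both reduce to "read it at most once per task".

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // The script is launched with something like:
    //
    //   pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file#lookup" \
    //       -Dmapred.create.symlink=yes script.pig
    //
    // so each task sees the file as ./lookup in its working directory. The
    // LoadFunc constructor then only needs the short symlink name ("lookup").
    public class CachedLookup {

        private final String symlinkName;  // e.g. "lookup", passed to the LoadFunc constructor
        private Map<String, String> table; // parsed once per task, on first use

        public CachedLookup(String symlinkName) {
            this.symlinkName = symlinkName;
        }

        // Call this from prepareToRead(), or lazily from getNext(); either way
        // the file is read at most once per task. In local mode this also works
        // if the file sits in the CWD the script was invoked from.
        public synchronized Map<String, String> get() throws IOException {
            if (table == null) {
                table = new HashMap<String, String>();
                BufferedReader in = new BufferedReader(new FileReader(symlinkName));
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] kv = line.split("\t", 2);
                        table.put(kv[0], kv.length > 1 ? kv[1] : "");
                    }
                } finally {
                    in.close();
                }
            }
            return table;
        }
    }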
>>>> -...@nkur
>>>>
>>>> On 6/2/10 2:53 PM, "Scott Carey" <[email protected]> wrote:
>>>>
>>>> So, here are some things I'm struggling with now:
>>>>
>>>> In a LoadFunc, I want to load something into the DistributedCache. The path is passed into the LoadFunc constructor as an argument.
>>>> The documentation for getSchema() and all the other metadata methods states that you can't modify the job or its configuration passed in. I've verified that changes to the Configuration are ignored if set there.
>>>>
>>>> It appears that I could set these properties in setLocation(), but that is called a lot on the back-end too, and the documentation does not state whether setLocation() is called at all on the front-end. Based on my experimental results, it doesn't seem to be.
>>>> Is there no way to modify Hadoop properties on the front-end to utilize hadoop features? UDFContext seems completely useless for setting hadoop properties for anything other than the UDF itself -- things like distributed cache settings. A stand-alone front-end hook for this would be great. Otherwise, any hack that works would be acceptable for now.
>>>>
>>>> * The documentation for LoadMetadata could use some information about when each method gets called -- front-end only? Between what other calls?
>>>> * UDFContext's documentation needs help too:
>>>> ** addJobConf() is public, but not expected to be used by end-users, right? Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses.
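The stand-alone front-end hook Scott asks for does not exist in Pig 0.7; purely as an illustration of the request, it might look like the sketch below (the interface name, semantics, and paths are invented):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    // Hypothetical: no such hook exists in Pig 0.7. This only illustrates the
    // shape of the request -- a front-end-only callback handed a modifiable
    // Configuration before the job is submitted.
    interface FrontendConfigurable {
        void setFrontendConfiguration(Configuration conf);
    }

    // A loader could then register its own cache file for just the jobs that
    // need it, instead of a script-wide pig -D flag (paths made up):
    class CacheUsingLoader implements FrontendConfigurable {
        @Override
        public void setFrontendConfiguration(Configuration conf) {
            DistributedCache.addCacheFile(
                URI.create("hdfs://namenode-host:8020/path/to/file#lookup"), conf);
            DistributedCache.createSymlink(conf);
        }
    }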
>>>> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>>>>
>>>>> Scott,
>>>>>
>>>>> I made an effort to address the documentation in https://issues.apache.org/jira/browse/PIG-1370. If you have a chance, take a look and let me know if it deals with the issues you have or if more work is needed.
>>>>>
>>>>> Alan.
>>>>>
>>>>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>>>>
>>>>>> I have been using these documents for a couple of weeks, implementing various store and load functionality, and they have been very helpful.
>>>>>>
>>>>>> However, there is room for improvement. What is most unclear is when the API methods get called. Each method should clearly state in these documents (and the javadoc) when it is called -- front-end only? back-end only? both? Sometimes this is obvious; other times it is not.
>>>>>> For example, without looking at the source code it is not possible to tell or infer whether pushProjection() is called on the front-end, the back-end, or both. It could be called on the front-end, expecting the loader implementation to persist the necessary state to UDFContext for the back-end; or be called only on the back-end; or both. One has to look at the PigStorage source to see that it persists the pushProjection information into UDFContext, so it is _probably_ only called on the front-end.
>>>>>>
>>>>>> There are also a few types that these interfaces return or are given that are completely undocumented. I had to look at the source code to figure out what ResourceStatistics does and how ResourceSchema should be used. RequiredField, RequiredFieldList, and RequiredFieldResponse are all poorly documented aspects of a public interface.
>>>>>>
>>>>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>>>>
>>>>>>> To add to this, there is also a how-to document on writing load/store functions from scratch in Pig 0.7 at http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>>>>
>>>>>>> Pradeep
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Alan Gates [mailto:[email protected]]
>>>>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>>>>> To: [email protected]
>>>>>>> Cc: Eli Collins
>>>>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>>>>
>>>>>>> At the Bay Area HUG on Wednesday someone (Eli, I think, though I might be remembering incorrectly) asked if there was a migration guide for moving Pig load and store functions from 0.6 to 0.7. I said there was, but I couldn't remember whether it had been posted yet. In fact it had already been posted, to http://wiki.apache.org/pig/LoadStoreMigrationGuide. You can also find the list of all incompatible changes for 0.7 at http://wiki.apache.org/pig/Pig070IncompatibleChanges. Sorry, I should have included those links in my original slides.
>>>>>>>
>>>>>>> Alan.
