This is a good point and I don't want it to fall off the radar. Hoping someone can answer the RequiredFieldList question.
-D

On Thu, Jun 10, 2010 at 2:56 PM, Scott Carey <[email protected]> wrote:

> I wish there was better documentation on that too.
>
> Looking at the PigStorage code, it serializes an array of Booleans via UDFContext to the backend.
>
> It would be significantly better if Pig serialized the requested fields for us, provided that pushProjection returned a code indicating that the projection would be supported.
>
> Forcing users to do that serialization themselves is bug-prone, especially in the presence of nested schemas.
>
> The documentation is also poor when it comes to describing what the RequiredFieldList even is.
>
> It has a name and an index field. The code itself seems to allow for either of these to be filled. What do they mean?
>
> Say the schema returned by the loader is:
>   (id: int, name: chararray, department: chararray)
> and the RequiredFieldList is [ ("department", 1), ("id", 0) ].
>
> What does that mean?
> * The name is the field name requested, and the index is the location it should occupy in the result -- so return (id: int, department: chararray)?
> * The index is the index in the source schema, and the name is for renaming -- so return (department: chararray, id: int), where the data in department is actually that from the original's name field?
> * The position in the RequiredFieldList array is the requested destination, the name is optional (if the schema had one), and the index is the location in the original schema -- so the RequiredFieldList above is actually impossible, since "department" is always index 2.
>
> I think it is the last one, but the first might be it too. Either way, the javadoc and other documentation describe neither what these values mean nor what their possible ranges might be.
>
> On Jun 5, 2010, at 6:34 PM, Andrew Rothstein wrote:
>
>> I'm trying to figure out how exactly to implement the LoadPushDown interface in my LoadFunc implementation. I need to take the list of column aliases passed to LoadPushDown.pushProjection(RequiredFieldList) and make it available in the getTuple function. I'm kind of new to this, so forgive me if this is obvious. From my readings of the mailing list it appears that the pushProjection function is called in the front-end whereas the getTuple function is called in the back-end. How does a LoadFunc pass information from the front-end to the back-end instances?
>>
>> regards, Andrew
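For what it's worth, the round trip Andrew is asking about -- and the pattern Scott describes PigStorage using -- looks roughly like the sketch below. The class name, property key, and tab-separated input are invented for illustration, and it assumes RequiredField.getIndex() is the position in the source schema (Scott's last interpretation); treat it as a sketch against the 0.7 API, not a reference implementation. getNext() is the 0.7 LoadFunc method Andrew's mail calls getTuple.

    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.LoadPushDown;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.util.UDFContext;

    public class MyProjectingLoader extends LoadFunc implements LoadPushDown {

        private static final String PROJECTION_KEY = "projected.column.indexes";

        private String signature;       // identifies this loader instance across front/back end
        private RecordReader reader;
        private Set<Integer> projected; // null means no projection was pushed

        @Override
        public void setUDFContextSignature(String signature) {
            this.signature = signature; // called on both the front-end and the back-end
        }

        // The UDFContext properties bag is the only loader state that travels
        // from the front-end JVM to the back-end tasks.
        private Properties props() {
            return UDFContext.getUDFContext()
                             .getUDFProperties(getClass(), new String[] { signature });
        }

        @Override
        public List<OperatorSet> getFeatures() {
            return Collections.singletonList(OperatorSet.PROJECTION);
        }

        // Front-end (as far as PigStorage's usage suggests): record which
        // column indexes were requested.
        @Override
        public RequiredFieldResponse pushProjection(RequiredFieldList requiredFields) {
            StringBuilder csv = new StringBuilder();
            for (RequiredField f : requiredFields.getFields()) {
                if (csv.length() > 0) csv.append(',');
                csv.append(f.getIndex()); // assumed: index into the source schema
            }
            props().setProperty(PROJECTION_KEY, csv.toString());
            return new RequiredFieldResponse(true);
        }

        // Back-end: rebuild the projection from the serialized property.
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            this.reader = reader;
            String csv = props().getProperty(PROJECTION_KEY);
            if (csv != null) {
                projected = new HashSet<Integer>();
                for (String idx : csv.split(",")) {
                    projected.add(Integer.valueOf(idx));
                }
            }
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;
                }
                String[] cols = reader.getCurrentValue().toString().split("\t");
                Tuple t = TupleFactory.getInstance().newTuple();
                for (int i = 0; i < cols.length; i++) {
                    if (projected == null || projected.contains(i)) {
                        t.append(cols[i]);
                    }
                }
                return t;
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public InputFormat getInputFormat() {
            return new TextInputFormat();
        }
    }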
>> On Thu, Jun 3, 2010 at 7:04 AM, Ankur C. Goel <[email protected]> wrote:
>>
>>> A similar need is being expressed by the zebra folks here: https://issues.apache.org/jira/browse/PIG-1337. You might want to comment/vote on it, as it is scheduled for the 0.8 release.
>>>
>>> Loading data in prepareToRead() is fine. As a workaround, I think it should be OK to read the data directly from HDFS in each of the mappers, provided you aren't doing any costly namespace operations like 'listStatus' that can stress the namesystem when thousands of tasks execute them concurrently.
>>>
>>> Regards
>>> -...@nkur
>>>
>>> On 6/2/10 10:36 PM, "Scott Carey" <[email protected]> wrote:
>>>
>>> On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:
>>>
>>>> Scott,
>>>> You can set hadoop properties at the time of running your pig script with the -D option, so
>>>>
>>>>   pig -Dhadoop.property.name=something myscript
>>>>
>>>> essentially sets the property in the job configuration.
>>>>
>>> So no programmatic configuration of hadoop properties is allowed (where it is easier to control), but it is allowable to set them at the script level? I guess I can do that, but it complicates things.
>>> Also, this is a very poor way to do it. My script has 600 lines of Pig and ~45 M/R jobs. Only three of the jobs need the distributed cache, not all 45.
>>>
>>>> Speaking specifically of utilizing the distributed cache feature, you can just set the filename in the LoadFunc constructor and then load the data into memory in the getNext() method if it is not already loaded.
>>>>
>>> That is what the original idea was.
>>>
>>>> Here is the pig command to set up the distributed cache:
>>>>
>>>>   pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
>>>>       -Dmapred.create.symlink=yes \
>>>>       script.pig
>>>>
>>>> The name after the # needs to be passed to the UDF constructor so that the file is available in the mapper/reducer's working dir on the compute node.
>>>>
>>> If that property is set, then the constructor only needs file-name (the symlink), right? Right now I'm trying to set those properties using the DistributedCache static interfaces, which means I need to have access to the full path.
>>>
>>>> Implement something like a loadData() method that loads the data only once and invoke it from the getNext() method. The script will work even in local mode if the file distributed via the distributed cache resides in the CWD from which the script is invoked.
>>>>
>>> I'm loading the data in prepareToRead(), which seems most appropriate. Do you see any problem with that?
>>>
>>>> Hope that's helpful.
>>>
>>> I think the command-line property hack is insufficient. I am left with a choice of having a couple of jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up the distributed cache. Job setup time is already 1/4 of my processing time.
>>> Is there a feature request for Load/Store access to Hadoop job configuration properties?
>>>
>>> Ideally, this would be a method on LoadFunc that is passed a modifiable Configuration object on the front-end, or a callback where a user can optionally provide a Configuration object with the few properties to alter, which Pig would apply to the real thing before it configures its own properties.
>>>
>>> Thanks for the info Ankur,
>>>
>>> -Scott
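Putting Ankur's pieces together, here is a minimal sketch of the back-end side: the file shipped via mapred.cache.files shows up as a symlink in the task's working directory, so the loader just opens it by its short name and caches the parsed result. The helper class, file layout, and names are made up; loading from prepareToRead() or lazily from getNext() both reduce to "read it at most once per task".

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // The script is launched with something like:
    //
    //   pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file#lookup" \
    //       -Dmapred.create.symlink=yes script.pig
    //
    // so each task sees the file as ./lookup in its working directory. The
    // LoadFunc constructor then only needs the short symlink name ("lookup").
    public class CachedLookup {

        private final String symlinkName;  // e.g. "lookup", passed to the LoadFunc constructor
        private Map<String, String> table; // parsed once per task, on first use

        public CachedLookup(String symlinkName) {
            this.symlinkName = symlinkName;
        }

        // Call this from prepareToRead(), or lazily from getNext(); either way
        // the file is read at most once per task. In local mode this also works
        // if the file sits in the CWD the script was invoked from.
        public synchronized Map<String, String> get() throws IOException {
            if (table == null) {
                table = new HashMap<String, String>();
                BufferedReader in = new BufferedReader(new FileReader(symlinkName));
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] kv = line.split("\t", 2);
                        table.put(kv[0], kv.length > 1 ? kv[1] : "");
                    }
                } finally {
                    in.close();
                }
            }
            return table;
        }
    }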
>>>> -...@nkur
>>>>
>>>> On 6/2/10 2:53 PM, "Scott Carey" <[email protected]> wrote:
>>>>
>>>> So, here are some things I'm struggling with now:
>>>>
>>>> In a LoadFunc, I want to load something into the DistributedCache. The path is passed into the LoadFunc constructor as an argument.
>>>> The documentation for getSchema() and all the other metadata methods states that you can't modify the job or its configuration passed in. I've verified that changes to the Configuration are ignored if set there.
>>>>
>>>> It appears that I could set these properties in setLocation(), but that is called a lot on the back-end too, and the documentation does not state whether setLocation() is called at all on the front-end. Based on my experimental results, it doesn't seem to be.
>>>> Is there no way to modify Hadoop properties on the front-end to utilize hadoop features? UDFContext seems completely useless for setting hadoop properties for anything other than the UDF itself -- things like distributed cache settings. A stand-alone front-end hook for this would be great. Otherwise, any hack that works would be acceptable for now.
>>>>
>>>> * The documentation for LoadMetadata could use some information about when each method gets called -- front-end only? Between what other calls?
>>>> * UDFContext's documentation needs help too:
>>>> ** addJobConf() is public, but not expected to be used by end-users, right? Several public methods here look like they need better documentation, and the class itself could use a javadoc entry with some example uses.
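The stand-alone front-end hook Scott asks for does not exist in Pig 0.7; purely as an illustration of the request, it might look like the sketch below (the interface name, semantics, and paths are invented):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    // Hypothetical: no such hook exists in Pig 0.7. This only illustrates the
    // shape of the request -- a front-end-only callback handed a modifiable
    // Configuration before the job is submitted.
    interface FrontendConfigurable {
        void setFrontendConfiguration(Configuration conf);
    }

    // A loader could then register its own cache file for just the jobs that
    // need it, instead of a script-wide pig -D flag (paths made up):
    class CacheUsingLoader implements FrontendConfigurable {
        @Override
        public void setFrontendConfiguration(Configuration conf) {
            DistributedCache.addCacheFile(
                URI.create("hdfs://namenode-host:8020/path/to/file#lookup"), conf);
            DistributedCache.createSymlink(conf);
        }
    }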
>>>> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>>>>
>>>>> Scott,
>>>>>
>>>>> I made an effort to address the documentation in https://issues.apache.org/jira/browse/PIG-1370. If you have a chance, take a look and let me know if it deals with the issues you have or if more work is needed.
>>>>>
>>>>> Alan.
>>>>>
>>>>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>>>>
>>>>>> I have been using these documents for a couple of weeks, implementing various store and load functionality, and they have been very helpful.
>>>>>>
>>>>>> However, there is room for improvement. What is most unclear is when the API methods get called. Each method should clearly state in these documents (and the javadoc) when it is called -- front-end only? back-end only? both? Sometimes this is obvious; other times it is not.
>>>>>> For example, without looking at the source code it is not possible to tell or infer whether pushProjection() is called on the front-end, the back-end, or both. It could be called on the front-end, expecting the loader implementation to persist the necessary state to UDFContext for the back-end; or be called only on the back-end; or both. One has to look at the PigStorage source to see that it persists the pushProjection information into UDFContext, so it is _probably_ only called on the front-end.
>>>>>>
>>>>>> There are also a few types that these interfaces return or are given that are completely undocumented. I had to look at the source code to figure out what ResourceStatistics does and how ResourceSchema should be used. RequiredField, RequiredFieldList, and RequiredFieldResponse are all poorly documented aspects of a public interface.
>>>>>>
>>>>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>>>>
>>>>>>> To add to this, there is also a how-to document on writing load/store functions from scratch in Pig 0.7 at http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>>>>
>>>>>>> Pradeep
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Alan Gates [mailto:[email protected]]
>>>>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>>>>> To: [email protected]
>>>>>>> Cc: Eli Collins
>>>>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>>>>
>>>>>>> At the Bay Area HUG on Wednesday someone (Eli, I think, though I might be remembering incorrectly) asked if there was a migration guide for moving Pig load and store functions from 0.6 to 0.7. I said there was, but I couldn't remember whether it had been posted yet. In fact it had already been posted, to http://wiki.apache.org/pig/LoadStoreMigrationGuide. You can also find the list of all incompatible changes for 0.7 at http://wiki.apache.org/pig/Pig070IncompatibleChanges. Sorry, I should have included those links in my original slides.
>>>>>>>
>>>>>>> Alan.
