On Jun 2, 2010, at 4:49 AM, Ankur C. Goel wrote:

> Scott,
> You can set hadoop properties at the time of running your pig script
> with the -D option. So
>
>     pig -Dhadoop.property.name=something myscript
>
> essentially sets the property in the job configuration.
So no programmatic configuration of hadoop properties is allowed (where it's easier to control), but it's allowable to set it at the script level? I guess I can do that, but it complicates things. It is also a very poor way to do this: my script has 600 lines of Pig and ~45 M/R jobs. Only three of the jobs need the distributed cache, not all 45.

> Speaking specifically of utilizing the distributed cache feature, you can
> just set the filename in the LoadFunc constructor and then load the data into
> memory in the getNext() method if not already loaded.

That is what the original idea was.

> Here is the pig command to set up the distributed cache:
>
>     pig -Dmapred.cache.files="hdfs://namenode-host:port/path/to/file/for/distributed-cache#file-name" \
>         -Dmapred.create.symlink=yes \
>         script.pig
>
> ---> file-name needs to be passed to the UDF constructor so that it's available
> in the mapper/reducer's working dir on the compute node.

If that property is set, then the constructor only needs file-name (the symlink), right? Right now I'm trying to set those properties using the DistributedCache static interfaces, which means I need to have access to the full path.

> Implement something like a loadData() method that loads the data only once
> and invoke it from the getNext() method. The script will work even in local
> mode if the file distributed via the distributed cache resides in the CWD from
> which the script is invoked.

I'm loading the data in prepareToRead(), which seems most appropriate. Do you see any problem with that?

> Hope that's helpful.

I think the command-line property hack is insufficient. I am left with a choice of having a couple of jobs read the file from HDFS directly in their mappers, or having all jobs unnecessarily set up the distributed cache. Job setup time is already 1/4 of my processing time.

Is there a feature request for Load/Store access to Hadoop job configuration properties?
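A minimal sketch of the loader pattern described above: the constructor receives only the symlink name, and the side file is read at most once per task from the working directory in prepareToRead(). The class name, the tab-separated file layout, and the delegation to PigStorage are illustrative assumptions, not anything from the thread:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

// Hypothetical loader: tuple reading is delegated to PigStorage; the side
// file shipped via -Dmapred.cache.files=...#<symlinkName> (with
// -Dmapred.create.symlink=yes) is loaded lazily into a map.
public class SymlinkedLookupLoader extends LoadFunc {

    private final String symlinkName;      // e.g. "lookup.dat", passed from the script
    private final PigStorage delegate = new PigStorage();
    private Map<String, String> lookup;    // null until first prepareToRead()

    public SymlinkedLookupLoader(String symlinkName) {
        this.symlinkName = symlinkName;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        delegate.setLocation(location, job);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return delegate.getInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        delegate.prepareToRead(reader, split);
        if (lookup == null) {              // load the side data only once per task
            lookup = loadData();
        }
    }

    // Reads the symlinked cache file from the task's CWD; in local mode the
    // same file just has to sit in the directory the script is run from.
    private Map<String, String> loadData() throws IOException {
        Map<String, String> m = new HashMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader(symlinkName));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);  // assumed key<TAB>value layout
                if (kv.length == 2) {
                    m.put(kv[0], kv[1]);
                }
            }
        } finally {
            in.close();
        }
        return m;
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple t = delegate.getNext();
        // decorate/filter t using `lookup` here as needed
        return t;
    }
}
```

In the script this would then be used as `A = LOAD 'input' USING SymlinkedLookupLoader('lookup.dat');`, together with the two -D properties from the command above.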
Ideally, this would be a method on LoadFunc that passes a modifiable Configuration object in on the front-end, or a callback where a user can optionally provide a Configuration object containing the few properties to alter, which Pig can apply to the real configuration before it sets its own properties.

Thanks for the info Ankur,
-Scott

> -...@nkur
>
> On 6/2/10 2:53 PM, "Scott Carey" <[email protected]> wrote:
>
> So, here are some things I'm struggling with now:
>
> In a LoadFunc, if I want to load something into the DistributedCache, the path
> is passed into the LoadFunc constructor as an argument.
> The documentation on getSchema() and all the other metadata methods states that
> you can't modify the job or its configuration passed in. I've verified that
> changes to the Configuration are ignored if set there.
>
> It appears that I could set these properties in setLocation(), but that is
> called a lot on the back-end too, and the documentation does not state whether
> setLocation() is called at all on the front-end. Based on my experimental
> results, it doesn't seem to be.
> Is there no way to modify Hadoop properties on the front-end to utilize
> Hadoop features? UDFContext seems completely useless for setting Hadoop
> properties for things other than the UDF itself -- like distributed cache
> settings. A stand-alone front-end hook for this would be great. Otherwise,
> any hack that works would be acceptable for now.
>
> * The documentation for LoadMetadata could use some information about when each
>   method gets called -- front-end only? Between what other calls?
> * UDFContext's documentation needs help too:
>   ** addJobConf() is public, but not expected to be used by end-users, right?
>   Several public methods here look like they need better documentation, and the
>   class itself could use a javadoc entry with some example uses.
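A sketch of the kind of front-end hook being asked for above. The FrontendConfigurable interface is purely hypothetical (no such hook exists in Pig); only the DistributedCache static calls are the actual Hadoop 0.20 API that the message says it is currently trying to use:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

/** Hypothetical callback: Pig would invoke this once, on the front-end only,
 *  before finalizing the job configuration. Not part of the Pig API. */
interface FrontendConfigurable {
    void setFrontendConfiguration(Configuration conf) throws IOException;
}

/** Illustrative loader-side implementation of the hypothetical hook. */
class CacheUsingLoaderSketch implements FrontendConfigurable {

    private final String hdfsPath;  // e.g. "hdfs://namenode-host:port/path/lookup.dat"
    private final String symlink;   // e.g. "lookup.dat"

    CacheUsingLoaderSketch(String hdfsPath, String symlink) {
        this.hdfsPath = hdfsPath;
        this.symlink = symlink;
    }

    /** Only the loaders that need the cache would set it up -- not all 45 jobs. */
    public void setFrontendConfiguration(Configuration conf) throws IOException {
        try {
            // Equivalent of -Dmapred.cache.files=...#symlink
            DistributedCache.addCacheFile(new URI(hdfsPath + "#" + symlink), conf);
            // Equivalent of -Dmapred.create.symlink=yes
            DistributedCache.createSymlink(conf);
        } catch (URISyntaxException e) {
            throw new IOException("Bad cache file URI: " + e.getMessage());
        }
    }
}
```

With a hook like this, the two distributed-cache properties would be scoped to the jobs that actually need them instead of being forced onto the whole script from the command line.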
>
> On May 24, 2010, at 11:06 AM, Alan Gates wrote:
>
>> Scott,
>>
>> I made an effort to address the documentation in
>> https://issues.apache.org/jira/browse/PIG-1370
>> If you have a chance, take a look and let me know if it deals with
>> the issues you have or if more work is needed.
>>
>> Alan.
>>
>> On May 24, 2010, at 11:00 AM, Scott Carey wrote:
>>
>>> I have been using these documents for a couple of weeks, implementing
>>> various store and load functionality, and they have been very helpful.
>>>
>>> However, there is room for improvement. What is most unclear is
>>> when the API methods get called. Each method should clearly state
>>> in these documents (and the javadoc) when it is called -- front-end
>>> only? Back-end only? Both? Sometimes this is obvious; other times
>>> it is not.
>>> For example, without looking at the source code it's not possible to
>>> tell or infer whether pushProjection() is called on the front-end, the
>>> back-end, or both. It could be implemented by being called on the
>>> front-end, expecting the loader implementation to persist the necessary
>>> state to UDFContext for the back-end, or be called only on the back-end,
>>> or both. One has to look at the PigStorage source to see that it
>>> persists the pushProjection information into UDFContext, so it's
>>> _probably_ only called on the front-end.
>>>
>>> There are also a few types that these interfaces return or are
>>> provided with that are completely undocumented. I had to look at the
>>> source code to figure out what ResourceStatistics does and how
>>> ResourceSchema should be used. RequiredField, RequiredFieldList,
>>> and RequiredFieldResponse are all poorly documented aspects of a
>>> public interface.
>>>
>>>
>>> On May 21, 2010, at 11:42 AM, Pradeep Kamath wrote:
>>>
>>>> To add to this, there is also a how-to document on how to go about
>>>> writing load/store functions from scratch in Pig 0.7 at
>>>> http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
>>>>
>>>> Pradeep
>>>>
>>>> -----Original Message-----
>>>> From: Alan Gates [mailto:[email protected]]
>>>> Sent: Friday, May 21, 2010 11:33 AM
>>>> To: [email protected]
>>>> Cc: Eli Collins
>>>> Subject: Pig loader 0.6 to 0.7 migration guide
>>>>
>>>> At the Bay Area HUG on Wednesday someone (Eli, I think, though I might
>>>> be remembering incorrectly) asked if there was a migration guide for
>>>> moving Pig load and store functions from 0.6 to 0.7. I said there was,
>>>> but I couldn't remember if it had been posted yet or not. In fact it
>>>> had already been posted to
>>>> http://wiki.apache.org/pig/LoadStoreMigrationGuide
>>>> Also, you can find the list of all incompatible changes for 0.7 at
>>>> http://wiki.apache.org/pig/Pig070IncompatibleChanges
>>>> Sorry, I should have included those links in my original slides.
>>>>
>>>> Alan.
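The UDFContext pattern discussed in the thread above (PigStorage persisting pushProjection state on the front-end so the back-end can read it) can be sketched roughly as follows. The property key and method names here are illustrative assumptions; the signature string is the one Pig hands the loader via LoadFunc.setUDFContextSignature():

```java
import java.util.Properties;

import org.apache.pig.impl.util.UDFContext;

// Illustrative sketch of persisting front-end state for the back-end through
// UDFContext. The key "required.columns" and these helper methods are made up
// for the example; only the UDFContext calls themselves are the real API.
class ProjectionPersistenceSketch {

    private final String signature;  // received in setUDFContextSignature()

    ProjectionPersistenceSketch(String signature) {
        this.signature = signature;
    }

    private Properties props() {
        // Properties scoped to this UDF class + signature, serialized into the
        // job configuration so they survive the trip to the back-end.
        return UDFContext.getUDFContext()
                .getUDFProperties(this.getClass(), new String[] { signature });
    }

    /** Front-end: remember which columns the optimizer asked for. */
    void rememberProjection(String commaSeparatedIndexes) {
        props().setProperty("required.columns", commaSeparatedIndexes);
    }

    /** Back-end: retrieve what the front-end stored, or null if nothing was pushed. */
    String recallProjection() {
        return props().getProperty("required.columns");
    }
}
```

This is the mechanism the thread says PigStorage uses, and it is why, as noted above, one can infer that pushProjection() is probably called only on the front-end.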
