Tanton Gibbs
Fri, 20 Jun 2008 16:31:29 -0700
First, I really like the formatting and the explanation behind the tutorial. I thought it was well written and gave a lot of useful information. Now, for the potential improvements:

1) I'm a bit annoyed at how many UDFs I need to "write" to do the work done by example 1. This is somewhat of a turn-off.

   UDFs
   1. NonURLDetector - determine if query field is empty or a URL (by the way, shouldn't this be a URLDetector if it decides that it is a URL? The negative of a negative may get confusing).
   2. toLower - lowercase
   3. extractHour - pull out hour from datetime stamp
   4. NGramGenerator - generate ngrams
   5. ScoreGenerator - generate scores

   It seems to me that the first three are just simple regexes or substitutions. Could there not just be a replace or match function that takes the place of all of these "custom" UDFs? I'm not saying it should be built into the language, just a UDF that does match or replace. The last two are the actual "logic" and are what someone would expect to have to write.

2) The source code for the UDFs doesn't come in the .tar.gz file. Of course, if they have the svn repository checked out they can get to it, but it would be nice to include it in the original download...perhaps you could just put it in the tutorial.jar?

3) I never get the "full" pig scripts on the web page, only the decomposed ones that you have commented on. It might be nice to see, after the description, the final script.

4) On the web page, I don't see any examples. Of course, I can download it and run it to see what it does, but it would be nice to have a few records done via "illustrate" on the web page. That way I could see how the records change after each pig statement. The user comments are nice, but nothing helps like an example.

If you think any of these ideas are worthwhile, I'd be happy to do them, just let me know.

Finally, the size of a language's standard library is a determining factor in its success.
PiggyBank looks to be a good start, but I think you're going to need to put some thought into what UDFs are packaged as "standard" with Pig. These functions will need to be of a higher quality than those allowed in the PiggyBank. Things like match, replace, the math functions, etc. would make good candidates. Of course there are many, many more. I imagine, though, that there could be a promotion path from the PiggyBank into the standard library.

Thanks!
Tanton

On Fri, Jun 20, 2008 at 5:35 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> Hi,
>
> If you are new to Pig, the best place to start is by trying out our
> brand new tutorial: http://wiki.apache.org/pig/PigTutorial.
>
> We hope that you find it useful and informative!
>
> As always, your feedback is welcome!
>
> Olga
>
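
The generic match/replace idea from point 1 can be roughed out as follows. This is only a sketch in plain Java, deliberately not written against Pig's UDF interface so that it stays self-contained. The class and method names (RegexHelpers, matches, extract, replaceAll) are hypothetical, and the patterns and timestamp layout mentioned afterwards are assumptions for illustration, not taken from the tutorial's source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or the tutorial: plain Java only.
final class RegexHelpers {
    private RegexHelpers() {}

    // True if the whole input matches the regex; null input never matches.
    static boolean matches(String input, String regex) {
        return input != null && Pattern.matches(regex, input);
    }

    // Returns the requested capture group of the first match, or null if
    // the input is null or the regex does not match anywhere.
    static String extract(String input, String regex, int group) {
        if (input == null) {
            return null;
        }
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // Replaces every match of the regex; null input is passed through.
    static String replaceAll(String input, String regex, String replacement) {
        return input == null ? null : input.replaceAll(regex, replacement);
    }
}
```

Under these assumptions, a call such as RegexHelpers.matches(query, "^[a-zA-Z]+://.*") could stand in for the URL check, and RegexHelpers.extract(ts, "^\\d{6}(\\d{2})", 1) could stand in for extractHour if the timestamp were laid out as yyMMddHHmmss. Lowercasing needs no regex at all, since String.toLowerCase() already covers it.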