Alan, any thoughts on performance baselines and benchmarks? I am a little surprised that you think SQL is a requirement for 1.0, since it's essentially an overlay, not core functionality.
What about the storage layer rewrite (or is that what you referred to with your first bullet-point)? Also, the subject of making more (or all) operators nestable within a foreach comes up now and then. Would you consider this important for 1.0, or something that can wait? Integration with other languages (à la PyPig)? The Roadmap on the Wiki is still "as of Q3 2007", which makes it hard for an outside contributor to know where to jump :-).

-D

On Wed, Jun 24, 2009 at 10:02 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Integration with Owl is something we want for 1.0. I am hopeful that by
> Pig's 1.0 Owl will have flown the coop and become either a subproject or
> found a home in Hadoop's common, since it will hopefully be used by
> multiple other subprojects.
>
> Alan.
>
> On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote:
>
>> For 1.0 - complete Owl?
>>
>> http://wiki.apache.org/pig/Metadata
>>
>> Russell Jurney
>> rjur...@cloudstenography.com
>>
>> On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:
>>
>>> I don't believe there's a solid list of want-to-haves for 1.0. The big
>>> issue I see is that there are too many interfaces that are still
>>> shifting, such as:
>>>
>>> 1) Data input/output formats. The way we do slicing (that is,
>>> user-provided InputFormats) and the equivalent outputs aren't yet
>>> solid. They are still too tied to load and store functions. We need to
>>> break those out and understand how they will be expressed in the
>>> language. Related to this is the semantics of how Pig interacts with
>>> non-file-based inputs and outputs. We have a suggestion of moving to
>>> URLs, but we haven't finished test driving this to see if it will
>>> really be what we want.
>>>
>>> 2) The memory model. While technically the choices we make on how to
>>> represent things in memory are internal, the reality is that these
>>> changes may affect the way we read and write tuples and bags, which in
>>> turn may affect our load, store, eval, and filter functions.
>>>
>>> 3) SQL. We're working on introducing SQL soon, and it will take a few
>>> releases to be fully baked.
>>>
>>> 4) Much better error messages. In 0.2 our error messages made a leap
>>> forward, but before we can claim to be 1.0 I think they need to make 2
>>> more leaps: 1) they need to be written in a way end users can
>>> understand them instead of in a way engineers can understand them,
>>> including having sufficient error documentation with suggested courses
>>> of action, etc.; 2) they need to be much better at tying errors back to
>>> where they happened in the script. Right now, if one of the MR jobs
>>> associated with a Pig Latin script fails, there is no way to know what
>>> part of the script it is associated with.
>>>
>>> There are probably others, but those are the ones I can think of off
>>> the top of my head. The summary from my viewpoint is we still have
>>> several 0.x releases before we're ready to consider 1.0. It would be
>>> nice to be 1.0 not too long after Hadoop is, which still gives us at
>>> least 6-9 months.
>>>
>>> Alan.
>>>
>>> On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:
>>>
>>>> I know there was some discussion of making the types release (0.2) a
>>>> "Pig 1" release, but that got nixed. There wasn't a similar discussion
>>>> on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed
>>>> since?