This sounds reasonable. The challenge is mostly that finding split points for arbitrary queries is hard at the moment, so I would hold off on doing this until after the 1.4.0 rollout is done. We have a new feature we're rolling out that exposes a uniform random sample of keys as an index so that we can find split points easily. It is enabled by 1.4.0, although we're going to have to do some work to backfill the sample with preexisting entities. I think it should make this a lot easier.
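To give a rough feel for why a key sample helps, here is a minimal sketch of the general technique: sort the sampled keys and take evenly spaced ones as shard boundaries. This is only an illustration, not the actual datastore feature or its API. Plain strings stand in for keys, and the SplitPoints class and chooseSplitPoints method are names I made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitPoints {

    /**
     * Picks up to shardCount - 1 split points from a uniform random sample
     * of keys. Hypothetical helper: Strings stand in for datastore keys.
     */
    public static List<String> chooseSplitPoints(List<String> sampledKeys, int shardCount) {
        if (shardCount < 1) {
            throw new IllegalArgumentException("shardCount must be >= 1");
        }
        // Sort a copy so the caller's list is left untouched.
        List<String> sorted = new ArrayList<String>(sampledKeys);
        Collections.sort(sorted);

        List<String> splits = new ArrayList<String>();
        for (int i = 1; i < shardCount; i++) {
            // Evenly spaced positions in the sorted sample approximate
            // evenly sized key ranges in the full data set.
            int index = (int) ((long) i * sorted.size() / shardCount);
            if (index >= sorted.size()) {
                break; // only happens when the sample is empty
            }
            String candidate = sorted.get(index);
            // Skip repeats so that no shard covers an empty range.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

With a sample of eight keys "a" through "h" and four shards, this picks "c", "e" and "g" as boundaries. The larger the sample, the closer the shards come to equal size.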
One other thought: instead of adding a GQL interpreter, you might just add a hook for loading a class provided by the user. That class would implement a Filter interface with a method that takes a Configuration and returns a Query object. In your example, mailing and timestamp would get passed in as Configuration parameters, and a Filter class provided by the user would build the Query object corresponding to the GQL statement you gave. It would act kind of like a templating language for building queries. Does that make sense and sound like a good idea?

On Nov 18, 7:01 am, Nacho Coloma <[email protected]> wrote:
> > I'm not entirely sure I understand the scope of the proposed patch.
> > Are you thinking about adding filters at the DatastoreRecordReader
> > level? It's not entirely clear to me that that provides a benefit
> > over just applying the filter at the start of the map() function.
> > Totally willing to believe I'm missing something, though.
>
> The map() filter runs against your quota. This is OK for once-only tasks
> such as schema upgrades, but Mappers can also be used for repetitive tasks
> such as mailing, data cleanup, etc. For these cases, being able to work on a
> subset of data is important (process only user accounts with mailing
> enabled, for example).
>
> The biggest problem to resolve is how to specify the filter clause in
> mapreduce.xml. I am considering implementing a GQL parser as simple as
> possible, and inject servlet request parameters. Something like:
>
> <property>
>   <name>mapreduce.mapper.inputformat.datastoreinputformat.query</name>
>   <value>select * from users where mailing=:value1 and
>   timestamp<=:value2</value>
> </property>
>
> This implies porting the GQL implementation from python to Java, or
> implementing an ANTLR-based parser. I feel like I am reinventing the wheel,
> so any suggestion to use something that exists (or aim to a simpler design)
> is welcome.
>
> > On a logistical note, for nontrivial contributions, we require a CLA
> > from either you or your employer (depending on who owns the copyright
> > for your work) before we can accept significant contributions. The
> > relevant forms are at:
> > http://code.google.com/legal/individual-cla-v1.0.html
> > and http://code.google.com/legal/corporate-cla-v1.0.html. Feel free to
> > email me privately if this is an issue.
>
> No problem with that.
>
> Regards,
>
> Nacho.
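
P.S. To make the Filter hook idea a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration. The nested Configuration and Query classes are tiny stand-ins for the Hadoop Configuration and the datastore Query so that the example compiles on its own. The "filterclass" property name, the buildQuery method name and the MailingFilter class are all made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterHookSketch {

    /** Stand-in for Hadoop's Configuration: just named string properties. */
    public static class Configuration {
        private final Map<String, String> props = new HashMap<String, String>();
        public void set(String name, String value) { props.put(name, value); }
        public String get(String name) { return props.get(name); }
    }

    /** Stand-in for the datastore Query: an entity kind plus filter clauses. */
    public static class Query {
        public final String kind;
        public final List<String> filters = new ArrayList<String>();
        public Query(String kind) { this.kind = kind; }
        public Query addFilter(String property, String operator, Object value) {
            filters.add(property + " " + operator + " " + value);
            return this;
        }
    }

    /** The proposed hook: user code turns job parameters into a query. */
    public interface Filter {
        Query buildQuery(Configuration conf);
    }

    /** User-provided class playing the role of the GQL statement in the example. */
    public static class MailingFilter implements Filter {
        public Query buildQuery(Configuration conf) {
            return new Query("users")
                .addFilter("mailing", "=", Boolean.valueOf(conf.get("mailing")))
                .addFilter("timestamp", "<=", Long.valueOf(conf.get("timestamp")));
        }
    }

    /** Hypothetical property naming the user's Filter class. */
    public static final String FILTER_CLASS_PROPERTY =
        "mapreduce.mapper.inputformat.datastoreinputformat.filterclass";

    /** How the input format might load the user's class and ask it for a query. */
    public static Query loadQuery(Configuration conf) {
        String className = conf.get(FILTER_CLASS_PROPERTY);
        try {
            Filter filter = (Filter) Class.forName(className)
                .getDeclaredConstructor().newInstance();
            return filter.buildQuery(conf);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load filter class " + className, e);
        }
    }
}
```

The point of the design is that mapreduce.xml would only need to name the class and supply the mailing and timestamp values as ordinary properties, so no GQL parser has to be ported or written.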
