[appengine-java] Re: mapreduce - passing filters

frew Tue, 23 Nov 2010 20:02:08 -0800

This sounds reasonable. The challenge is mostly that finding split
points for arbitrary queries is hard at the moment. I would hold off
on doing this until after the 1.4.0 rollout is done. We have a new
feature we're rolling out (enabled by 1.4.0, although we're going to
have to do some work to backfill the sample with preexisting entities)
that involves exposing a uniform random sample of keys as an index so
that we can find split points easily. I think it should make this a
lot easier.
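
To make the split-point idea concrete, here is a rough sketch of how
shard boundaries could be picked from a random sample of keys. This is
not the actual feature; the class name SplitPoints, the choose() method,
and the use of plain strings as keys are all assumptions made for
illustration, and it assumes the sample is small enough to sort in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: picks shard boundaries from a uniform random
// sample of keys. Keys are modeled as plain strings for illustration.
final class SplitPoints {

    // Returns up to (shards - 1) split keys taken at evenly spaced
    // positions of the sorted sample, so each key range should hold
    // roughly the same number of entities.
    static List<String> choose(List<String> sampledKeys, int shards) {
        if (shards < 1) {
            throw new IllegalArgumentException("shards must be >= 1");
        }
        List<String> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);

        List<String> splits = new ArrayList<>();
        for (int i = 1; i < shards; i++) {
            int idx = (int) ((long) i * sorted.size() / shards);
            if (idx >= sorted.size()) {
                break; // only reachable when the sample is empty
            }
            String candidate = sorted.get(idx);
            // Skip duplicates so no shard gets an empty key range.
            if (splits.isEmpty()
                    || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

Because the sample is uniform over the keys, evenly spaced positions in
the sorted sample approximate evenly sized ranges over the full data.
A sample drawn from only the entities matching a filter would be needed
to get the same balance for filtered queries.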


One other thought: instead of adding a GQL interpreter, you might just
add a hook for loading a class provided by the user. That class would
implement a Filter interface with a method that takes a Configuration
and returns a Query object. So in your example, mailing and timestamp
would get passed in as Configuration parameters, and a Filter class
provided by the user would build the Query object corresponding to
your GQL statement. It would act kind of like a templating language
for building queries. Does that make sense and sound like a good idea?
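
A minimal sketch of what such a hook might look like follows. To keep it
self-contained, a Map<String, String> stands in for the Configuration and
a small QuerySpec class stands in for the datastore Query; both stand-ins,
along with the names QueryFilter, MailingFilter, FilterLoader, and the
"filter.mailing" / "filter.timestamp" parameter keys, are hypothetical and
not part of any existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for the datastore Query (assumption): records a kind and
// a list of filters in textual form.
final class QuerySpec {
    final String kind;
    final List<String> filters = new ArrayList<>();

    QuerySpec(String kind) {
        this.kind = kind;
    }

    QuerySpec addFilter(String property, String op, Object value) {
        filters.add(property + " " + op + " " + value);
        return this;
    }
}

// Hypothetical hook interface: turns configuration parameters into a query.
interface QueryFilter {
    QuerySpec buildQuery(Map<String, String> conf);
}

// Example user-provided class, equivalent to:
//   select * from users where mailing=:value1 and timestamp<=:value2
final class MailingFilter implements QueryFilter {
    @Override
    public QuerySpec buildQuery(Map<String, String> conf) {
        boolean mailing = Boolean.parseBoolean(conf.get("filter.mailing"));
        long timestamp = Long.parseLong(conf.get("filter.timestamp"));
        return new QuerySpec("users")
                .addFilter("mailing", "=", mailing)
                .addFilter("timestamp", "<=", timestamp);
    }
}

// Loads the user's class by name, as the input format might do after
// reading the class name from the job configuration.
final class FilterLoader {
    static QueryFilter load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(QueryFilter.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot load filter class " + className, e);
        }
    }
}
```

The upside of this shape is that no query language has to be parsed at
all: the user writes ordinary Java, and the framework only needs a class
name plus whatever parameters the class reads from the configuration.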

On Nov 18, 7:01 am, Nacho Coloma <[email protected]> wrote:
> > I'm not entirely sure I understand the scope of the proposed patch.
> > Are you thinking about adding filters at the DatastoreRecordReader
> > level? It's not entirely clear to me that that provides a benefit
> > over just applying the filter at the start of the map() function.
> > Totally willing to believe I'm missing something, though.
>
> The map() filter runs against your quota. This is OK for once-only tasks
> such as schema upgrades, but Mappers can also be used for repetitive tasks
> such as mailing, data cleanup, etc. For these cases, being able to work on a
> subset of data is important (process only user accounts with mailing
> enabled, for example).
>
> The biggest problem to resolve is how to specify the filter clause in
> mapreduce.xml. I am considering implementing a GQL parser, as simple as
> possible, and injecting servlet request parameters. Something like:
>
> <property>
> <name>mapreduce.mapper.inputformat.datastoreinputformat.query</name>
> <value>select * from users where mailing=:value1 and
> timestamp&lt;=:value2</value>
> </property>
>
> This implies porting the GQL implementation from Python to Java, or
> implementing an ANTLR-based parser. I feel like I am reinventing the wheel,
> so any suggestion to use something that exists (or to aim for a simpler
> design) is welcome.
>
> > On a logistical note, for nontrivial contributions, we require a CLA
> > from either you or your employer (depending on who owns the copyright
> > for your work) before we can accept significant contributions. The
> > relevant forms are at:
> > http://code.google.com/legal/individual-cla-v1.0.html
> > and http://code.google.com/legal/corporate-cla-v1.0.html. Feel free to
> > email me privately if this is an issue.
>
> No problem with that.
>
> Regards,
>
> Nacho.

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine for Java" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine-java?hl=en.
