[google-appengine] Re: Fan-in with materialized views: A sketch

Dmitry Wed, 03 Nov 2010 14:00:46 -0700

Oops, I read the expression the wrong way round :) Yep, this will
definitely work!
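
For reference, a minimal sketch of the check-then-hash approach from the quoted messages below, written for Python 3 (the thread itself uses Python 2). `safe_task_name` is a hypothetical helper name, and `sha1_hash` only approximates the helper quoted further down.

```python
import hashlib
import re

# Characters accepted without hashing, taken from the regex quoted in this thread.
VALID_TASK_NAME = re.compile(r"[a-zA-Z0-9-]+")


def sha1_hash(value):
    """Return the hex SHA-1 digest of a str (encoded as UTF-8) or bytes."""
    if isinstance(value, str):
        value = value.encode("utf-8")
    return hashlib.sha1(value).hexdigest()


def safe_task_name(task_name):
    """Keep names that already match the pattern; hash everything else."""
    if VALID_TASK_NAME.fullmatch(task_name):
        return task_name
    return sha1_hash(task_name)
```

`fullmatch` is used here instead of `^...$` because `$` also matches just before a trailing newline, so a name ending in `\n` would otherwise be passed through unhashed.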


On Nov 3, 7:43 pm, Robert Kluin <[email protected]> wrote:
> Dmitry,
>   Right, I know those will cause problems. So what about my suggested 
> solution of using:
>
>   if not re.match("^[a-zA-Z0-9-]+$", task_name):
>       task_name = sha1_hash(task_name)
>
> That should correctly handle your use cases, since the full name will be 
> hashed.
>
> Are there issues with that solution I am not seeing?
>
> Robert
>
> On Nov 3, 2010, at 3:52, Dmitry <[email protected]> wrote:
>
> > Robert,
>
> > You will run into trouble with these aggregations:
>
> > urls:
> > http://правительство.рф/search/?phrase=налог&section=gov_events ->
> > httpsearchphrase
> > http://правительство.рф/search/?phrase=президент&section=gov_events ->
> > httpsearchphrase
>
> > or usernames:
> > мститель2000 -> 2000
> > тест2000 -> 2000
>
> > But anyway, in most cases your approach will work well :) You can leave
> > it up to the user (add some kind of "use_hash" flag).
>
> > Or we can try to URL-encode the strings:
> > urllib.quote(task_name.encode('utf-8'))
> > http3AD0BFD180D0B0D0B2D0B8D182D0B5D0BBD18CD181D182D0B2D0BED180D184search3Fphrase3DD0BDD0B0D0BBD0BED0B3
> > http3AD0BFD180D0B0D0B2D0B8D182D0B5D0BBD18CD181D182D0B2D0BED180D184search3Fphrase3DD0BFD180D0B5D0B7D0B8D0B4D0B5D0BDD182
>
> > but this is no better than a hash :-D
>
> > thanks
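
The collision described above can be reproduced with a short Python 3 sketch (the thread itself targets Python 2). `strip_invalid` mirrors the `re.sub` line from the commit quoted further down. `percent_encoded` is a hypothetical helper that approximates the URL-encoding idea by quoting first and then stripping whatever is still not allowed, including the `%` signs.

```python
import re
from urllib.parse import quote


def strip_invalid(name):
    """Drop every character outside [a-zA-Z0-9-], as in the quoted commit."""
    return re.sub(r"[^a-zA-Z0-9-]", "", name)[:500]


def percent_encoded(name):
    """Percent-encode the UTF-8 bytes, then strip the remaining invalid characters."""
    return strip_invalid(quote(name))


names = ["мститель2000", "тест2000"]

# Stripping alone maps both usernames to the same task name.
stripped = [strip_invalid(n) for n in names]

# Quoting first keeps these two names distinct, at the cost of much longer names.
encoded = [percent_encoded(n) for n in names]
```

Stripping the `%` signs after quoting is not guaranteed to be collision-free in general; it only happens to separate the two example names.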
>
> > On Nov 3, 7:13 am, Robert Kluin <[email protected]> wrote:
> >> Hey Dmitry,
> >>   I am sure the "fix" in that commit is _not_ a good idea.  Originally
> >> I stuck it in because I use entity keys as the task-name, and sometimes
> >> they contain characters not allowed in task-names.  I actually
> >> debated for several days about pushing that update out;  finally I
> >> decided to push and hope someone would notice and offer their thoughts.
>
> >>   I like your idea a lot.  But, for many aggregations I like to use
> >> entity keys, it makes it possible for me to visually see what a task
> >> is doing.  What do you think about something like the following
> >> approach:
>
> >>   if not re.match("^[a-zA-Z0-9-]+$", task_name):
> >>       task_name = sha1_hash(task_name)
>
> >> That should allow 'valid' names to remain as-is, but it will safely
> >> encode non-valid task-names.  Do you think that is an acceptable
> >> method?
>
> >> Thanks a lot for your feedback.
>
> >> Robert
>
> >> On Tue, Nov 2, 2010 at 07:15, Dmitry <[email protected]> wrote:
> >>> Hi Robert,
>
> >>> Regarding your latest commit:
>
> >>> # TODO: find a better solution for cleaning up the name.
> >>> task_name = re.sub('[^a-zA-Z0-9-]', '', task_name)[:500]
>
> >>> I don't think this is a good idea :) For example, I have Unicode
> >>> characters in the aggregation value. In that case the regexp strips
> >>> every character and leaves an empty string.
> >>> I use a SHA-1 hash now... but there is also a small possibility of
> >>> collision.
>
> >>> sha1_hash(self.agg_name)
>
> >>> import hashlib
>
> >>> def utf8encoded(data):
> >>>   # Python 2: encode unicode objects to UTF-8; pass str and None through.
> >>>   if isinstance(data, unicode):
> >>>     return data.encode('utf-8')
> >>>   return data
>
> >>> def sha1_hash(value):
> >>>   # Hex SHA-1 digest of the UTF-8 encoded value.
> >>>   return hashlib.sha1(utf8encoded(value)).hexdigest()
>
> >>> On Oct 24, 9:26 pm, Robert Kluin <[email protected]> wrote:
> >>>> Hi Dmitry,
> >>>>   Glad to hear it was helpful!  Not sure when you checked it out last,
> >>>> but I made a number of good (I think) improvements in the last couple
> >>>> days, such as continuations to allow splitting large groups of work
> >>>> up.
>
> >>>> Robert
>
> >>>> On Sun, Oct 24, 2010 at 07:57, Dmitry <[email protected]> wrote:
> >>>>> Robert,
>
> >>>>> Your grouping_with_date_rollup.py example was extremely helpful. Thanks
> >>>>> a lot again! :)
>
> >>>>> On Oct 14, 8:47 pm, Robert Kluin <[email protected]> wrote:
> >>>>>> Hey Carles,
> >>>>>>   Glad it seems helpful.  I am hoping to get time today to push out
> >>>>>> some revisions and sample code.
>
> >>>>>> Robert
>
> >>>>>> On Thu, Oct 14, 2010 at 05:50, Carles Gonzalez <[email protected]> 
> >>>>>> wrote:
> >>>>>>> Robert, I took a brief look at your code and it seems very cool.
> >>>>>>> Exactly what I was looking for for my report generation and such.
> >>>>>>> I'm looking forward to more examples, but it seems a very valuable
> >>>>>>> addition to our toolbox.
> >>>>>>> Thanks a lot!
>
> >>>>>>> On Wed, Oct 13, 2010 at 9:20 PM, Carles Gonzalez <[email protected]> 
> >>>>>>> wrote:
>
> >>>>>>>> Neat! I'm going to take a look at this code, hopefully I'll
> >>>>>>>> understand something :)
> >>>>>>>> On Wednesday, October 13, 2010, Robert Kluin <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>> Hey Dmitry,
> >>>>>>>>>    In case it might help, I pushed some code to bitbucket.  At the
> >>>>>>>>> moment I would (personally) say the code is not too pretty, but it
> >>>>>>>>> works well.  :)
> >>>>>>>>>      http://bitbucket.org/thebobert/slagg
>
> >>>>>>>>>   Sorry it does not really have good documentation at the moment,
> >>>>>>>>> but I think the basic example I threw together will give you a good
> >>>>>>>>> idea of how to use it.  I need to do another cleanup pass over the
> >>>>>>>>> API to make a few more refinements.
>
> >>>>>>>>>    I pulled this code out of one of my apps, and tried to quickly
> >>>>>>>>> refactor it to be a bit more generic.  We are currently using
> >>>>>>>>> basically the same code in three apps to do some really complex
> >>>>>>>>> calculations.  As soon as I get time I will get an example up
> >>>>>>>>> showing how to use it for neat stuff, like overall, yearly, monthly,
> >>>>>>>>> and daily aggregates across multiple values (like total dollars and
> >>>>>>>>> quantity).  The cool thing is that you can do all of those
> >>>>>>>>> aggregations across various groupings, like customer, company,
> >>>>>>>>> contact, and sales-person, at once.  I'll get that code pushed out
> >>>>>>>>> in the next few days.
>
> >>>>>>>>>   Would love to get some feedback on it.
>
> >>>>>>>>> Robert
>
> >>>>>>>>> On Tue, Oct 12, 2010 at 17:26, Dmitry <[email protected]> 
> >>>>>>>>> wrote:
> >>>>>>>>>> Ben, thanks for your code! I'm trying to understand all this stuff
> >>>>>>>>>> too...
> >>>>>>>>>> Robert, any success with your "library"? Maybe you've already done
> >>>>>>>>>> all the stuff we are trying to implement...
>
> >>>>>>>>>> P.S. Where is Brett S.? :) I would like to hear his comments on this.
>
> >>>>>>>>>> On Sep 21, 1:49 pm, Ben <[email protected]> wrote:
> >>>>>>>>>>> Thanks for your insights. I would love feedback on this
> >>>>>>>>>>> implementation (Brett S. suggested we send in our code for
> >>>>>>>>>>> this): http://pastebin.com/3pUhFdk8
>
> >>>>>>>>>>> This implementation is for just one materialized view row at a
> >>>>>>>>>>> time (e.g. a simple counter, no presence markers). Hopefully
> >>>>>>>>>>> putting an ETA on the transactional task will relieve the write
> >>>>>>>>>>> pressure, since usually it should be an old update with an
> >>>>>>>>>>> out-of-date sequence number and be discarded (the update having
> >>>>>>>>>>> already been completed in batch by the fork-join-queue).
>
> >>>>>>>>>>> I'd love to generalize this to do more than one materialized view
> >>>>>>>>>>> row, but thought I'd get feedback first.
>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Ben
>
> >>>>>>>>>>> On Sep 17, 7:30 am, Robert Kluin <[email protected]> wrote:
>
> >>>>>>>>>>>> Responses inline.
>
> >>>>>>>>>>>> On Thu, Sep 16, 2010 at 17:32, Ben <[email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>> I have a question about Brett Slatkin's talk at I/O 2010 on data
> >>>>>>>>>>>>> pipelines. The question is about slide #67 of his PDF,
> >>>>>>>>>>>>> corresponding to minute 51:30 of his talk:
>
> >>>>>>>>>>>>> http://code.google.com/events/io/2010/sessions/high-throughput-data-p...
>
> >>>>>>>>>>>>> I am wondering what is supposed to happen in the transactional
> >>>>>>>>>>>>> task (bullet point 2c). Would these updates to the materialized
> >>>>>>>>>>>>> view cause you to write too frequently to the entity group
> >>>>>>>>>>>>> containing the materialized view?
>
> >>>>>>>>>>>> I think there are really two different approaches you can use to
> >>>>>>>>>>>> insert your work models.
> >>>>>>>>>>>> 1)  The work models get added to the original entity's group.
> >>>>>>>>>>>> So, inside of the original transaction you do not write to the
> >>>>>>>>>>>> entity group containing the materialized view -- so no contention
> >>>>>>>>>>>> on it.  Commit the transaction and proceed to step 3.
> >>>>>>>>>>>> 2)  You kick off a transactional task to insert the work model,
> >>>>>>>>>>>> or fan-out more tasks to create work models  :).   Then you
> >>>>>>>>>>>> proceed to step 3.
>
> >>>>>>>>>>>> You can use method 1 if you have only a few aggregates.  If you
> >>>>>>>>>>>> have more aggregates, use the second method.  I have a "library"
> >>>>>>>>>>>> I am almost ready to open source that makes method 2 really easy,
> >>>>>>>>>>>> so you can have lots of aggregates.  I'll post to this group when
> >>>>>>>>>>>> I release it.
>
> >>>>>>>>>>>>> And a related question: what happens if there is a failure just
> >>>>>>>>>>>>> after the transaction in bullet #2, but right before the named
> >>>>>>>>>>>>> task gets inserted in bullet #3? In my current implementation I
> >>>>>>>>>>>>> just left out the transactional task (bullet point 2c), but I
> >>>>>>>>>>>>> think that causes me to lose the eventual consistency.
>
> >>>>>>>>>>>> Failure between steps 2 and 3 just means _that_ particular update
> >>>>>>>>>>>> will not try to kick off, i.e. insert, the fan-in (aggregation)
> >>>>>>>>>>>> task.  But it might have already been inserted by the previous
> >>>>>>>>>>>> update, or it may be inserted by the next update.  However, if
> >>>>>>>>>>>> nothing else kicks off the fan-in task, you will need some periodic
> >>>>>>>>>>>> "cleanup" method to catch the update and kick off the fan-in task.
> >>>>>>>>>>>> Depending on exactly how you implemented step 2, you may not need a
> >>>>>>>>>>>> transactional task.
>
> >>>>>>>>>>>> Robert
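
The failure case described above can be illustrated with a small in-memory simulation in Python. This is only a sketch under stated assumptions: it does not use the App Engine task queue or datastore APIs, and `FanInSim`, its attributes, and the single task name `fan-in` are all hypothetical stand-ins.

```python
class FanInSim:
    """Toy model of work models, a named fan-in task, and a periodic cleanup."""

    def __init__(self):
        self.pending_work = []     # stands in for committed work models
        self.queued_tasks = set()  # stands in for named tasks waiting in the queue
        self.view_total = 0        # stands in for one materialized view row

    def record_update(self, amount, fail_before_enqueue=False):
        # Step 2: the work model is committed.
        self.pending_work.append(amount)
        if fail_before_enqueue:
            return  # simulated failure between steps 2 and 3
        # Step 3: insert the named fan-in task.
        self.enqueue_fan_in()

    def enqueue_fan_in(self):
        # Adding a name that is already queued changes nothing, which is how
        # this toy models de-duplication of named tasks.
        self.queued_tasks.add("fan-in")

    def run_queued_tasks(self):
        # The fan-in task applies all pending work to the view in one batch.
        if self.queued_tasks:
            self.view_total += sum(self.pending_work)
            self.pending_work.clear()
            self.queued_tasks.clear()

    def cleanup(self):
        # Periodic job: if work is pending but no fan-in task exists, insert one.
        if self.pending_work and not self.queued_tasks:
            self.enqueue_fan_in()


sim = FanInSim()
sim.record_update(5, fail_before_enqueue=True)  # update committed, task never inserted
sim.run_queued_tasks()                          # nothing queued, view unchanged
sim.cleanup()                                   # cleanup notices the orphaned work
sim.run_queued_tasks()                          # view now includes the update
```

If a later `record_update` call had succeeded before the cleanup ran, its fan-in task would have picked up the orphaned work as well, which matches the point that another update may kick off the task.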
>
> >>>>>>>>>>>>> Thanks!
>
> >>>>>>> --
> >>>>>>> You received this message because you are subscribed to the Google 
> >>>>>>> Groups
> >>>>>>> "Google App Engine" group.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
