Hey Dmitry,

I am working on putting together some decent documentation about when you
might want to use fanout versus using create_work directly, and about usage
in general. If I am dealing with one or two aggregations, I usually use
create_work directly. You can only insert five transactional tasks in one
datastore transaction, so with up to four aggregations you can insert the
create_work tasks directly and eliminate the fanout task.
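For instance, the direct version looks roughly like this (a minimal sketch,
assuming the Python runtime; the handler URL, queue name, and parameter
names are placeholders, not slagg's actual API):

    # Enqueue create_work tasks transactionally alongside the primary
    # write. App Engine caps transactional tasks at five per datastore
    # transaction, so this only works for a handful of aggregations;
    # beyond that, insert a single fanout task instead.
    from google.appengine.api import taskqueue
    from google.appengine.ext import db

    def save_item_with_aggregates(item, aggregation_names):
        assert len(aggregation_names) <= 5  # transactional task limit

        def txn():
            item.put()  # the primary write
            for agg_name in aggregation_names:
                taskqueue.add(
                    url='/tasks/create_work',        # hypothetical handler
                    params={'key': str(item.key()),
                            'aggregation': agg_name},
                    queue_name='aggregation-queue',  # dedicated queue
                    transactional=True)              # commits with the txn

        db.run_in_transaction(txn)

Note that transactional tasks cannot be named, so any de-duping by task
name has to happen later, when the aggregation task itself gets inserted.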
As far as rates go, I have been using a rate of 35/s and a bucket size of
40. However, I also get periodic queue backups. I think the max rate per
second is currently 50, but I thought there was an announcement that it was
getting increased (maybe I am just remembering the announcement of the
increase to 50/s, though). You might want to bump your rate up to 50/s. I
always use a dedicated queue for create_work and aggregation tasks. In one
of my apps I use multiple queues to get a bit higher throughput.

I generally prefer to use create_work tasks; they cleanly handle any
failures that occur and keep my primary processing running as fast as
possible. However, when I first started using this type of aggregation
technique, I created the 'work' models and attempted to insert the
aggregator task (non-transactionally!) within my primary transaction. If
your primary processing is within tasks, and your tasks are fast enough,
give it a shot. Converting CreateWorkHandler to something you can use
directly should not be a big deal.

Robert
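A queue.yaml along those lines might look like the following (the queue
names are placeholders; tune the numbers for your own app):

    queue:
    - name: aggregation-queue      # dedicated queue for create_work tasks
      rate: 35/s
      bucket_size: 40
    - name: aggregation-queue-2    # optional extra queue for more throughput
      rate: 35/s
      bucket_size: 40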
On Mon, Nov 8, 2010 at 18:14, Dmitry <[email protected]> wrote:
> Hi Robert,
>
> What queue configuration do you use for your system?
> I ran into another problem. I usually process several feeds in parallel
> and can insert up to 20-30 new items into the database. With 4
> aggregators that is >80 create_work tasks at once. So after a minute I
> can have up to 1000 tasks in the queue... so I have up to 5 minutes of
> delay in processing.
>
> It seems that for the initial aggregation I should insert the create-work
> models directly, not in tasks.
> I messed up again :)
>
> On Nov 5, 6:46 am, Robert Kluin <[email protected]> wrote:
>> Dmitry,
>> I finally got the time to make these changes. Let me know if that
>> works for your use-case.
>>
>> I really appreciate all of your suggestions and help with this.
>>
>> Robert
>>
>> 2010/11/3 Dmitry <[email protected]>:
>>> oops, I read the expression in the wrong direction. This will
>>> definitely work!
>>>
>>> On Nov 3, 7:43 pm, Robert Kluin <[email protected]> wrote:
>>>> Dmitry,
>>>> Right, I know those will cause problems. So what about my suggested
>>>> solution of using:
>>>>
>>>>     if not re.match("^[a-zA-Z0-9-]+$", task_name):
>>>>         task_name = sha1_hash(task_name)
>>>>
>>>> That should correctly handle your use cases, since the full name
>>>> will be hashed.
>>>>
>>>> Are there issues with that solution I am not seeing?
>>>>
>>>> Robert
>>>>
>>>> On Nov 3, 2010, at 3:52, Dmitry <[email protected]> wrote:
>>>>> Robert,
>>>>>
>>>>> You will get into trouble with these aggregations:
>>>>>
>>>>> urls:
>>>>> http://правительство.рф/search/?phrase=налог&section=gov_events -> httpsearchphrase
>>>>> http://правительство.рф/search/?phrase=президент&section=gov_events -> httpsearchphrase
>>>>>
>>>>> or usernames:
>>>>> мститель2000 -> 2000
>>>>> тест2000 -> 2000
>>>>>
>>>>> but anyway, in most cases your approach will work well :) You can
>>>>> leave it up to the user (add some kind of flag, "use_hash").
>>>>>
>>>>> or we can try to url-encode strings:
>>>>> urllib.quote(task_name.encode('utf-8'))
>>>>> http3AD0BFD180D0B0D0B2D0B8D182D0B5D0BBD18CD181D182D0B2D0BED180D184search3Fphrase3DD0BDD0B0D0BBD0BED0B3
>>>>> http3AD0BFD180D0B0D0B2D0B8D182D0B5D0BBD18CD181D182D0B2D0BED180D184search3Fphrase3DD0BFD180D0B5D0B7D0B8D0B4D0B5D0BDD182
>>>>>
>>>>> but this is not better than the hash :-D
>>>>>
>>>>> thanks
>>>>>
>>>>> On Nov 3, 7:13 am, Robert Kluin <[email protected]> wrote:
>>>>>> Hey Dmitry,
>>>>>> I am sure the "fix" in that commit is _not_ a good idea. Originally
>>>>>> I stuck it in because I use entity keys as the task-name, and
>>>>>> sometimes they contain characters not allowed in task-names. I
>>>>>> actually debated for several days about pushing that update out;
>>>>>> finally I decided to push and hope someone would notice and offer
>>>>>> their thoughts.
>>>>>>
>>>>>> I like your idea a lot. But for many aggregations I like to use
>>>>>> entity keys; it makes it possible for me to visually see what a
>>>>>> task is doing. What do you think about something like the
>>>>>> following approach:
>>>>>>
>>>>>>     if not re.match("^[a-zA-Z0-9-]+$", task_name):
>>>>>>         task_name = sha1_hash(task_name)
>>>>>>
>>>>>> That should allow 'valid' names to remain as-is, but it will
>>>>>> safely encode non-valid task-names. Do you think that is an
>>>>>> acceptable method?
>>>>>>
>>>>>> Thanks a lot for your feedback.
>>>>>>
>>>>>> Robert
>>>>>>
>>>>>> On Tue, Nov 2, 2010 at 07:15, Dmitry <[email protected]> wrote:
>>>>>>> Hi Robert,
>>>>>>>
>>>>>>> Regarding your latest commit:
>>>>>>>
>>>>>>>     # TODO: find a better solution for cleaning up the name.
>>>>>>>     task_name = re.sub('[^a-zA-Z0-9-]', '', task_name)[:500]
>>>>>>>
>>>>>>> I don't think this is a good idea :) For example, I have unicode
>>>>>>> characters in the aggregation value. In this case the regexp will
>>>>>>> return nothing.
>>>>>>> I use a sha1 hash now... but there is also a small possibility of
>>>>>>> collision.
>>>>>>>
>>>>>>>     sha1_hash(self.agg_name)
>>>>>>>
>>>>>>>     import hashlib
>>>>>>>
>>>>>>>     def utf8encoded(data):
>>>>>>>         if data is None:
>>>>>>>             return None
>>>>>>>         if isinstance(data, unicode):
>>>>>>>             return unicode(data).encode('utf-8')
>>>>>>>         else:
>>>>>>>             return data
>>>>>>>
>>>>>>>     def sha1_hash(value):
>>>>>>>         return hashlib.sha1(utf8encoded(value)).hexdigest()
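Putting the two snippets from the exchange above together, a complete
version of that naming rule might look like this (Python 2, as in the
original snippets, with the missing imports added; safe_task_name is just
an illustrative name):

    import hashlib
    import re

    def utf8encoded(data):
        if data is None:
            return None
        if isinstance(data, unicode):
            return data.encode('utf-8')
        return data

    def sha1_hash(value):
        return hashlib.sha1(utf8encoded(value)).hexdigest()

    def safe_task_name(task_name):
        # Task names are limited to letters, digits, hyphens, and
        # underscores, at most 500 characters; the thread's regex is
        # stricter than that, which is safe. Anything else gets hashed,
        # so both of Dmitry's colliding examples map to distinct names.
        if len(task_name) > 500 or not re.match('^[a-zA-Z0-9-]+$', task_name):
            return sha1_hash(task_name)
        return task_name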
>>>>>>>
>>>>>>> On Oct 24, 9:26 pm, Robert Kluin <[email protected]> wrote:
>>>>>>>> Hi Dmitry,
>>>>>>>> Glad to hear it was helpful! Not sure when you last checked it
>>>>>>>> out, but I made a number of good (I think) improvements in the
>>>>>>>> last couple of days, such as continuations to allow splitting
>>>>>>>> large groups of work up.
>>>>>>>>
>>>>>>>> Robert
>>>>>>>>
>>>>>>>> On Sun, Oct 24, 2010 at 07:57, Dmitry <[email protected]> wrote:
>>>>>>>>> Robert,
>>>>>>>>>
>>>>>>>>> Your grouping_with_date_rollup.py example was extremely helpful.
>>>>>>>>> Thanks a lot again! :)
>>>>>>>>>
>>>>>>>>> On Oct 14, 8:47 pm, Robert Kluin <[email protected]> wrote:
>>>>>>>>>> Hey Carles,
>>>>>>>>>> Glad it seems helpful. I am hoping to get time today to push
>>>>>>>>>> out some revisions and sample code.
>>>>>>>>>>
>>>>>>>>>> Robert
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 14, 2010 at 05:50, Carles Gonzalez <[email protected]> wrote:
>>>>>>>>>>> Robert, I took a brief look at your code and it seems very
>>>>>>>>>>> cool. Exactly what I was looking for, for my report
>>>>>>>>>>> generation and such.
>>>>>>>>>>> I'm looking forward to more examples, but it seems a very
>>>>>>>>>>> valuable addition to our toolbox.
>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 13, 2010 at 9:20 PM, Carles Gonzalez <[email protected]> wrote:
>>>>>>>>>>>> Neat! I'm going to look at this code; hopefully I'll
>>>>>>>>>>>> understand something :)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, October 13, 2010, Robert Kluin <[email protected]> wrote:
>>>>>>>>>>>>> Hey Dmitry,
>>>>>>>>>>>>> In case it might help, I pushed some code to bitbucket. At
>>>>>>>>>>>>> the moment I would (personally) say the code is not too
>>>>>>>>>>>>> pretty, but it works well. :)
>>>>>>>>>>>>>     http://bitbucket.org/thebobert/slagg
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry it does not really have good documentation at the
>>>>>>>>>>>>> moment, but I think the basic example I threw together will
>>>>>>>>>>>>> give you a good idea of how to use it. I need to do another
>>>>>>>>>>>>> cleanup pass over the API to make a few more refinements.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I pulled this code out of one of my apps, and tried to
>>>>>>>>>>>>> quickly refactor it to be a bit more generic. We are
>>>>>>>>>>>>> currently using basically the same code in three apps to do
>>>>>>>>>>>>> some really complex calculations. As soon as I get time I
>>>>>>>>>>>>> will get an example up showing how to use it for neat
>>>>>>>>>>>>> stuff, like overall, yearly, monthly, and daily aggregates
>>>>>>>>>>>>> across multiple values (like total dollars and quantity).
>>>>>>>>>>>>> The cool thing is that you can do all of those aggregations
>>>>>>>>>>>>> across various groupings, like customer, company, contact,
>>>>>>>>>>>>> and sales-person, at once. I'll get that code pushed out in
>>>>>>>>>>>>> the next few days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would love to get some feedback on it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Robert
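To give a rough idea of the overall/yearly/monthly/daily rollups described
above, here is a small illustration of building aggregation keys across
groupings (purely a sketch; slagg's real API differs):

    # Expand one source value into aggregation keys at several date
    # granularities, for each grouping. Each key would identify one
    # 'work' batch / materialized-view row.
    from datetime import date

    def rollup_keys(when, groupings):
        # when: a date; groupings: e.g. {'customer': 'c42', 'company': 'acme'}
        periods = ['overall',
                   'year-%04d' % when.year,
                   'month-%04d%02d' % (when.year, when.month),
                   'day-%04d%02d%02d' % (when.year, when.month, when.day)]
        return ['%s:%s:%s' % (group, value, period)
                for group, value in sorted(groupings.items())
                for period in periods]

    # rollup_keys(date(2010, 11, 8), {'customer': 'c42'}) ->
    # ['customer:c42:overall', 'customer:c42:year-2010',
    #  'customer:c42:month-201011', 'customer:c42:day-20101108']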
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 12, 2010 at 17:26, Dmitry <[email protected]> wrote:
>>>>>>>>>>>>>> Ben, thanks for your code! I'm trying to understand all
>>>>>>>>>>>>>> this stuff too...
>>>>>>>>>>>>>> Robert, any success with your "library"? Maybe you've
>>>>>>>>>>>>>> already done all the stuff we are trying to implement...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> p.s. where is Brett S.? :) I would like to hear his
>>>>>>>>>>>>>> comments on this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 21, 1:49 pm, Ben <[email protected]> wrote:
>>>>>>>>>>>>>>> Thanks for your insights. I would love feedback on this
>>>>>>>>>>>>>>> implementation (Brett S. suggested we send in our code
>>>>>>>>>>>>>>> for this): http://pastebin.com/3pUhFdk8
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This implementation is for just one materialized-view row
>>>>>>>>>>>>>>> at a time (e.g. a simple counter, no presence markers).
>>>>>>>>>>>>>>> Hopefully putting an ETA on the transactional task will
>>>>>>>>>>>>>>> relieve the write pressure, since usually it should be an
>>>>>>>>>>>>>>> old update with an out-of-date sequence number and be
>>>>>>>>>>>>>>> discarded (the update having already been completed in
>>>>>>>>>>>>>>> batch by the fork-join-queue).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd love to generalize this to do more than one
>>>>>>>>>>>>>>> materialized-view row, but thought I'd get feedback
>>>>>>>>>>>>>>> first.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 17, 7:30 am, Robert Kluin <[email protected]> wrote:
>>>>>>>>>>>>>>>> Responses inline.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 17:32, Ben <[email protected]> wrote:
>>>>>>>>>>>>>>>>> I have a question about Brett Slatkin's talk at I/O
>>>>>>>>>>>>>>>>> 2010 on data pipelines. The question is about slide #67
>>>>>>>>>>>>>>>>> of his pdf, corresponding to minute 51:30 of his talk:
>>>>>>>>>>>>>>>>> http://code.google.com/events/io/2010/sessions/high-throughput-data-p...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am wondering what is supposed to happen in the
>>>>>>>>>>>>>>>>> transactional task (bullet point 2c). Would these
>>>>>>>>>>>>>>>>> updates to the materialized view cause you to write too
>>>>>>>>>>>>>>>>> frequently to the entity group containing the
>>>>>>>>>>>>>>>>> materialized view?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think there are really two different approaches you
>>>>>>>>>>>>>>>> can use to insert your work models.
>>>>>>>>>>>>>>>> 1) The work models get added to the original entity's
>>>>>>>>>>>>>>>> group. So, inside of the original transaction you do not
>>>>>>>>>>>>>>>> write to the entity group containing the materialized
>>>>>>>>>>>>>>>> view -- so no contention on it. Commit the transaction
>>>>>>>>>>>>>>>> and proceed to step 3.
>>>>>>>>>>>>>>>> 2) You kick off a transactional task to insert the work
>>>>>>>>>>>>>>>> model, or fan out more tasks to create work models :).
>>>>>>>>>>>>>>>> Then you proceed to step 3.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can use method 1 if you have only a few aggregates.
>>>>>>>>>>>>>>>> If you have more aggregates, use the second method. I
>>>>>>>>>>>>>>>> have a "library" I am almost ready to open source that
>>>>>>>>>>>>>>>> makes method 2 really easy, so you can have lots of
>>>>>>>>>>>>>>>> aggregates. I'll post to this group when I release it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And a related question: what happens if there is a
>>>>>>>>>>>>>>>>> failure just after the transaction in bullet #2, but
>>>>>>>>>>>>>>>>> right before the named task gets inserted in bullet #3?
>>>>>>>>>>>>>>>>> In my current implementation I just left out the
>>>>>>>>>>>>>>>>> ...
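For completeness, a minimal sketch of method 1 from the exchange above
(the model, handler, queue, and task names are made up): the work model is
written in the original entity's group, and the named task in step 3
de-dupes concurrent inserts.

    from google.appengine.api import taskqueue
    from google.appengine.ext import db

    class Work(db.Model):
        aggregation = db.StringProperty()
        delta = db.IntegerProperty()

    def record_sale(sale, delta):
        def txn():
            sale.put()
            # Child of the sale -> same entity group, so the transaction
            # never touches the materialized view's group (no contention).
            Work(parent=sale, aggregation='daily-total', delta=delta).put()
        db.run_in_transaction(txn)

        # Step 3: insert a named task to batch-apply the pending work.
        # If we crash before this line, the work entity still exists and
        # the next insert (or a sweeper cron) schedules the batch.
        try:
            taskqueue.add(name='agg-daily-total-20101108',  # e.g. per batch window
                          url='/tasks/aggregate',
                          queue_name='aggregation-queue')
        except (taskqueue.TaskAlreadyExistsError,
                taskqueue.TombstonedTaskError):
            pass  # the batch is already scheduled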
