Hi Robert,

What queue configuration do you use for your system?

I've run into another problem. I usually process several feeds in parallel, and each run can insert 20-30 new items into the database. With 4 aggregators that is >80 create_work tasks at one moment, so after a minute I can have up to 1000 tasks in the queue... which gives me up to a 5-minute delay in processing.
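For reference, here is roughly the queue.yaml I am experimenting with to drain the backlog faster. The queue name and the rate/bucket_size numbers are just my guesses, not anything from your code:

  queue:
  - name: create-work
    rate: 20/s        # default queues run at 5/s
    bucket_size: 40   # absorb bursts from parallel feeds

At the default 5/s, 1000 queued tasks take over three minutes to drain even before execution time, which matches the delay I am seeing.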
It seems that for the initial aggregation I should insert the work models directly, rather than through create_work tasks. I messed up again :)

On Nov 5, 6:46 am, Robert Kluin <[email protected]> wrote:
> Dmitry,
>   I finally got the time to make these changes. Let me know if that
> works for your use-case.
>
> I really appreciate all of your suggestions and help with this.
>
> Robert
>
> 2010/11/3 Dmitry <[email protected]>:
> > Oops, I read the expression in the wrong direction. This will definitely work!
> >
> > On Nov 3, 7:43 pm, Robert Kluin <[email protected]> wrote:
> >> Dmitry,
> >>   Right, I know those will cause problems. So what about my suggested
> >> solution of using:
> >>
> >>   if not re.match("^[a-zA-Z0-9-]+$", task_name):
> >>       task_name = sha1_hash(task_name)
> >>
> >> That should correctly handle your use cases, since the full name will be
> >> hashed.
> >>
> >> Are there issues with that solution I am not seeing?
> >>
> >> Robert
> >>
> >> On Nov 3, 2010, at 3:52, Dmitry <[email protected]> wrote:
> >> > Robert,
> >> >
> >> > You will get into trouble with these aggregations:
> >> >
> >> > urls:
> >> > http://правительство.рф/search/?phrase=налог&section=gov_events -> httpsearchphrase
> >> > http://правительство.рф/search/?phrase=президент&section=gov_events -> httpsearchphrase
> >> >
> >> > or usernames:
> >> > мститель2000 -> 2000
> >> > тест2000 -> 2000
> >> >
> >> > But anyway, in most cases your approach will work well :) You can leave
> >> > it up to the user (add some kind of "use_hash" flag).
> >> >
> >> > Or we can try to URL-encode the strings:
> >> > urllib.quote(task_name.encode('utf-8'))
> >> > http3AD0BFD180D0B0D0B2D0B8D182D0B5D0BBD18CD181D182D0B2D0BED180D184search3Fphrase3DD0BDD0B0D0BBD0BED0B3
> >> > http3AD0BFD180D0B0D0B2D0B8D182D0B5D0BBD18CD181D182D0B2D0BED180D184search3Fphrase3DD0BFD180D0B5D0B7D0B8D0B4D0B5D0BDD182
> >> >
> >> > but this is no better than a hash :-D
> >> >
> >> > thanks
> >> >
> >> > On Nov 3, 7:13 am, Robert Kluin <[email protected]> wrote:
> >> >> Hey Dmitry,
> >> >>   I am sure the "fix" in that commit is _not_ a good idea. Originally
> >> >> I stuck it in because I use entity keys as the task-name, and sometimes
> >> >> they contain characters not allowed in task-names. I actually debated
> >> >> for several days about pushing that update out; finally I decided to
> >> >> push and hope someone would notice and offer their thoughts.
> >> >>
> >> >>   I like your idea a lot. But for many aggregations I like to use
> >> >> entity keys; that makes it possible for me to see visually what a task
> >> >> is doing. What do you think about something like the following
> >> >> approach:
> >> >>
> >> >>   if not re.match("^[a-zA-Z0-9-]+$", task_name):
> >> >>       task_name = sha1_hash(task_name)
> >> >>
> >> >> That should allow 'valid' names to remain as-is, but it will safely
> >> >> encode non-valid task-names. Do you think that is an acceptable
> >> >> method?
> >> >>
> >> >> Thanks a lot for your feedback.
> >> >>
> >> >> Robert
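(Putting your two snippets together, the whole helper looks roughly like this on my side. It is an untested sketch: safe_task_name is just my name for it, and I inlined the sha1_hash from my earlier message below.)

  import hashlib
  import re

  def sha1_hash(value):
      # Encode unicode to UTF-8 before hashing so non-ASCII values work.
      if isinstance(value, unicode):
          value = value.encode('utf-8')
      return hashlib.sha1(value).hexdigest()

  def safe_task_name(task_name):
      # Keep readable names (entity keys etc.) as-is; hash anything with
      # characters that are not allowed in task names.
      if not re.match("^[a-zA-Z0-9-]+$", task_name):
          task_name = sha1_hash(task_name)
      return task_name[:500]  # task names are limited to 500 characters

With this, u'мститель2000' and u'тест2000' get two different hashed names instead of both collapsing to '2000'.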
> >> >> On Tue, Nov 2, 2010 at 07:15, Dmitry <[email protected]> wrote:
> >> >>> Hi Robert,
> >> >>>
> >> >>> Regarding your latest commit:
> >> >>>
> >> >>> # TODO: find a better solution for cleaning up the name.
> >> >>> task_name = re.sub('[^a-zA-Z0-9-]', '', task_name)[:500]
> >> >>>
> >> >>> I don't think this is a good idea :) For example, I have unicode
> >> >>> characters in the aggregation value; in that case the regexp returns
> >> >>> an empty string.
> >> >>> I use a sha1 hash now... but there's also a small possibility of
> >> >>> collision:
> >> >>>
> >> >>> sha1_hash(self.agg_name)
> >> >>>
> >> >>> import hashlib
> >> >>>
> >> >>> def utf8encoded(data):
> >> >>>     if data is None:
> >> >>>         return None
> >> >>>     if isinstance(data, unicode):
> >> >>>         return data.encode('utf-8')
> >> >>>     return data
> >> >>>
> >> >>> def sha1_hash(value):
> >> >>>     return hashlib.sha1(utf8encoded(value)).hexdigest()
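(To put a number on that "small possibility of collision": a back-of-the-envelope birthday bound, my own arithmetic and nothing from the library, says it is negligible.)

  # Rough upper bound on the chance of any sha1 collision among n names.
  n = 10 ** 9                       # a billion distinct aggregation values
  p = n * (n - 1) / 2.0 / 2 ** 160  # birthday bound for a 160-bit hash
  print p                           # ~3.4e-31, effectively zero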
> >> >>> On Oct 24, 9:26 pm, Robert Kluin <[email protected]> wrote:
> >> >>>> Hi Dmitry,
> >> >>>>   Glad to hear it was helpful! Not sure when you checked it out last,
> >> >>>> but I made a number of good (I think) improvements in the last couple
> >> >>>> of days, such as continuations to allow splitting up large groups of
> >> >>>> work.
> >> >>>>
> >> >>>> Robert
> >> >>>>
> >> >>>> On Sun, Oct 24, 2010 at 07:57, Dmitry <[email protected]> wrote:
> >> >>>>> Robert,
> >> >>>>>
> >> >>>>> Your grouping_with_date_rollup.py example was extremely helpful.
> >> >>>>> Thanks a lot again! :)
> >> >>>>>
> >> >>>>> On Oct 14, 8:47 pm, Robert Kluin <[email protected]> wrote:
> >> >>>>>> Hey Carles,
> >> >>>>>>   Glad it seems helpful. I am hoping to get time today to push out
> >> >>>>>> some revisions and sample code.
> >> >>>>>>
> >> >>>>>> Robert
> >> >>>>>>
> >> >>>>>> On Thu, Oct 14, 2010 at 05:50, Carles Gonzalez <[email protected]> wrote:
> >> >>>>>>> Robert, I took a brief look at your code and it seems very cool.
> >> >>>>>>> Exactly what I was looking for, for my report generation and such.
> >> >>>>>>> I'm looking forward to more examples; it seems a very valuable
> >> >>>>>>> addition to our toolbox.
> >> >>>>>>> Thanks a lot!
> >> >>>>>>>
> >> >>>>>>> On Wed, Oct 13, 2010 at 9:20 PM, Carles Gonzalez <[email protected]> wrote:
> >> >>>>>>>> Neat! I'm going to look at this code, hopefully I'll understand
> >> >>>>>>>> something :)
> >> >>>>>>>> On Wednesday, October 13, 2010, Robert Kluin <[email protected]> wrote:
> >> >>>>>>>>> Hey Dmitry,
> >> >>>>>>>>>   In case it might help, I pushed some code to bitbucket. At the
> >> >>>>>>>>> moment I would (personally) say the code is not too pretty, but it
> >> >>>>>>>>> works well. :)
> >> >>>>>>>>>   http://bitbucket.org/thebobert/slagg
> >> >>>>>>>>>
> >> >>>>>>>>>   Sorry it does not really have good documentation at the moment,
> >> >>>>>>>>> but I think the basic example I threw together will give you a good
> >> >>>>>>>>> idea of how to use it. I need to do another cleanup pass over the
> >> >>>>>>>>> API to make a few more refinements.
> >> >>>>>>>>>
> >> >>>>>>>>>   I pulled this code out of one of my apps and tried to quickly
> >> >>>>>>>>> refactor it to be a bit more generic. We are currently using
> >> >>>>>>>>> basically the same code in three apps to do some really complex
> >> >>>>>>>>> calculations. As soon as I get time I will get an example up showing
> >> >>>>>>>>> how to use it for neat stuff, like overall, yearly, monthly, and
> >> >>>>>>>>> daily aggregates across multiple values (like total dollars and
> >> >>>>>>>>> quantity). The cool thing is that you can do all of those
> >> >>>>>>>>> aggregations across various groupings, like customer, company,
> >> >>>>>>>>> contact, and sales-person, at once. I'll get that code pushed out
> >> >>>>>>>>> in the next few days.
> >> >>>>>>>>>
> >> >>>>>>>>>   Would love to get some feedback on it.
> >> >>>>>>>>>
> >> >>>>>>>>> Robert
> >> >>>>>>>>>
> >> >>>>>>>>> On Tue, Oct 12, 2010 at 17:26, Dmitry <[email protected]> wrote:
> >> >>>>>>>>>> Ben, thanks for your code! I'm trying to understand all this
> >> >>>>>>>>>> stuff too...
> >> >>>>>>>>>> Robert, any success with your "library"? Maybe you've already
> >> >>>>>>>>>> done all the stuff we are trying to implement...
> >> >>>>>>>>>>
> >> >>>>>>>>>> p.s. Where is Brett S.? :) I would like to hear his comments on this.
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Sep 21, 1:49 pm, Ben <[email protected]> wrote:
> >> >>>>>>>>>>> Thanks for your insights. I would love feedback on this
> >> >>>>>>>>>>> implementation (Brett S. suggested we send in our code for this):
> >> >>>>>>>>>>> http://pastebin.com/3pUhFdk8
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> This implementation is for just one materialized-view row at a
> >> >>>>>>>>>>> time (e.g. a simple counter, no presence markers). Hopefully
> >> >>>>>>>>>>> putting an ETA on the transactional task will relieve the write
> >> >>>>>>>>>>> pressure, since usually it should be an old update with an
> >> >>>>>>>>>>> out-of-date sequence number and be discarded (the update having
> >> >>>>>>>>>>> already been completed in batch by the fork-join-queue).
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I'd love to generalize this to do more than one materialized-view
> >> >>>>>>>>>>> row, but thought I'd get feedback first.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks,
> >> >>>>>>>>>>> Ben
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On Sep 17, 7:30 am, Robert Kluin <[email protected]> wrote:
> >> >>>>>>>>>>>> Responses inline.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> On Thu, Sep 16, 2010 at 17:32, Ben <[email protected]> wrote:
> >> >>>>>>>>>>>>> I have a question about Brett Slatkin's talk at I/O 2010 on
> >> >>>>>>>>>>>>> data pipelines. The question is about slide #67 of his pdf,
> >> >>>>>>>>>>>>> corresponding to minute 51:30 of his talk:
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> http://code.google.com/events/io/2010/sessions/high-throughput-data-p...
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> I am wondering what is supposed to happen in the transactional
> >> >>>>>>>>>>>>> task (bullet point 2c). Would these updates to the materialized
> >> >>>>>>>>>>>>> view cause you to write too frequently to the entity group
> >> >>>>>>>>>>>>> containing the materialized view?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> I think there are really two different approaches you can use
> >> >>>>>>>>>>>> to insert your work models:
> >> >>>>>>>>>>>> 1) The work models get added to the original entity's group.
> >> >>>>>>>>>>>> So, inside the original transaction you do not write to the
> >> >>>>>>>>>>>> entity group containing the materialized view -- so no
> >> >>>>>>>>>>>> contention on it. Commit the transaction and proceed to step 3.
> >> >>>>>>>>>>>> 2) You kick off a transactional task to insert the work model,
> >> >>>>>>>>>>>> or fan out more tasks to create work models :). Then you
> >> >>>>>>>>>>>> proceed to step 3.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> You can use method 1 if you have only a few aggregates. If you
> >> >>>>>>>>>>>> have more aggregates, use the second method. I have a "library"
> >> >>>>>>>>>>>> I am almost ready to open source that makes method 2 really
> >> >>>>>>>>>>>> easy, so you can have lots of aggregates. I'll post to this
> >> >>>>>>>>>>>> group when I release it.
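(This is how I understood the two methods when I tried them: a rough sketch with made-up model and handler names, not code from slagg.)

  from google.appengine.ext import db
  from google.appengine.api import taskqueue  # api.labs.taskqueue on older SDKs

  class WorkModel(db.Model):
      # One unit of pending aggregation work.
      delta = db.IntegerProperty()

  def insert_work_method_1(entity, delta):
      # Method 1: the work model lives in the *original* entity's group,
      # so the materialized view's entity group sees no write here.
      def txn():
          entity.put()
          WorkModel(parent=entity, delta=delta).put()
      db.run_in_transaction(txn)

  def insert_work_method_2(entity, delta):
      # Method 2: transactionally enqueue a task whose handler creates
      # the work model(s), so one commit can feed many aggregates.
      def txn():
          entity.put()
          taskqueue.add(url='/tasks/create_work',
                        params={'key': str(entity.key()), 'delta': delta},
                        transactional=True)
      db.run_in_transaction(txn)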
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>> And a related question: what happens if there is a failure
> >> >>>>>>>>>>>>> just after the transaction in bullet #2, but right before the
> >> >>>>>>>>>>>>> named task gets inserted in bullet #3? In my current
> >> >>>>>>>>>>>>> implementation I just left out the
>
> ...
