Sorry, let me explain a little better.

(51.98s) Found 49659 objs (match: 16563) (db writes: 51180)
(range: 72500921 ~ 72550921), (avg 16.9 mins/million) -
[('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) =
[('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

In the above example, it found 49,659 rows which need 'is_checked' set to
'1' (the same principle applies to the other 3 fields), giving a total of
51,180 database writes, split across just 4 queries.

Each of those 4 fields has a list of row IDs built up for it, e.g. for
'is_image_blocked':

    if _f == 'block_images':
        _obj_incs.get('is_image_blocked').append(_hit_id)
        # also flag the parent row, if one exists
        if _parent_id:
            _obj_incs.get('is_image_blocked').append(_parent_id)
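
(For context: _obj_incs is simply a mapping of field name -> list of row
IDs, built up during the 50k-row scan. Its initialisation isn't shown
above, so treat the following as an illustrative sketch rather than our
exact code:)

    # illustrative sketch - one ID list per flag field, each holding
    # the rows that need that field set to 1
    _obj_incs = {
        'is_checked': [],
        'is_image_blocked': [],
        'has_link': [],
        'is_spam': [],
    }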

Then I loop through those fields, and do an update() using the collected
IDs:

    # now apply the obj changes in bulk (massive speed improvement)
    for _key, _value in _obj_incs.iteritems():
        # update the child objects with a single query per field
        Post.objects.filter(
            id__in=_value
        ).update(
            **{_key: 1}
        )
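
Each pass of that loop compiles down to one UPDATE statement, so 4 fields
means 4 queries regardless of how many thousand IDs sit in each list.
Roughly what it looks like at the SQL level (illustrative only, not the
exact SQL Django emits, and assuming the table is named 'post'):

    # illustrative only - roughly the SQL each iteration produces,
    # assuming the table is named "post":
    #
    #   UPDATE post SET is_checked = 1 WHERE id IN (101, 102, ...);
    #
    # update() returns the number of rows affected, so the loop can also
    # keep a running total (_writes is a hypothetical name, not our code):
    _writes = 0
    for _key, _value in _obj_incs.iteritems():
        _writes += Post.objects.filter(id__in=_value).update(**{_key: 1})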

So in simple terms, we're not doing 51 thousand individual UPDATE queries;
instead we're grouping them into bulk queries based on the field being
updated. It doesn't yet do grouping based on key AND value, simply because
we didn't need it at the time, but if we release the code for public use,
we'd definitely add this in (see the sketch below).
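
For illustration, grouping on key AND value might look something like this
(a rough sketch, not code from our actual codebase; _matches and _target
are hypothetical names):

    # rough sketch - key the ID lists on (field, value) tuples, so rows
    # needing different target values can still be batched
    _obj_incs = {}
    for _f, _target, _hit_id in _matches:  # hypothetical match iterable
        _obj_incs.setdefault((_f, _target), []).append(_hit_id)

    # one UPDATE per (field, value) pair
    for (_key, _target), _ids in _obj_incs.iteritems():
        Post.objects.filter(id__in=_ids).update(**{_key: _target})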

Hope this makes sense, let me know if I didn't explain it very well lol.

Cal

On Wed, Jun 22, 2011 at 2:45 PM, Thomas Weholt <thomas.weh...@gmail.com> wrote:

> On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd]
> <cal.leem...@simplicitymedialtd.co.uk> wrote:
> > Hey Thomas,
> > Yeah we actually spoke a little while ago about DSE. In the end, we
> > actually used a custom approach which analyses data in blocks of 50k
> > rows, builds a list of rows which need changing to the same value, then
> > applied them in bulk using update() + F().
>
> Hmmm, what do you mean by "bulk using update() + F()"? Something like
> "update sometable set somefield1 = somevalue1, somefield2 = somevalue2
> where id in (1,2,3 .....)" ? Does "avg 13.8 mins/million" mean you
> processed 13.8 million rows per minute? What kind of hardware did you
> use?
>
> Thomas
>
> > Here's our benchmark:
> > (42.11s) Found 49426 objs (match: 16107) (db writes: 50847)
> > (range: 72300921 ~ 72350921), (avg 13.8 mins/million) -
> > [('is_checked', 49426), ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
> > (44.50s) Found 49481 objs (match: 16448) (db writes: 50764)
> > (range: 72350921 ~ 72400921), (avg 14.6 mins/million) -
> > [('is_checked', 49481), ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
> > (55.78s) Found 49627 objs (match: 18516) (db writes: 50832)
> > (range: 72400921 ~ 72450921), (avg 18.3 mins/million) -
> > [('is_checked', 49627), ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
> > (42.03s) Found 49674 objs (match: 17244) (db writes: 51655)
> > (range: 72450921 ~ 72500921), (avg 13.6 mins/million) -
> > [('is_checked', 49674), ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
> > (51.98s) Found 49659 objs (match: 16563) (db writes: 51180)
> > (range: 72500921 ~ 72550921), (avg 16.9 mins/million) -
> > [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
> > Could you let me know if those benchmarks are better/worse than using
> > DSE? I'd be interested to see the comparison!
> > Cal
> > On Wed, Jun 22, 2011 at 2:31 PM, Thomas Weholt <thomas.weh...@gmail.com>
> > wrote:
> >>
> >> Yes! I'm in.
> >>
> >> Out of curiosity: When inserting lots of data, how do you do it? Using
> >> the orm? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
> >> wrote DSE to solve inserting/updating huge sets of data, but if
> >> there's a better way to do it that would be especially interesting to
> >> hear more about ( and sorry for the self promotion ).
> >>
> >> Regards,
> >> Thomas
> >>
> >> On Wed, Jun 22, 2011 at 3:15 PM, Cal Leeming [Simplicity Media Ltd]
> >> <cal.leem...@simplicitymedialtd.co.uk> wrote:
> >> > Hi all,
> >> > Some of you may have noticed, in the last few months I've done quite
> >> > a few posts/snippets about handling large data sets in Django. At the
> >> > end of this month (after what seems like a lifetime of trial and
> >> > error), we're finally going to be releasing a new site which holds
> >> > around 40mil+ rows of data, grows by about 300-500k rows each day,
> >> > handles 5GB of uploads per day, and can handle around 1024 requests
> >> > per second on stress test on a moderately spec'd server.
> >> > As the entire thing is written in Django (and a bunch of other open
> >> > source products), I'd really like to give something back to the
> >> > community. (stack incls Celery/RabbitMQ/Sphinx SE/PYQuery/Percona
> >> > MySQL/NGINX/supervisord/debian etc)
> >> > Therefore, I'd like to see if there would be any interest in a
> >> > webcast in which I would explain how we handle such large amounts of
> >> > data, the trial and error processes we went through, some really neat
> >> > tricks we've done to avoid bottlenecks, our own approach to smart
> >> > content filtering, and some of the valuable lessons we have learned.
> >> > The webcast would be completely free of charge, last a couple of
> >> > hours (with a short break) and anyone can attend. I'd also offer up a
> >> > Q&A session at the end.
> >> > If you're interested, please reply on-list so others can see.
> >> > Thanks
> >> > Cal
> >> >
> >>
> >>
> >>
> >> --
> >> Mvh/Best regards,
> >> Thomas Weholt
> >> http://www.weholt.org
> >>
> >
>
>
>
> --
> Mvh/Best regards,
> Thomas Weholt
> http://www.weholt.org
>
>

