Also, the mins-per-million figure is essentially a benchmark based on the
number of db writes and the total time the block took to execute (e.g. the
51.98s block quoted below works out at 16.9 mins/million).
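
For clarity, this is how that figure appears to be derived from the quoted
numbers (an assumed formula, not the actual benchmark code):

```python
# Assumed derivation of the "mins/million" figure from the quoted benchmark
# lines: scale the elapsed time for one block up to one million db writes.
def mins_per_million(elapsed_s, db_writes):
    return elapsed_s / db_writes * 1e6 / 60.0

# 51.98s block with 51,180 db writes (the example quoted below):
print(round(mins_per_million(51.98, 51180), 1))   # -> 16.9
# 42.11s block with 50,847 db writes (deeper in the thread):
print(round(mins_per_million(42.11, 50847), 1))   # -> 13.8
```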

Please also note that this code is doing a *heavy* amount of content
analysis; if you were to strip that out, the only overheads would be the
map/filter/lambda calls, the time it takes to transmit to MySQL, and the
time it takes for MySQL to perform the writes.
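
To illustrate the batching idea itself, here's a minimal pure-Python sketch
of the grouping step (the data and names here are hypothetical; the real
code drives the Django ORM as shown in the quoted message below):

```python
from collections import defaultdict

# Hypothetical output of the content-analysis pass: (row_id, flag) pairs
# indicating which flag field needs setting to 1 on which row.
hits = [(101, 'is_checked'), (102, 'is_checked'),
        (102, 'has_link'), (103, 'is_spam')]

# Group row IDs by flag, so each flag becomes a single bulk
# "UPDATE ... WHERE id IN (...)" instead of one query per row.
obj_incs = defaultdict(list)
for row_id, flag in hits:
    obj_incs[flag].append(row_id)

for flag, ids in sorted(obj_incs.items()):
    print("UPDATE post SET %s = 1 WHERE id IN (%s)"
          % (flag, ', '.join(str(i) for i in ids)))
```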

The database hardware spec is:

1x X3440 quad core (2 cores assigned to MySQL).
12GB memory (4 GB assigned to MySQL).
/var/lib/mysql mapped to 2x Intel M3 SSD drives in RAID 1.

Cal


On Wed, Jun 22, 2011 at 2:52 PM, Cal Leeming [Simplicity Media Ltd] <
cal.leem...@simplicitymedialtd.co.uk> wrote:

> Sorry, let me explain a little better.
>
> (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range: 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>
> map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>
> In the above example, it has found 49659 rows which need 'is_checked'
> changing to the value '1' (the same principle applies to the other 3
> fields), giving a total of 51,180 database writes, split into 4 queries.
>
> Those 4 fields have the IDs assigned to them:
>
>     if _f == 'block_images':
>         _obj_incs.get('is_image_blocked').append(_hit_id)
>         if _parent_id:
>             _obj_incs.get('is_image_blocked').append(_parent_id)
>
> Then I loop through those fields, and do an update() using the necessary
> IDs:
>
>     # now apply the obj changes in bulk (massive speed improvements)
>     for _key, _value in _obj_incs.iteritems():
>         # update the child object
>         Post.objects.filter(id__in=_value).update(**{_key: 1})
>
> So in simple terms, we're not doing 51 thousand individual update queries;
> instead we're grouping them into bulk queries based on the field to be
> updated. It doesn't yet do grouping based on key AND value, simply because
> we didn't need it at the time, but if we release the code for public use,
> we'd definitely add this in.
>
> Hope this makes sense, let me know if I didn't explain it very well lol.
>
> Cal
>
> On Wed, Jun 22, 2011 at 2:45 PM, Thomas Weholt <thomas.weh...@gmail.com> wrote:
>
>> On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd]
>> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>> > Hey Thomas,
>> > Yeah we actually spoke a little while ago about DSE. In the end, we actually
>> > used a custom approach which analyses data in blocks of 50k rows, builds a
>> > list of rows which need changing to the same value, then applies them in
>> > bulk using update() + F().
>>
>> Hmmm, what do you mean by "bulk using update() + F()"? Something like
>> "update sometable set somefield1 = somevalue1, somefield2 = somevalue2
>> where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
>> processed 13.8 million rows per minute? What kind of hardware did you
>> use?
>>
>> Thomas
>>
>> > Here's our benchmark:
>> > (42.11s) Found 49426 objs (match: 16107) (db writes: 50847) (range: 72300921 ~ 72350921), (avg 13.8 mins/million) - [('is_checked', 49426), ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
>> > (44.50s) Found 49481 objs (match: 16448) (db writes: 50764) (range: 72350921 ~ 72400921), (avg 14.6 mins/million) - [('is_checked', 49481), ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
>> > (55.78s) Found 49627 objs (match: 18516) (db writes: 50832) (range: 72400921 ~ 72450921), (avg 18.3 mins/million) - [('is_checked', 49627), ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
>> > (42.03s) Found 49674 objs (match: 17244) (db writes: 51655) (range: 72450921 ~ 72500921), (avg 13.6 mins/million) - [('is_checked', 49674), ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
>> > (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range: 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>> > Could you let me know if those benchmarks are better/worse than using DSE?
>> > I'd be interested to see the comparison!
>> > Cal
>> > On Wed, Jun 22, 2011 at 2:31 PM, Thomas Weholt <thomas.weh...@gmail.com> wrote:
>> >>
>> >> Yes! I'm in.
>> >>
>> >> Out of curiosity: When inserting lots of data, how do you do it? Using
>> >> the orm? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
>> >> wrote DSE to solve inserting/updating huge sets of data, but if
>> >> there's a better way to do it that would be especially interesting to
>> >> hear more about ( and sorry for the self promotion ).
>> >>
>> >> Regards,
>> >> Thomas
>> >>
>> >> On Wed, Jun 22, 2011 at 3:15 PM, Cal Leeming [Simplicity Media Ltd]
>> >> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>> >> > Hi all,
>> >> > Some of you may have noticed, in the last few months I've done quite a few
>> >> > posts/snippets about handling large data sets in Django. At the end of this
>> >> > month (after what seems like a lifetime of trial and error), we're finally
>> >> > going to be releasing a new site which holds around 40mil+ rows of data,
>> >> > grows by about 300-500k rows each day, handles 5GB of uploads per day, and
>> >> > can handle around 1024 requests per second under stress test on a moderately
>> >> > spec'd server.
>> >> > As the entire thing is written in Django (and a bunch of other open source
>> >> > products), I'd really like to give something back to the community. (stack
>> >> > incls Celery/RabbitMQ/Sphinx SE/PYQuery/Percona MySQL/NGINX/supervisord/debian etc)
>> >> > Therefore, I'd like to see if there would be any interest in a webcast in
>> >> > which I would explain how we handle such large amounts of data, the trial
>> >> > and error processes we went through, some really neat tricks we've done to
>> >> > avoid bottlenecks, our own approach to smart content filtering, and some of
>> >> > the valuable lessons we have learned. The webcast would be completely free
>> >> > of charge, last a couple of hours (with a short break), and anyone can
>> >> > attend. I'd also offer up a Q&A session at the end.
>> >> > If you're interested, please reply on-list so others can see.
>> >> > Thanks
>> >> > Cal
>> >> >
>> >> > --
>> >> > You received this message because you are subscribed to the Google
>> >> > Groups
>> >> > "Django users" group.
>> >> > To post to this group, send email to django-users@googlegroups.com.
>> >> > To unsubscribe from this group, send email to
>> >> > django-users+unsubscr...@googlegroups.com.
>> >> > For more options, visit this group at
>> >> > http://groups.google.com/group/django-users?hl=en.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Mvh/Best regards,
>> >> Thomas Weholt
>> >> http://www.weholt.org
>> >>
>> >
>>
>>
>>
>> --
>> Mvh/Best regards,
>> Thomas Weholt
>> http://www.weholt.org
>>
>
