Also, the 13.8 minutes per million is basically a benchmark based on the
number of db writes and the total amount of time it took to execute (which
was 51s).
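As a back-of-envelope check (assuming, per the above, that the figure is
normalised per million db writes rather than per million rows scanned), the
sample timings quoted below reproduce it:

    # Sanity check of the mins/million figures from the benchmark lines below
    writes, secs = 51180, 51.98  # the 51.98s sample
    print(round((1000000.0 / writes) * secs / 60, 1))  # -> 16.9
    writes, secs = 50847, 42.11  # the 42.11s sample
    print(round((1000000.0 / writes) * secs / 60, 1))  # -> 13.8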
Please also note, this code is doing a *heavy* amount of content analysis;
if you were to strip that out, the only overheads would be the
map/filter/lambda calls, the time it takes to transmit to MySQL, and the
time it takes for MySQL to perform the writes.

The database hardware spec is: 1x X3440 quad core (2 cores assigned to
MySQL), 12GB memory (4GB assigned to MySQL), /var/lib/mysql mapped to 2x
Intel M3 SSD drives in RAID 1.

Cal

On Wed, Jun 22, 2011 at 2:52 PM, Cal Leeming [Simplicity Media Ltd]
<cal.leem...@simplicitymedialtd.co.uk> wrote:

> Sorry, let me explain a little better.
>
> (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
> 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked',
> 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>
> map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked',
> 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>
> In the above example, it has found 49659 rows which need 'is_checked'
> changing to the value '1' (the same principle applies to the other 3
> fields), giving a total of 51,180 database writes, split into 4 queries.
>
> IDs are collected against those 4 fields like so:
>
>     if _f == 'block_images':
>         _obj_incs.get('is_image_blocked').append(_hit_id)
>         if _parent_id:
>             _obj_incs.get('is_image_blocked').append(_parent_id)
>
> Then I loop through those fields and do an update() using the necessary
> IDs:
>
>     # now apply the obj changes in bulk (massive speed improvements)
>     for _key, _value in _obj_incs.iteritems():
>         # update the child object
>         Post.objects.filter(
>             id__in = _value
>         ).update(
>             **{
>                 _key : 1
>             }
>         )
>
> So in simple terms, we're not doing 51 thousand update queries; instead
> we're grouping them into bulk queries based on the field being updated.
> It doesn't yet do grouping based on key AND value, simply because we
> didn't need it at the time, but if we release the code for public use,
> we'd definitely add this in.
>
> Hope this makes sense; let me know if I didn't explain it very well lol.
>
> Cal
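To make the pattern above concrete: below is a minimal, self-contained
sketch of one 50k-row analysis block, extended to group by field AND value
as Cal mentions. Post and the flag fields come from the thread; check_post(),
the myapp.models import path, the body attribute, and hit_count are
illustrative assumptions, not the actual code.

    from collections import defaultdict

    from django.db.models import F  # used in the F() variant shown below
    from myapp.models import Post   # assumed import path; Post is the thread's model

    def check_post(post):
        """Hypothetical stand-in for the heavy content-analysis step.

        Returns the (field, value) pairs this row should be updated with.
        """
        pairs = [('is_checked', 1)]
        if 'http' in (post.body or ''):  # 'body' is an assumed text field
            pairs.append(('has_link', 1))
        return pairs

    # (field, value) -> list of row ids, collected while scanning one block
    pending = defaultdict(list)

    for post in Post.objects.filter(id__range=(72500921, 72550921)).iterator():
        for field, value in check_post(post):
            pending[(field, value)].append(post.id)

    # Flush in bulk: one UPDATE per (field, value) group instead of one per
    # row -- each group emits roughly:
    #   UPDATE post SET <field> = <value> WHERE id IN (...)
    for (field, value), ids in pending.items():
        Post.objects.filter(id__in=ids).update(**{field: value})

    # The "update() + F()" variant mentioned downthread pushes relative
    # changes server-side in the same single query, e.g. (hit_count is a
    # hypothetical counter field):
    # Post.objects.filter(id__in=ids).update(hit_count=F('hit_count') + 1)

Grouping on the value as well as the field means rows targeting different
values still collapse into a handful of UPDATE statements per block, rather
than falling back to one query per row.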
> On Wed, Jun 22, 2011 at 2:45 PM, Thomas Weholt
> <thomas.weh...@gmail.com> wrote:
>
>> On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd]
>> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>> > Hey Thomas,
>> > Yeah, we actually spoke a little while ago about DSE. In the end, we
>> > used a custom approach which analyses data in blocks of 50k rows,
>> > builds a list of rows which need changing to the same value, then
>> > applies them in bulk using update() + F().
>>
>> Hmmm, what do you mean by "bulk using update() + F()"? Something like
>> "update sometable set somefield1 = somevalue1, somefield2 = somevalue2
>> where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
>> processed 13.8 million rows per minute? What kind of hardware did you
>> use?
>>
>> Thomas
>>
>> > Here's our benchmark:
>> > (42.11s) Found 49426 objs (match: 16107) (db writes: 50847) (range:
>> > 72300921 ~ 72350921), (avg 13.8 mins/million) - [('is_checked', 49426),
>> > ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
>> > (44.50s) Found 49481 objs (match: 16448) (db writes: 50764) (range:
>> > 72350921 ~ 72400921), (avg 14.6 mins/million) - [('is_checked', 49481),
>> > ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
>> > (55.78s) Found 49627 objs (match: 18516) (db writes: 50832) (range:
>> > 72400921 ~ 72450921), (avg 18.3 mins/million) - [('is_checked', 49627),
>> > ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
>> > (42.03s) Found 49674 objs (match: 17244) (db writes: 51655) (range:
>> > 72450921 ~ 72500921), (avg 13.6 mins/million) - [('is_checked', 49674),
>> > ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
>> > (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
>> > 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659),
>> > ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>> > Could you let me know if those benchmarks are better/worse than using
>> > DSE? I'd be interested to see the comparison!
>> > Cal
>> > On Wed, Jun 22, 2011 at 2:31 PM, Thomas Weholt
>> > <thomas.weh...@gmail.com> wrote:
>> >>
>> >> Yes! I'm in.
>> >>
>> >> Out of curiosity: when inserting lots of data, how do you do it? Using
>> >> the ORM? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
>> >> wrote DSE to solve inserting/updating huge sets of data, but if
>> >> there's a better way to do it, that would be especially interesting to
>> >> hear more about (and sorry for the self-promotion).
>> >>
>> >> Regards,
>> >> Thomas
>> >>
>> >> On Wed, Jun 22, 2011 at 3:15 PM, Cal Leeming [Simplicity Media Ltd]
>> >> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>> >> > Hi all,
>> >> > Some of you may have noticed that in the last few months I've done
>> >> > quite a few posts/snippets about handling large data sets in Django.
>> >> > At the end of this month (after what seems like a lifetime of trial
>> >> > and error), we're finally going to be releasing a new site which
>> >> > holds around 40mil+ rows of data, grows by about 300-500k rows each
>> >> > day, handles 5GB of uploads per day, and can handle around 1024
>> >> > requests per second in stress tests on a moderately spec'd server.
>> >> > As the entire thing is written in Django (and a bunch of other open
>> >> > source products), I'd really like to give something back to the
>> >> > community. (The stack includes Celery/RabbitMQ/Sphinx SE/PyQuery/
>> >> > Percona MySQL/NGINX/supervisord/Debian etc.)
>> >> > Therefore, I'd like to see if there would be any interest in a
>> >> > webcast in which I would explain how we handle such large amounts of
>> >> > data, the trial and error processes we went through, some really
>> >> > neat tricks we've used to avoid bottlenecks, our own approach to
>> >> > smart content filtering, and some of the valuable lessons we have
>> >> > learned. The webcast would be completely free of charge, last a
>> >> > couple of hours (with a short break), and anyone can attend. I'd
>> >> > also offer up a Q&A session at the end.
>> >> > If you're interested, please reply on-list so others can see.
>> >> > Thanks
>> >> > Cal
>> >>
>> >> --
>> >> Mvh/Best regards,
>> >> Thomas Weholt
>> >> http://www.weholt.org
>>
>> --
>> Mvh/Best regards,
>> Thomas Weholt
>> http://www.weholt.org