Hello, Cal

First of all, congrats on the newborn! The Django community will surely
benefit from having yet another success story, especially considering how
big this project sounds. Is there any chance you could open-source some of
your custom-made improvements so that they could eventually be merged into
trunk?

I've definitely noticed your posts about large databases over the past few
months. I, along with many others I assume, would very much like to attend
the webcast; the only impediment on my side is my schedule/timezone.

I recently asked about working with temporary tables for filtering/grouping
uploaded data and then inserting the results from that temporary table into
a permanent table. To make matters worse, I wanted to make this as flexible
as possible (i.e. dynamic models) so that everything could be managed from a
web app. Do you have any experience you could share about any of these use
cases? As far as I know, there's nothing in the ORM that replicates
PostgreSQL's CREATE TEMPORARY TABLE. My experience with SQL is rather
limited, but from asking around, it seems my project could indeed benefit
from such a feature. I'd assume other DBMSs offer something similar, but
being limited to Postgres is okay for me, for now anyway.
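
From asking around, the suggested workaround seems to be dropping down to
raw SQL through the ORM's connection object. Here's a rough sketch of the
kind of thing I have in mind (the table and column names are placeholders I
made up, not a real schema):

    from django.db import connection, transaction

    @transaction.commit_on_success
    def merge_upload(rows):
        # rows: list of (col_a, col_b) tuples parsed from an uploaded file
        cursor = connection.cursor()
        # PostgreSQL: the temp table is private to this session and is
        # dropped automatically when the transaction commits
        cursor.execute("""
            CREATE TEMPORARY TABLE staging_upload (
                col_a integer,
                col_b text
            ) ON COMMIT DROP
        """)
        cursor.executemany(
            "INSERT INTO staging_upload (col_a, col_b) VALUES (%s, %s)",
            rows,
        )
        # filter/group inside the database, then copy the survivors
        # into the permanent table
        cursor.execute("""
            INSERT INTO myapp_permanent (col_a, col_b)
            SELECT col_a, min(col_b)
            FROM staging_upload
            GROUP BY col_a
        """)

But of course none of that goes through the ORM, which is why I'm asking.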



Cheers,
André


On Wed, Jun 22, 2011 at 10:56 AM, Cal Leeming [Simplicity Media Ltd] <
cal.leem...@simplicitymedialtd.co.uk> wrote:

> Also, the 13.8 minutes per million is basically a benchmark derived from
> the number of db writes and the total amount of time the run took to
> execute (which was 51s).
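>
> To make the arithmetic explicit:
>
>     mins/million = (elapsed seconds / db writes) * 1,000,000 / 60
>
> e.g. (51.98 / 51180) * 1,000,000 / 60 = ~16.9 for the run quoted below,
> and (42.11 / 50847) * 1,000,000 / 60 = ~13.8 for the first run in the
> benchmark list.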
>
> Please also note, this code is doing a *heavy* amount of content analysis,
> but if you were to strip that out, the only overheads would be the
> map/filter/lambda, the time it takes to transmit to MySQL, and the time it
> takes for MySQL to perform the writes.
>
> The database hardware spec is:
>
> 1x X3440 quad core (2 cores assigned to MySQL).
> 12GB memory (4 GB assigned to MySQL).
> /var/lib/mysql mapped to 2x Intel M3 SSD drives in RAID 1.
>
> Cal
>
>
> On Wed, Jun 22, 2011 at 2:52 PM, Cal Leeming [Simplicity Media Ltd] <
> cal.leem...@simplicitymedialtd.co.uk> wrote:
>
>> Sorry, let me explain a little better.
>>
>> (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
>> 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked',
>> 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>>
>> map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked',
>> 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>>
>> In the above example, it has found 49659 rows which need 'is_checked'
>> changing to the value '1' (the same principle applies to the other 3
>> fields), giving a total of 51,180 database writes, split into 4 queries.
>>
>> Each of those 4 fields has the relevant row IDs collected against it:
>>
>>     if _f == 'block_images':
>>         _obj_incs.get('is_image_blocked').append(_hit_id)
>>         if _parent_id:
>>             _obj_incs.get('is_image_blocked').append(_parent_id)
>>
>> Then I loop through those fields, and do an update() using the necessary
>> IDs:
>>
>>     # now apply the obj changes in bulk (massive speed improvements)
>>     for _key, _value in _obj_incs.iteritems():
>>         # update the child objects
>>         Post.objects.filter(
>>             id__in = _value
>>         ).update(
>>             **{
>>                 _key : 1
>>             }
>>         )
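>>
>> For fields where we need a relative change rather than setting a fixed
>> value, the same batched filter() works with an F() expression instead of
>> a literal (hit_count here is just an illustrative field name, not our
>> real schema):
>>
>>     from django.db.models import F
>>
>>     # one UPDATE ... SET hit_count = hit_count + 1 WHERE id IN (...)
>>     # for the whole batch
>>     Post.objects.filter(
>>         id__in = _value
>>     ).update(
>>         hit_count = F('hit_count') + 1
>>     )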
>>
>> So in simple terms, we're not doing 51 thousand update queries; instead
>> we're grouping them into bulk queries based on the field to be updated.
>> It doesn't yet do grouping based on key AND value, simply because we
>> didn't need it at the time, but if we release the code for public use,
>> we'd definitely add this in.
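>>
>> If we did add it, it would probably be little more than keying the dict
>> on (field, value) pairs instead of just field. A rough sketch (not the
>> code we actually run):
>>
>>     from collections import defaultdict
>>
>>     def apply_changes(changes):
>>         # changes: iterable of (row_id, field, value) triples
>>         grouped = defaultdict(list)
>>         for _id, _field, _value in changes:
>>             grouped[(_field, _value)].append(_id)
>>         # one bulk UPDATE per distinct (field, value) pair
>>         # (Post is the same model as in the snippet above)
>>         for (_field, _value), _ids in grouped.iteritems():
>>             Post.objects.filter(id__in = _ids).update(**{_field: _value})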
>>
>> Hope this makes sense, let me know if I didn't explain it very well lol.
>>
>> Cal
>>
>> On Wed, Jun 22, 2011 at 2:45 PM, Thomas Weholt
>> <thomas.weh...@gmail.com> wrote:
>>
>>> On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd]
>>> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>>> > Hey Thomas,
>>> > Yeah, we actually spoke a little while ago about DSE. In the end, we
>>> > used a custom approach which analyses data in blocks of 50k rows,
>>> > builds a list of rows which need changing to the same value, then
>>> > applies them in bulk using update() + F().
>>>
>>> Hmmm, what do you mean by "bulk using update() + F()"? Something like
>>> "update sometable set somefield1 = somevalue1, somefield2 = somevalue2
>>> where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
>>> processed 13.8 million rows per minute? What kind of hardware did you
>>> use?
>>>
>>> Thomas
>>>
>>> > Here's our benchmark:
>>> > (42.11s) Found 49426 objs (match: 16107) (db writes: 50847)
>>> > (range: 72300921 ~ 72350921), (avg 13.8 mins/million) -
>>> > [('is_checked', 49426), ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
>>> > (44.50s) Found 49481 objs (match: 16448) (db writes: 50764)
>>> > (range: 72350921 ~ 72400921), (avg 14.6 mins/million) -
>>> > [('is_checked', 49481), ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
>>> > (55.78s) Found 49627 objs (match: 18516) (db writes: 50832)
>>> > (range: 72400921 ~ 72450921), (avg 18.3 mins/million) -
>>> > [('is_checked', 49627), ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
>>> > (42.03s) Found 49674 objs (match: 17244) (db writes: 51655)
>>> > (range: 72450921 ~ 72500921), (avg 13.6 mins/million) -
>>> > [('is_checked', 49674), ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
>>> > (51.98s) Found 49659 objs (match: 16563) (db writes: 51180)
>>> > (range: 72500921 ~ 72550921), (avg 16.9 mins/million) -
>>> > [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>>> > Could you let me know if those benchmarks are better/worse than
>>> > using DSE? I'd be interested to see the comparison!
>>> > Cal
>>> > On Wed, Jun 22, 2011 at 2:31 PM, Thomas Weholt
>>> > <thomas.weh...@gmail.com> wrote:
>>> >>
>>> >> Yes! I'm in.
>>> >>
>>> >> Out of curiosity: when inserting lots of data, how do you do it? Using
>>> >> the ORM? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
>>> >> wrote DSE to solve inserting/updating huge sets of data, but if
>>> >> there's a better way to do it, that would be especially interesting to
>>> >> hear more about (and sorry for the self-promotion).
>>> >>
>>> >> Regards,
>>> >> Thomas
>>> >>
>>> >> On Wed, Jun 22, 2011 at 3:15 PM, Cal Leeming [Simplicity Media Ltd]
>>> >> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>>> >> > Hi all,
>>> >> > Some of you may have noticed, in the last few months I've done
>>> >> > quite a few posts/snippets about handling large data sets in
>>> >> > Django. At the end of this month (after what seems like a lifetime
>>> >> > of trial and error), we're finally going to be releasing a new site
>>> >> > which holds around 40mil+ rows of data, grows by about 300-500k
>>> >> > rows each day, handles 5GB of uploads per day, and can handle
>>> >> > around 1024 requests per second under stress test on a moderately
>>> >> > spec'd server.
>>> >> > As the entire thing is written in Django (and a bunch of other open
>>> >> > source products), I'd really like to give something back to the
>>> >> > community. (stack includes Celery/RabbitMQ/Sphinx SE/PYQuery/Percona
>>> >> > MySQL/NGINX/supervisord/debian etc)
>>> >> > Therefore, I'd like to see if there would be any interest in a
>>> >> > webcast in which I would explain how we handle such large amounts
>>> >> > of data, the trial and error processes we went through, some really
>>> >> > neat tricks we've used to avoid bottlenecks, our own approach to
>>> >> > smart content filtering, and some of the valuable lessons we have
>>> >> > learned. The webcast would be completely free of charge, last a
>>> >> > couple of hours (with a short break), and anyone could attend. I'd
>>> >> > also offer up a Q&A session at the end.
>>> >> > If you're interested, please reply on-list so others can see.
>>> >> > Thanks
>>> >> > Cal
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Mvh/Best regards,
>>> >> Thomas Weholt
>>> >> http://www.weholt.org
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Mvh/Best regards,
>>> Thomas Weholt
>>> http://www.weholt.org
>>>
>>
