On Wed, Jun 22, 2011 at 3:25 PM, Andre Terra <andrete...@gmail.com> wrote:
> Hello, Cal
>
> First of all, congrats on the newborn! The Django community will surely
> benefit from having yet another success story, especially considering how
> big this project sounds. Is there any chance you could open-source some of
> your custom made improvements so that they could eventually be merged to
> trunk?

Thank you! Yeah, the plan is to release as many of the improvements as
possible as open source, although I'd rely heavily on the community to make
them 'patch worthy' for core, as the amount of spare time I have is somewhat
limited. The improvements list is growing by the day, and I usually try to
post as many snippets and/or tickets as I can. It sounds like Thomas's DSE
might be the perfect place for the bulk update code too.

> I definitely noticed how you mentioned large dbs in the past few months. I,
> along with many others I assume, would surely like to attend the webcast,
> with the only impediment being my schedule/timezone.

Once we've got a list of all the people who want to attend, I'll send out a
mail asking for everyone's timezone and availability, so we can figure out
what is best for everyone.

> I recently asked about working with temporary tables for filtering/grouping
> data from uploads and inserting queries from that temporary table onto a
> permanent database. To make matters worse, I wanted to make this as flexible
> as possible (i.e. dynamic models) so that everything could be managed from a
> web app. Do you have any experience you could share about any of these use
> cases? As far as I know, there's nothing in the ORM that replicates
> PostgreSQL's CREATE TEMPORARY TABLE. My experience with SQL is rather
> limited, but from asking around, it seems like my project could indeed
> benefit from such a feature. If I had to guess, I would assume other DBMSs
> would offer something similar, but being limited to Postgres is okay for me,
> for now, anyway.

I haven't had any exposure to Postgres, but my experience with temporary
tables hasn't been a nice one (with MySQL at least). MySQL has many gotchas
when it comes to temporary tables and indexing, and on more than one occasion
I found it was actually quicker to analyse/mangle/re-insert the data via
Python code than it was to attempt the modifications within MySQL using a
temporary table. It really does depend on what your data is and what you want
to do with it, which can make planning ahead somewhat tedious lol.

For our stuff, when we need to do bulk modifications, we have a filtering
rules list which is run every hour against new rows (with is_checked=1 set on
rows which have been checked). We then use bulk queries of 50k rows
(id >= 0 AND id < 50000) rather than LIMIT/OFFSET, because LIMIT/OFFSET gets
slower and slower the larger the result set. Those blocks are analysed/mangled
within a transaction, and bulk updated using the method mentioned in the reply
to Thomas further down; a rough sketch of the pattern follows below.

Sadly though, I can't say if the methods we use would be suitable for you, as
we haven't tried them against Postgres, and we've only tested them against our
own data set + requirements. This is what I mean by trial and error, it's a
pain in the ass :)
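To give a rough idea of the overall shape of that hourly loop, here's a
simplified sketch (not our production code; the rule helpers, the import path
and the 'body' field are made up for illustration, while the Post model and
the flag fields match the snippets quoted further down):

    from collections import defaultdict

    from django.db import transaction

    from myapp.models import Post   # hypothetical import path for the model

    CHUNK = 50000   # scan the table in fixed id ranges, never LIMIT/OFFSET

    def looks_like_spam(text):
        # stand-in for the real content analysis rules
        return 'viagra' in text.lower()

    def contains_link(text):
        # stand-in for the real link detection
        return 'http://' in text or 'https://' in text

    def process_range(start_id, end_id):
        low = start_id
        while low < end_id:
            high = min(low + CHUNK, end_id)
            pending = defaultdict(list)   # field name -> ids needing it set to 1

            # A primary key range scan stays fast however deep into the table
            # we are, unlike LIMIT/OFFSET which degrades as the offset grows.
            for post in Post.objects.filter(id__gte=low, id__lt=high,
                                            is_checked=0).iterator():
                pending['is_checked'].append(post.id)
                if looks_like_spam(post.body):
                    pending['is_spam'].append(post.id)
                if contains_link(post.body):
                    pending['has_link'].append(post.id)

            with transaction.atomic():   # commit_on_success() on older Django
                for field, ids in pending.items():
                    if ids:
                        # one UPDATE ... WHERE id IN (...) per field,
                        # instead of one UPDATE per row
                        Post.objects.filter(id__in=ids).update(**{field: 1})

            low = high

The point is that each 50k block costs a handful of UPDATE statements rather
than tens of thousands of per-row writes.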
>
> Cheers,
> André
>
>
> On Wed, Jun 22, 2011 at 10:56 AM, Cal Leeming [Simplicity Media Ltd]
> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>
>> Also, the 13.8 minutes per million is basically a benchmark based on the
>> amount of db writes and the total amount of time it took to execute (which
>> was 51s).
>>
>> Please also note, this code is doing a *heavy* amount of content analysis,
>> but if you were to strip that out, the only overheads would be the
>> map/filter/lambda, the time it takes to transmit to MySQL, and the time it
>> takes for MySQL to perform the writes.
>>
>> The database hardware spec is:
>>
>> 1x X3440 quad core (2 cores assigned to MySQL).
>> 12GB memory (4GB assigned to MySQL).
>> /var/lib/mysql mapped to 2x Intel M3 SSD drives in RAID 1.
>>
>> Cal
>>
>>
>> On Wed, Jun 22, 2011 at 2:52 PM, Cal Leeming [Simplicity Media Ltd]
>> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>>
>>> Sorry, let me explain a little better.
>>>
>>> (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
>>> 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659),
>>> ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>>>
>>> map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked',
>>> 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>>>
>>> In the above example, it has found 49659 rows which need 'is_checked'
>>> changing to the value '1' (the same principle applies to the other 3),
>>> giving a total of 51,180 database writes, split into 4 queries.
>>>
>>> Each of those 4 fields has a list of IDs assigned to it:
>>>
>>> if _f == 'block_images':
>>>     _obj_incs.get('is_image_blocked').append(_hit_id)
>>>     if _parent_id:
>>>         _obj_incs.get('is_image_blocked').append(_parent_id)
>>>
>>> Then I loop through those fields, and do an update() using the necessary
>>> IDs:
>>>
>>> # now apply the obj changes in bulk (massive speed improvements)
>>> for _key, _value in _obj_incs.iteritems():
>>>     # update the child object
>>>     Post.objects.filter(
>>>         id__in = _value
>>>     ).update(
>>>         **{
>>>             _key : 1
>>>         }
>>>     )
>>>
>>> So in simple terms, we're not doing 51 thousand update queries; instead
>>> we're grouping them into bulk queries based on the field to be updated.
>>> It doesn't yet do grouping based on key AND value, simply because we
>>> didn't need it at the time, but if we release the code for public use,
>>> we'd definitely add this in.
>>>
>>> Hope this makes sense, let me know if I didn't explain it very well lol.
>>>
>>> Cal
>>>
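A quick note on the "update() + F()" wording Thomas asks about below: the
snippet above only ever sets literal values, but F() is the standard Django
construct for the cases where a bulk update needs to reference a column's
current value (e.g. incrementing a counter). Roughly, with a made-up
view_count field:

    from django.db.models import F

    # add 1 to view_count on every matched row, in a single UPDATE statement
    Post.objects.filter(id__in=_value).update(view_count=F('view_count') + 1)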
>>> On Wed, Jun 22, 2011 at 2:45 PM, Thomas Weholt
>>> <thomas.weh...@gmail.com> wrote:
>>>
>>>> On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd]
>>>> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>>>> > Hey Thomas,
>>>> > Yeah we actually spoke a little while ago about DSE. In the end, we
>>>> > actually used a custom approach which analyses data in blocks of 50k
>>>> > rows, builds a list of rows which need changing to the same value,
>>>> > then applies them in bulk using update() + F().
>>>>
>>>> Hmmm, what do you mean by "bulk using update() + F()"? Something like
>>>> "update sometable set somefield1 = somevalue1, somefield2 = somevalue2
>>>> where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
>>>> processed 13.8 million rows per minute? What kind of hardware did you
>>>> use?
>>>>
>>>> Thomas
>>>>
>>>> > Here's our benchmark:
>>>> >
>>>> > (42.11s) Found 49426 objs (match: 16107) (db writes: 50847) (range:
>>>> > 72300921 ~ 72350921), (avg 13.8 mins/million) - [('is_checked', 49426),
>>>> > ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
>>>> > (44.50s) Found 49481 objs (match: 16448) (db writes: 50764) (range:
>>>> > 72350921 ~ 72400921), (avg 14.6 mins/million) - [('is_checked', 49481),
>>>> > ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
>>>> > (55.78s) Found 49627 objs (match: 18516) (db writes: 50832) (range:
>>>> > 72400921 ~ 72450921), (avg 18.3 mins/million) - [('is_checked', 49627),
>>>> > ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
>>>> > (42.03s) Found 49674 objs (match: 17244) (db writes: 51655) (range:
>>>> > 72450921 ~ 72500921), (avg 13.6 mins/million) - [('is_checked', 49674),
>>>> > ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
>>>> > (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
>>>> > 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659),
>>>> > ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
>>>> >
>>>> > Could you let me know if those benchmarks are better/worse than using
>>>> > DSE? I'd be interested to see the comparison!
>>>> >
>>>> > Cal
>>>> >
>>>> > On Wed, Jun 22, 2011 at 2:31 PM, Thomas Weholt
>>>> > <thomas.weh...@gmail.com> wrote:
>>>> >>
>>>> >> Yes! I'm in.
>>>> >>
>>>> >> Out of curiosity: When inserting lots of data, how do you do it? Using
>>>> >> the orm? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
>>>> >> wrote DSE to solve inserting/updating huge sets of data, but if
>>>> >> there's a better way to do it that would be especially interesting to
>>>> >> hear more about (and sorry for the self-promotion).
>>>> >>
>>>> >> Regards,
>>>> >> Thomas
>>>> >>
>>>> >> On Wed, Jun 22, 2011 at 3:15 PM, Cal Leeming [Simplicity Media Ltd]
>>>> >> <cal.leem...@simplicitymedialtd.co.uk> wrote:
>>>> >> > Hi all,
>>>> >> >
>>>> >> > Some of you may have noticed that in the last few months I've done
>>>> >> > quite a few posts/snippets about handling large data sets in Django.
>>>> >> > At the end of this month (after what seems like a lifetime of trial
>>>> >> > and error), we're finally going to be releasing a new site which
>>>> >> > holds around 40mil+ rows of data, grows by about 300-500k rows each
>>>> >> > day, handles 5GB of uploads per day, and can handle around 1024
>>>> >> > requests per second under stress testing on a moderately spec'd
>>>> >> > server.
>>>> >> >
>>>> >> > As the entire thing is written in Django (and a bunch of other open
>>>> >> > source products), I'd really like to give something back to the
>>>> >> > community (the stack includes Celery/RabbitMQ/Sphinx SE/PYQuery/
>>>> >> > Percona MySQL/NGINX/supervisord/debian etc).
>>>> >> >
>>>> >> > Therefore, I'd like to see if there would be any interest in a
>>>> >> > webcast in which I would explain how we handle such large amounts of
>>>> >> > data, the trial and error processes we went through, some really
>>>> >> > neat tricks we've done to avoid bottlenecks, our own approach to
>>>> >> > smart content filtering, and some of the valuable lessons we have
>>>> >> > learned. The webcast would be completely free of charge, last a
>>>> >> > couple of hours (with a short break), and anyone can attend.
>>>> >> > I'd also offer up a Q&A session at the end.
>>>> >> >
>>>> >> > If you're interested, please reply on-list so others can see.
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >> > Cal
>>>> >>
>>>> >> --
>>>> >> Mvh/Best regards,
>>>> >> Thomas Weholt
>>>> >> http://www.weholt.org
>>>>
>>>> --
>>>> Mvh/Best regards,
>>>> Thomas Weholt
>>>> http://www.weholt.org

--
You received this message because you are subscribed to the Google Groups
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/django-users?hl=en.