On Wed, Jun 22, 2011 at 11:47 AM, Cal Leeming [Simplicity Media Ltd] <
cal.leem...@simplicitymedialtd.co.uk> wrote:

>
>
> On Wed, Jun 22, 2011 at 3:25 PM, Andre Terra <andrete...@gmail.com> wrote:
>
>>  Hello, Cal
>>
>> First of all, congrats on the newborn! The Django community will surely
>> benefit from having yet another success story, especially considering how
>> big this project sounds. Is there any chance you could open-source some of
>> your custom-made improvements so that they could eventually be merged into
>> trunk?
>>
>
> Thank you! Yeah, the plan is to release as many of the improvements as
> possible as open source, although I'd rely heavily on the community to make
> them 'patch-worthy' for the core, as the amount of spare time I have is
> somewhat limited.
>
> The improvements list is growing by the day, and I usually try to post as
> many snippets and/or tickets as I can.
>
> It sounds like Thomas's DSE might be the perfect place for the bulk update
> code too.
>

Thanks a lot for the quick reply. I'll keep my eyes open for the code, and
if I'm unable to contribute relevant modifications to the patches, I'll at
least try to document and test them!



>> I've definitely noticed you mentioning large DBs over the past few months. I,
>> along with many others I assume, would surely like to attend the webcast,
>> with the only impediment being my schedule/timezone.
>>
>
> Once we've got a list of all the people who want to attend, I'll send out a
> mail asking for everyone's timezone and availability, so we can figure out
> what works best for everyone.
>

Definitely sign me up for the list of attendees, then!



>> I recently asked about working with temporary tables for
>> filtering/grouping data from uploads and inserting rows from that
>> temporary table into permanent tables. To make matters worse, I wanted
>> to make this as flexible as possible (i.e. dynamic models) so that
>> everything could be managed from a web app. Do you have any experience you
>> could share about any of these use cases? As far as I know, there's nothing
>> in the ORM that replicates PostgreSQL's CREATE TEMPORARY TABLE. My
>> experience with SQL is rather limited, but from asking around, it seems like
>> my project could indeed benefit from such a feature. If I had to guess, I
>> would assume other DBMSs would offer something similar, but being limited to
>> Postgres is okay for me, for now, anyway.
>>
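>> To illustrate, here's the kind of thing I imagine I'd have to do with raw
>> SQL through Django's cursor (untested sketch; table and column names are
>> made up):
>>
>>     from django.db import connection, transaction
>>
>>     def stage_and_merge(rows):
>>         cursor = connection.cursor()
>>         # Session-scoped scratch table; Postgres drops it automatically
>>         # when the connection closes.
>>         cursor.execute("""
>>             CREATE TEMPORARY TABLE upload_staging (
>>                 payload text NOT NULL
>>             )
>>         """)
>>         cursor.executemany(
>>             "INSERT INTO upload_staging (payload) VALUES (%s)",
>>             [(r,) for r in rows],
>>         )
>>         # Filter/group in SQL, then copy the survivors into the
>>         # permanent table.
>>         cursor.execute("""
>>             INSERT INTO app_record (payload)
>>             SELECT DISTINCT payload FROM upload_staging
>>         """)
>>         transaction.commit_unless_managed()
>>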
>
> I haven't had any exposure to Postgres, but my experience with temporary
> tables hasn't been a nice one (with MySQL, at least). MySQL has many
> gotchas when it comes to temporary tables and indexing, and on more than one
> occasion I found it was actually quicker to analyse/mangle/re-insert the
> data via Python code than it was to attempt the modifications within MySQL
> using a temporary table.
>
> It really does depend on what your data is, and what you want to do with
> it, which can make planning ahead somewhat tedious lol.
>
> For our stuff, when we need to do bulk modifications, we have a list of
> filtering rules which is run every hour against new rows (with is_checked=1
> set on rows which have already been checked). We then use bulk queries of
> 50k rows at a time (id >= 0 AND id < 50000), rather than LIMIT/OFFSET
> (because LIMIT/OFFSET gets slower and slower as the offset grows). Those
> queries are analysed/mangled within a transaction, and bulk updated using
> the method mentioned in the reply to Thomas.
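>
> To give you a rough idea, here's a simplified sketch of that chunking
> pattern (model and helper names here are invented, and the actual
> analyse/mangle step is left out):
>
>     from django.db import transaction
>     from myapp.models import Record  # stand-in for our real model
>
>     CHUNK = 50000
>
>     @transaction.commit_on_success
>     def process_chunk(lower, upper):
>         # A range scan on the indexed PK stays fast; LIMIT/OFFSET would
>         # have to skip every earlier row again on each pass.
>         for row in Record.objects.filter(id__gte=lower, id__lt=upper,
>                                          is_checked=0):
>             apply_filter_rules(row)  # hypothetical analyse/mangle step
>         Record.objects.filter(id__gte=lower, id__lt=upper,
>                               is_checked=0).update(is_checked=1)
>
>     def run_hourly(max_id):
>         for lower in xrange(0, max_id + 1, CHUNK):
>             process_chunk(lower, lower + CHUNK)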
>
> Sadly though, I can't say if the methods we use would be suitable for you,
> as we haven't tried them against Postgres, and we've only tested them
> against our own data set + requirements. This is what I mean by trial and
> error; it's a pain in the ass :)
>


Thanks again for your enlightening input. Even with our different
requirements, this was actually quite helpful in clearing up several
doubts I had about how to go about this project.


Cheers,

André


On Wed, Jun 22, 2011 at 10:56 AM, Cal Leeming [Simplicity Media Ltd] <
cal.leem...@simplicitymedialtd.co.uk> wrote:

> Also, the 13.8 minutes per million is basically a benchmark based on the
> number of db writes and the total amount of time it took to execute (which
> was 51s).
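>
> (For the arithmetic: 13.8 minutes per million is ~828s per 1M rows, so a
> 51s run at that rate works out to about 51 / 828 * 1,000,000 ~= 62k
> writes, assuming the rate scales linearly.)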
>
> Please also note, this code is doing a *heavy* amount of content analysis,
> but if you were to strip that out, the only overheads would be the
> map/filter/lambda calls, the time it takes to transmit the data to MySQL,
> and the time it takes for MySQL to perform the writes.
>
> The database hardware spec is:
>
> 1x X3440 quad core (2 cores assigned to MySQL).
> 12 GB memory (4 GB assigned to MySQL).
> /var/lib/mysql mapped to 2x Intel M3 SSD drives in RAID 1.
>
> Cal
>
>
> On Wed, Jun 22, 2011 at 2:52 PM, Cal Leeming [Simplicity Media Ltd] <
> cal.leem...@simplicitymedialtd.co.uk> wrote:
>
>> Sorry, let me explain a little better.
>>
>
(...)
