On 2012-10-09 17:24, Bob MacCallum wrote:
> Just for closure, I am now in business.
>
> [mysqld]
> tmp_table_size = 5G
>
>
> Background for whoever cares:
>
> I'm running something reporter-wise in parallel (4 threads) and wanted
> roughly equal workload per "chunk" of reporters (some reporters have
> more experiments+data than others) so having them in random order
> before chunking does the trick.
>
> The tmp tables that the SELECT * FROM Reporters ORDER BY RAND(123)
> creates are 3.2G each (times 4) and while doing this there's some
> serious I/O clogging the machine but it pulls through.  If only gnu
> parallel had a -stagger [secs] option!


ORDER BY RAND() is known for performance issues once the number of rows 
grows large. I guess it is because MySQL need to create a temporary 
table with *all* the original data plus the random number for each row 
before the data can be sorted.

It might be more efficient to first do 'select id from Reporters' 
(you'll need to use the Reporter.getDynamicQuery() for this), put all 
id's in an array, randomize that in memory and then bring in the actual 
reporter data one by one (or 100 by 100 or something). That should at 
least avoid the temporary table issue since the id's can be pulled in 
directly from the primary key index. In theory it should be quicker, but 
I guess it has to be verified in practice as well.

/Nicklas


------------------------------------------------------------------------------
Don't let slow site performance ruin your business. Deploy New Relic APM
Deploy New Relic app performance management and know exactly
what is happening inside your Ruby, Python, PHP, Java, and .NET app
Try New Relic at no cost today and get our sweet Data Nerd shirt too!
http://p.sf.net/sfu/newrelic-dev2dev
_______________________________________________
The BASE general discussion mailing list
basedb-users@lists.sourceforge.net
unsubscribe: send a mail with subject "unsubscribe" to
basedb-users-requ...@lists.sourceforge.net

Reply via email to