On Mon, 2008-11-17 at 12:15 -0500, Jason Gottshall wrote: > demerphq wrote: > > 2008/11/16 Jon Schutz <[EMAIL PROTECTED]>: > >> I've just done some benchmarking on MySQL 5 for queries where you need > >> to select back many individual records, e.g. you could use > >> > >> SELECT ... WHERE id=? (and iterate for all possible id values) > >> > >> or > >> > >> SELECT ... WHERE id=1 OR id=2 OR id=3 OR ... > >> > >> or > >> > >> SELECT ... WHERE id IN (1, 2, 3, ...) > >> > >> and there are several other ways to do it. > >> > >> What I found was that using OR, at a certain point (when selecting > > >> about 1000 keys) the performance went exponentially sour - like, to > >> quote one benchmark, the OR query took 6400 secs vs 30s typical for > >> other methods. I've written the results up in more detail at > >> http://notes.jschutz.net/19/perl/mysql-many-row-select-performance - > >> have a look at the graph if nothing else. Using 'IN' is the best > >> solution for any sort of large key set. > >> > >> I figure this is relevant to DBIx::Class since it uses SQL::Abstract > >> which will translate a search spec such as A => [ 1 .. 1000 ] into (A=? > >> OR A=? ... ) 1000 times, which could be a real gotcha (indeed, some of > >> my own code, which I shall now be reviewing, can expect arrays of tens > >> of thousands of keys...). > >> > >> I was wondering - has anyone else seen similar behaviour? > > > > Yes. In some cases or is faster (older MySQL servers) in some cases IN > > is faster (newer MySQL). In most cases they ARENT equivalent, the > > optimiser is just not smart enough, nor designed for this type of > > equivalence conversion. Also, with either, at a certain point the > > optimiser figures out it cant use the indexes on the column and turns > > things into a table scan, and with the IN, it can construct an in > > memory hash, and check each value in O(1) time, where as with the OR > > approach it doesnt, and thus operates in O(k) time. > > > > If you are generate IN lists of 10s of thousands of items I bet you > > would be better to create an indexed temporary table containing the > > list, and then use join syntax. > > We've found that Oracle (9 or 10) seems to have a hard limit of 1000 > values in an "IN()" criterion, so dbic probably shouldn't try to use > that as it's default behavior. I'm intrigued by the idea of using an > indexed temp table... >
That's a hard-coded hard limit, not like mysql which has a max_allowed_packet variable that limits the size of the query? Trouble is, even using OR ... as DBIC currently does, will eventually run into MySQL's max_allowed_packet limit (and presumably equivalent limits on any database), so the only way to be 100% reliable is to break the query into multiple queries, each under max_allowed_packet. I imagine it would be quite a challenge to abstract this reliably in SQL::Abstract. -- Jon Schutz My tech notes http://notes.jschutz.net Chief Technology Officer http://www.youramigo.com YourAmigo -- Jon Schutz My tech notes http://notes.jschutz.net Chief Technology Officer http://www.youramigo.com YourAmigo _______________________________________________ List: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/dbix-class IRC: irc.perl.org#dbix-class SVN: http://dev.catalyst.perl.org/repos/bast/DBIx-Class/ Searchable Archive: http://www.grokbase.com/group/[email protected]
