Thanks Arek. I didn't realise MySQL also suffered from this problem. Nonetheless, the existing Mart code doesn't take this issue of potential re-ordering of rows across multiple queries into account, and so the associated problem could happen to anyone at any time.
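To make this concrete, here's a quick sketch (using Python's built-in sqlite3 purely as a stand-in database; the table name, schema and batch size are invented for illustration, not the real mart layout). It shows OFFSET/LIMIT batching anchored by an ORDER BY, so every batch sees the same total order, plus the single-query streaming alternative I suggest below:

```python
import sqlite3

# Toy stand-in for a mart table; names and schema are invented for
# illustration only, not BioMart's actual layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (key INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO main (key, name) VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(1, 1001)])

def fetch_in_batches(conn, batch_size=200):
    """OFFSET/LIMIT batching anchored by ORDER BY.

    The ORDER BY pins every batch to the same total order, so
    consecutive pages can neither overlap nor leave gaps -- without
    it, each query is free to return rows in a different order.
    """
    rows, offset = [], 0
    while True:
        batch = conn.execute(
            "SELECT key, name FROM main ORDER BY key LIMIT ? OFFSET ?",
            (batch_size, offset)).fetchall()
        if not batch:
            return rows
        rows.extend(batch)
        offset += batch_size

def stream_all(conn, chunk_size=200):
    """Single-query streaming: the RDBMS executes the query once and
    the client pulls rows in chunks, so no OFFSET (and no ordering
    hazard between separate queries) is involved at all."""
    cur = conn.execute("SELECT key, name FROM main")
    while True:
        chunk = cur.fetchmany(chunk_size)
        if not chunk:
            return
        yield from chunk

rows = fetch_in_batches(conn)       # complete, ordered result set
streamed = list(stream_all(conn))   # same rows from one streamed query
```

Against a real server-side database, the streaming variant amounts to holding one open cursor for the whole export, which is where the time-outs on long-running queries would come into play.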
As for data which could trigger this: absolutely any data could, especially if the table is large or the database is a busy one. I can't speak for Postgres or MySQL as I don't really know their internals, but I know that Oracle represents tables as blocks on disk, and when it needs a row it loads the entire block containing that row into a memory cache before searching through it. When selecting from a table, Oracle first returns rows from blocks it has already loaded into the cache, then starts loading the remaining blocks. The cache is LRU: the least recently used block is evicted once the cache is full. This means that a busy database with lots of queries, or one whose cache is too small to hold the entire table, has a cache whose contents are constantly changing. When you execute an identical query you are unlikely to find the same blocks still in memory, and so you will get the rows in a different order.

I realise ORDER BY would slow down the initial mart query significantly. Still, when using OFFSET with LIMIT across multiple queries, ORDER BY is the only way to guarantee you retrieve the complete result set without any rows going missing or becoming mysteriously duplicated.

I'd like to suggest investigating streaming results from a single query. I'm not sure of the technicalities, especially when it comes to time-outs on long-running queries, but it removes the need for OFFSET and LIMIT, and it also makes the RDBMS execute the query only once, so the total set of rows should come back faster. You could still retain LIMIT for the preview screen, but that should be the only place it is needed. Just a suggestion though... I have no idea whether it would actually be feasible! :)

cheers,
Richard

2008/7/15 Arek Kasprzyk <[EMAIL PROTECTED]>:
>
> On 15-Jul-08, at 10:15 AM, Richard Holland wrote:
>
>> Olivier is right.
>>
>> BioMart has been built around the rather convenient but undocumented
>> feature (or bug?) of MySQL that identical queries will always return
>> results in the same order, thus allowing LIMIT to be used without
>> ORDER BY.
>> Postgres, being a 'clever' database, in common with Oracle, does not
>> guarantee row order because it does not process rows in the simplistic
>> sequential file access manner that MySQL does. Therefore any LIMIT
>> condition _must_ be accompanied by an ORDER BY.
>
> Richard,
> not sure if not returning rows in the expected order makes a database a
> particularly clever one ;)
> and I would not call returning rows in the expected order a bug either ;)
>
> But seriously, MySQL is no different from Postgres or Oracle in this
> respect. The MySQL manual states explicitly (as does any other RDBMS
> manual, for that matter) that row order during retrieval is not
> guaranteed. This is particularly true when, after a table is created, a
> number of rows have been deleted or re-inserted.
> In practice, however, as per our testing, row order for read-only data
> tends to be maintained to a degree that allows the batching system to
> work across platforms.
> (Olivier - it would be helpful for us to look into your data more
> closely to see how/when this order breaks, so if you could get in touch
> with us off the list we could have a look.)
> Adding 'order by' for batching is obviously a no-brainer :) - however
> this has a rather dramatic negative impact on performance, as you may
> expect - rendering batching useless, so it is not really a solution
> either.
>
> BTW, we are currently thinking about replacing batching with an
> alternative approach, but this is not yet tested so I can't really
> comment on that.
>
> a.
>
>> The only reason BioMart uses LIMIT is to try and speed up processing
>> by using an internal process called 'batching' to break up results
>> into smaller units to process.
>> Even in MySQL this is not an optimal setup, because MySQL still
>> executes and processes all the rows up to the start of the OFFSET,
>> meaning that the total number of rows MySQL has to process is on the
>> order of O(n^2).
>>
>> In other words, the only way to make BioMart play nicely with Oracle
>> and Postgres is to modify it to use ORDER BY in all queries to force
>> the RDBMS to return rows in the same order every time, therefore
>> allowing LIMIT to work in the way it does in MySQL (and thus the way
>> BioMart expects it to work).
>>
>> How this might be achieved, though, I don't know.
>>
>> cheers,
>> Richard
>>
>> 2008/7/15 Olivier Arnaiz <[EMAIL PROTECTED]>:
>>>
>>> Hmm ... no... I am talking about 'view all' and the export to file,
>>> not about the preview section.
>>> Apparently for Postgres the ORDER BY is necessary... Probably not for
>>> MySQL.
>>> Thanks
>>> Olivier
>>>
>>> On 15 Jul 08, at 11:08, Syed Haider wrote:
>>>
>>> Hi Olivier,
>>> I guess you are concerned about the preview section of results in
>>> MartView. The preview section certainly does not guarantee the order,
>>> and it does not offer you the controls to navigate through a slice of
>>> results starting from an OFFSET up to a LIMIT. This function will be
>>> offered in the next release, and in doing so we will make sure that
>>> the ordering of results is consistent too.
>>>
>>> cheers
>>> syed
>>>
>>> On Tue, 2008-07-15 at 10:57 +0200, Olivier Arnaiz wrote:
>>>
>>> Hi,
>>>
>>> Actually my table is already built around a sorted column.
>>>
>>> This is a naive question, but is there a difference between
>>> PostgreSQL and MySQL? (see this page)
>>> http://www.postgresql.org/docs/current/interactive/queries-limit.html
>>>
>>> "When using LIMIT, it is important to use an ORDER BY clause that
>>> constrains the result rows into a unique order. Otherwise you will get
>>> an unpredictable subset of the query's rows.
>>> You might be asking for the tenth through twentieth rows, but tenth
>>> through twentieth in what ordering? The ordering is unknown, unless
>>> you specified ORDER BY."
>>>
>>> Thanks for your help
>>> Olivier
>>>
>>> On 11 Jul 08, at 20:29, Richard Holland wrote:
>>>
>>> I'm guessing that you're seeing duplicate results in your output, or
>>> some that are completely missing, or a mixture of both?
>>>
>>> Both... because the results are not sorted
>>>
>>> On 12 Jul 08, at 01:15, Syed Haider wrote:
>>>
>>> Hi Olivier,
>>>
>>> If I understand you correctly, and you want to have an ORDER BY
>>> clause in the biomart-perl query builder, I would suggest building
>>> your table around the required ORDER BY columns. If you execute
>>> ORDER BY on the fly (via biomart-perl), it will be done for each
>>> batch of the query, and likewise every time a user asks for results
>>> with different attributes/filters. This would make your query
>>> response very slow. If you know the bottleneck, which I guess is the
>>> case, it is better to fix it at the table level.
>>>
>>> regards
>>> syed
>>>
>>> On Fri, 2008-07-11 at 17:22 +0200, Olivier Arnaiz wrote:
>>>
>>> Hello,
>>>
>>> I am using biomart with a postgres database (7.4).
>>> Everything seems to work perfectly with the count of the results,
>>> but I cannot obtain a correct table with an 'export' or a 'view all
>>> results'.
>>>
>>> Looking at the log4 log, I can see that the query is split into many
>>> SQL queries using LIMIT and OFFSET... so why not...
>>> but I do not see an ORDER BY clause, and this seems to be necessary
>>> to obtain correct results...
>>>
>>> SELECT main.name, [dm_table].field FROM main, dm WHERE
>>> main.key=dm.key AND dm.field =t LIMIT 200
>>> SELECT main.name, [dm_table].field FROM main, dm WHERE
>>> main.key=dm.key AND dm.field =t LIMIT 200 OFFSET 200
>>> SELECT main.name, [dm_table].field FROM main, dm WHERE
>>> main.key=dm.key AND dm.field =t LIMIT 400 OFFSET 400
>>>
>>> instead of:
>>>
>>> SELECT main.name, [dm_table].field FROM main, dm WHERE
>>> main.key=dm.key AND dm.field =t ORDER BY main.key LIMIT 200
>>> SELECT main.name, [dm_table].field FROM main, dm WHERE
>>> main.key=dm.key AND dm.field =t ORDER BY main.key LIMIT 200 OFFSET 200
>>> SELECT main.name, [dm_table].field FROM main, dm WHERE
>>> main.key=dm.key AND dm.field =t ORDER BY main.key LIMIT 400 OFFSET 400
>>>
>>> Is that a problem with Postgres versus MySQL (I do not have any
>>> experience with MySQL), or a problem with my config... or something
>>> else?
>>>
>>> Any suggestions?
>>> Thanks
>>>
>>> Olivier
>>>
>>> --
>>> Olivier Arnaiz
>>> CGM-CNRS
>>> 91198 Gif-sur-Yvette
>>> 01 69 82 43 75
>>>
>>> ParameciumDB:
>>> http://paramecium.cgm.cnrs-gif.fr/
>>>
>>> --
>>> ======================================
>>> Syed Haider.
>>> EMBL-European Bioinformatics Institute
>>> Wellcome Trust Genome Campus, Hinxton,
>>> Cambridge CB10 1SD, UK.
>>> ======================================
