Re: [mart-dev] uniqueRowsOnly(1) vs. DISTINCT SQL

Richard Holland Fri, 13 Feb 2009 03:57:16 -0800

SELECT DISTINCT only works if all data is coming from the same database,
and can thus be selected using a single SELECT statement.

BioMart is set up to be able to draw in and merge data from multiple
databases. Therefore a single SELECT statement to do the job is not
always possible, as SQL (well MySQL's version of SQL anyway) cannot do
cross-database joins, meaning that BioMart must run separate SQL
queries, one per database, and calculate the joins itself. By doing
this, it then also has to make its own implementations for other things
like unique rows on the combined results of the various queries.

The mechanism for calculating unique rows is pretty simple - each row of
results is calculated and a hashcode computed on the finished row. If
the hashcode matches any previous hashcode generated for the current
query, the row is a duplicate and is not outputted, otherwise it is
outputted and the hashcode added to the list of previously seen
hashcodes to check future rows against.

I agree though that it would be nice if in the case where all data is
coming from a single database with a single SELECT statement with the
unique rows option ticked, BioMart could maybe optimise a bit by
inserting SELECT DISTINCT. Maybe that's a good compromise, but I don't
know how technically feasible it might be?

Lastly, one thing to bear in mind when using BioMart and SELECT
DISTINCT, at least in MySQL, SELECT DISTINCT actually slows down
preview-style queries using LIMIT where the user only wants the first 10
rows or so, as it needs to compute the entire result set first before
outputting it. Without DISTINCT, it just fires out rows as it finds
them, meaning that the LIMIT query completes much faster.

cheers,
Richard

Kolja Henckel wrote:
> Hello there!
> 
> I use the BioMart Perl API and just figured out some "problem":
> 
> When using the option
> uniqueRowsOnly(1)
> the API fetches all rows matching the query and afterwards deletes the
> ones that are too much (so that only unique ones are returned or printed).
> 
> My problem is that I have about 1 mio. datarows and only 18 different
> values in the desired Attribute.
> This means that the query takes about 10 minutes for the result of 18
> values.
> Is it possible (or planned, or already implemented somewhere, somehow?)
> to implement the uniqueRowsOnly-option using the SELECT DISTINCT option
> of SQL?
> In this case the query should perform within seconds...
> 
> Cheers, Kolja
> 
> 
> PS: thanks for the great mart, anyway :)
> 

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: [email protected]
http://www.eaglegenomics.com/

Re: [mart-dev] uniqueRowsOnly(1) vs. DISTINCT SQL

Reply via email to