On 14 Dec 2006, at 10:24, Arek Kasprzyk wrote:

BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 10240,10240 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 20480,20480


if you just run query with one batch, the expected ensembl id appears at row number 21923. However when you try to find it in the batch 20480,20480 it is not there. On further investigation it appears that if you request the last batch with the end limit higher than the total row number to be returned mysql decides to change the ordering (which until now has kept properly) and 'looses' all identifiers from the last batch. The good news is however that you can simply fix it by making sure that the last batch contains no more
than the remaining total row number e.i compare the two requests below:

bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') limit 20480,10726" ensembl_mart_41 | grep ENSG00000000003 bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') limit 20480,10725" ensembl_mart_41 | grep ENSG00000000003
ENSG00000000003



In short, we'll try to see if it is possible to alter our batching logic such that
it works correctly with mysql



and one more 'interesting' thing :)

please note that when the query is run without a filter, the ordering remains correct irrespectively if the last batch size is more than the total number of remaining rows:

bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main limit 20480,10725" ensembl_mart_41 | moregene_stable_id
ENSG00000198114
ENSG00000198117
ENSG00000198118
ENSG00000198119
ENSG00000198121

is identical to:

bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main limit 20480,10726" ensembl_mart_41 | more
gene_stable_id
ENSG00000198114
ENSG00000198117
ENSG00000198118
ENSG00000198119
ENSG00000198121



However the addition of this filter breaks the ordering:

bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') limit 20480,10726" ensembl_mart_41 | more
gene_stable_id
ENSG00000150456
ENSG00000106077
ENSG00000118976
ENSG00000117020
ENSG00000199435

and in addition occasionally adds some empty rows at the beginning of the query (which you can't see here)


Therefore removing this filter as Steffen noticed 'fixes' the problem
We are going to check if mysql 5 behaves properly in this respect in which case upgrading to a new version should sort out the issue, if not we'll look if we can compensate for it on our side


a.





------------------------------------------------------------------------ -------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
------------------------------------------------------------------------ -------



Reply via email to