On 14 Dec 2006, at 10:24, Arek Kasprzyk wrote:
BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE
(main.gene_chrom_start >= '0') LIMIT 10240,10240
BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE
(main.gene_chrom_start >= '0') LIMIT 20480,20480
if you just run query with one batch, the expected ensembl id appears
at row number 21923. However
when you try to find it in the batch 20480,20480 it is not there. On
further investigation it appears that
if you request the last batch with the end limit higher than the total
row number to be returned mysql decides
to change the ordering (which until now has kept properly) and
'looses' all identifiers from the last batch.
The good news is however that you can simply fix it by making sure
that the last batch contains no more
than the remaining total row number e.i compare the two requests below:
bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE
(main.gene_chrom_start >= '0') limit 20480,10726" ensembl_mart_41 |
grep ENSG00000000003
bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE
(main.gene_chrom_start >= '0') limit 20480,10725" ensembl_mart_41 |
grep ENSG00000000003
ENSG00000000003
In short, we'll try to see if it is possible to alter our batching
logic such that
it works correctly with mysql
and one more 'interesting' thing :)
please note that when the query is run without a filter, the ordering
remains
correct irrespectively if the last batch size is more than the total
number of remaining rows:
bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main limit
20480,10725" ensembl_mart_41 | moregene_stable_id
ENSG00000198114
ENSG00000198117
ENSG00000198118
ENSG00000198119
ENSG00000198121
is identical to:
bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main limit
20480,10726" ensembl_mart_41 | more
gene_stable_id
ENSG00000198114
ENSG00000198117
ENSG00000198118
ENSG00000198119
ENSG00000198121
However the addition of this filter breaks the ordering:
bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT
main.gene_stable_id FROM
ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE
(main.gene_chrom_start >= '0') limit 20480,10726" ensembl_mart_41 |
more
gene_stable_id
ENSG00000150456
ENSG00000106077
ENSG00000118976
ENSG00000117020
ENSG00000199435
and in addition occasionally adds some empty rows at the beginning of
the query (which you can't see here)
Therefore removing this filter as Steffen noticed 'fixes' the problem
We are going to check if mysql 5 behaves properly in this respect in
which case upgrading to a new
version should sort out the issue, if not we'll look if we can
compensate for it on our side
a.
------------------------------------------------------------------------
-------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
------------------------------------------------------------------------
-------