On 14 Dec 2006, at 09:36, michael watson ((IAH-C)) wrote:

Note that the filter in the first query is:

<ValueFilter name = 'start' value = '>=0' />

Not

<ValueFilter name = 'start' value = '0' />

I think the latter would return nothing at all....

-----Original Message-----
From: Durinck, Steffen (NIH/NCI) [F] [mailto:[EMAIL PROTECTED]
Sent: 14 December 2006 00:55
To: Arek Kasprzyk
Cc: michael watson (IAH-C)
Subject: RE: Confusion over biomart

Hi Arek,

Here are the XML queries:

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query
virtualSchemaName = 'default' count = '0' softwareVersion = '0.5'
requestId= 'biomaRt'> <Dataset name = 'hsapiens_gene_ensembl'><Attribute
name = 'ensembl_gene_id'/><ValueFilter name = 'start' value = '0'
/></Dataset></Query>

this query does not return ENSG00000000003

while the adding affy_hg_u95av2 (the XML query below) as attribute does
return this gene id.

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query
virtualSchemaName = 'default' count = '0' softwareVersion = '0.5'
requestId= 'biomaRt'> <Dataset name = 'hsapiens_gene_ensembl'><Attribute
name = 'ensembl_gene_id'/><Attribute name =
'affy_hg_u95av2'/><ValueFilter name = 'start' value = '0'
/></Dataset></Query>

Note that the filter in this example is not really useful and maybe is
the cause of this behaviour.
When I remove this filter and just query for all Ensembl gene ids (XML
query below) I do get ENSG00000000003

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query
virtualSchemaName = 'default' count = '0' softwareVersion = '0.5'
requestId= 'biomaRt'> <Dataset name = 'hsapiens_gene_ensembl'><Attribute
name = 'ensembl_gene_id'/></Dataset></Query>

best,
Steffen



Hi guys,
I did a little investigation and it seems that this problems boils down to
some peculiar mysql batching behaviour (As you both say the addition
of the '>0' filter does not make sense there without a chromosome anyway
but this is not the culprit). I am cc'ing dev as this maybe of general interest.

The requested query resolves to a server SQL batching requests as follows:

BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 10 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 10,10 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 20,20 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 40,40 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 80,80 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 160,160 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 320,320 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 640,640 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 1280,1280 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 2560,2560 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 5120,5120 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 10240,10240 BioMart.Dataset.TableSet:731:WARN> QUERY SQL: SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') LIMIT 20480,20480


if you just run query with one batch, the expected ensembl id appears at row number 21923. However when you try to find it in the batch 20480,20480 it is not there. On further investigation it appears that if you request the last batch with the end limit higher than the total row number to be returned mysql decides to change the ordering (which until now has kept properly) and 'looses' all identifiers from the last batch. The good news is however that you can simply fix it by making sure that the last batch contains no more
than the remaining total row number e.i compare the two requests below:

bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') limit 20480,10726" ensembl_mart_41 | grep ENSG00000000003 bigmac: ~[arek] mysql -uanonymous -hensembldb.ensembl.org -e "SELECT main.gene_stable_id FROM ensembl_mart_41.hsapiens_gene_ensembl__gene__main main WHERE (main.gene_chrom_start >= '0') limit 20480,10725" ensembl_mart_41 | grep ENSG00000000003
ENSG00000000003



In short, we'll try to see if it is possible to alter our batching logic such that
it works correctly with mysql
a.








-----Original Message-----
From: Arek Kasprzyk [mailto:[EMAIL PROTECTED]
Sent: Wed 12/13/2006 1:25 PM
To: Durinck, Steffen (NIH/NCI) [F]
Cc: michael watson (IAH-C)
Subject: Re: Confusion over biomart


On 13 Dec 2006, at 15:59, Steffen Durinck wrote:

2)  Why does the webservice not return

ENSG00000000003

when you do:

atb <- c("ensembl_gene_id")
alldata <- getBM(attributes=atb, filters="start", values=">0", mart)


But does it return this Ensembl identifier when adding e.g. the
affy_hg_u95av attribute

atb <- c("ensembl_gene_id", "affy_hg_u95av")

alldata <- getBM(attributes=atb, filters="start", values=">0", mart)


I don't know how this is possible and hope Arek can help clarifying
this.


guys - can you send me an xml for that? I'll have a look

----------------------------------------------------------------------- -

-------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
----------------------------------------------------------------------- -

-------






------------------------------------------------------------------------ -------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
------------------------------------------------------------------------ -------



Reply via email to