On 23 Aug 2006, at 15:48, Tom Oinn wrote:

Damian Smedley wrote:

How about a clear policy as to what forms of access are legal - a sensible service interface suggests that bulk querying is legitimate surely?
I don't want to ban people doing bulk querying though putting all the IDs into one query is obviously much more efficient.

I'm not sure it's always equivalent though? I agree it's a problem, obviously if the server's going down you need to do something to resolve that but hopefully we can work out some kind of best practice / code change in Taverna to help out as well.

David Withers is our developer for the biomart side of things, I now know relatively little of how it works internally but I believe he's on the list as well :)

Cheers,

Tom


ok, some more details about this problem. I hope we can work this out together, as we do not want to ban anybody from doing anything but simply to optimize the access so that it works in
an optimal way for taverna as well as for us.
(apologies for the massive cross-posting but I am not sure which lists all the relevant people are subscribed to :)) please feel free to redirect or narrow down this discussion, or even reject it if you do not recognize the taverna
request pattern :)

ok, here it goes:
The BioMart central server went down twice after a series of over 100 000 requests coming from a single source over a relatively short period of time. After analyzing the access logs and contacting the guys who were firing those requests, it seems that they originated from taverna workflows.

the requests came in the following pattern:


- - [18/Aug/2006:22:12:33 +0100] "GET /biomart/martservice?type=datasets&mart=sequence HTTP/1.1" 200 1503
- - [18/Aug/2006:22:12:33 +0100] "GET /biomart/martservice?type=datasets&mart=sequence HTTP/1.1" 200 1503
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=snp HTTP/1.1" 200 640
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=snp HTTP/1.1" 200 640
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=vega HTTP/1.1" 200 343
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=vega HTTP/1.1" 200 343
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=uniprot HTTP/1.1" 200 490
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=uniprot HTTP/1.1" 200 490
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=msd HTTP/1.1" 200 74
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=msd HTTP/1.1" 200 74
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=wormbase HTTP/1.1" 200 336
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=wormbase HTTP/1.1" 200 336
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=configuration&dataset=hsapiens_genomic_sequence&virtualschema=default HTTP/1.1" 200 9161
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=configuration&dataset=hsapiens_genomic_sequence&virtualschema=default HTTP/1.1" 200 9161
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?query=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%0D%0A%3C%21DOCTYPE+Query%3E%0D%0A%3CQuery+virtualSchemaName%3D%22default%22+count%3D%220%22%3E%3CDataset+name%3D%22hsapiens_gene_ensembl_structure%22%3E%3CAttribute+name%3D%223utr_start%22+%2F%3E%3CAttribute+name%3D%223utr_end%22+%2F%3E%3CAttribute+name%3D%22gene_stable_id_v%22+%2F%3E%3CAttribute+name%3D%22transcript_stable_id%22+%2F%3E%3CAttribute+name%3D%22str_chrom_name%22+%2F%3E%3C%2FDataset%3E%3CDataset+name%3D%22hsapiens_genomic_sequence%22%3E%3CAttribute+name%3D%223utr%22+%2F%3E%3C%2FDataset%3E%3CDataset+name%3D%22hsapiens_gene_ensembl%22%3E%3CFilter+name%3D%22ensembl_transcript_id%22+value%3D%22ENST00000358646%22+%2F%3E%3C%2FDataset%3E%3CLinks+source%3D%22hsapiens_gene_ensembl%22+target%3D%22hsapiens_gene_ensembl_structure%22+defaultLink%3D%22hsapiens_internal_transcript_id%22+%2F%3E%3CLinks+source%3D%22hsapiens_gene_ensembl_structure%22+target%3D%22hsapiens_genomic_sequence%22+defaultLink%3D%223utr%22+%2F%3E%3C%2FQuery%3E%0D%0A HTTP/1.1" 200 5
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?query=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF-8%22%3F%3E%0D%0A%3C%21DOCTYPE+Query%3E%0D%0A%3CQuery+virtualSchemaName%3D%22default%22+count%3D%220%22%3E%3CDataset+name%3D%22hsapiens_gene_ensembl_structure%22%3E%3CAttribute+name%3D%223utr_start%22+%2F%3E%3CAttribute+name%3D%223utr_end%22+%2F%3E%3CAttribute+name%3D%22gene_stable_id_v%22+%2F%3E%3CAttribute+name%3D%22transcript_stable_id%22+%2F%3E%3CAttribute+name%3D%22str_chrom_name%22+%2F%3E%3C%2FDataset%3E%3CDataset+name%3D%22hsapiens_genomic_sequence%22%3E%3CAttribute+name%3D%223utr%22+%2F%3E%3C%2FDataset%3E%3CDataset+name%3D%22hsapiens_gene_ensembl%22%3E%3CFilter+name%3D%22ensembl_transcript_id%22+value%3D%22ENST00000358646%22+%2F%3E%3C%2FDataset%3E%3CLinks+source%3D%22hsapiens_gene_ensembl%22+target%3D%22hsapiens_gene_ensembl_structure%22+defaultLink%3D%22hsapiens_internal_transcript_id%22+%2F%3E%3CLinks+source%3D%22hsapiens_gene_ensembl_structure%22+target%3D%22hsapiens_genomic_sequence%22+defaultLink%3D%223utr%22+%2F%3E%3C%2FQuery%3E%0D%0A HTTP/1.1" 200 5
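
For anyone who wants to check their own logs for this pattern, here is a rough sketch of how the requests could be tallied by kind. It is only an illustration: the function names request_kind and tally_requests are made up for this example, and the sample lines are shortened copies of the entries above (the query string in the last one is cut short on purpose).

```perl
use strict;
use warnings;

# Classify one access-log line by the kind of martservice request it carries.
# Returns 'datasets', 'configuration', 'query', or undef for anything else.
sub request_kind {
    my ($line) = @_;
    return $1      if $line =~ /martservice\?\s*type=(\w+)/;
    return 'query' if $line =~ /martservice\?\s*query=/;
    return;
}

# Count the requests of each kind across a list of log lines.
sub tally_requests {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        my $kind = request_kind($line);
        next unless defined $kind;
        $count{$kind}++;
    }
    return \%count;
}

# Shortened sample lines in the format of the log above (illustration only).
my @sample = (
    '- - [18/Aug/2006:22:12:33 +0100] "GET /biomart/martservice?type=datasets&mart=sequence HTTP/1.1" 200 1503',
    '- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=configuration&dataset=hsapiens_genomic_sequence&virtualschema=default HTTP/1.1" 200 9161',
    '- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?query=%3C%3Fxml HTTP/1.1" 200 5',
);

my $count = tally_requests(@sample);
printf "%-15s %d\n", $_, $count->{$_} for sort keys %$count;
```

Run over a real access log, a high ratio of datasets/configuration requests to query requests would suggest that metadata is being re-fetched before every query.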



after further analyzing the logs it seems that those users wanted sequences for ~300 ensembl transcripts. This in itself is a perfectly valid and sensible use case. However, what is unclear to me is why it is necessary to request each sequence individually and, more importantly, why for each query the software (taverna?) needs to go through a full configuration (as above). surely this could be done once and then be followed either by individual queries if necessary or, better still, by fewer queries doing requests in batches. This is normally a lightweight and sensible request when done properly. For comparison I enclose below an example of exactly the same usage but sent as a single query, and a small perl script which quickly and harmlessly retrieves it from our web-service so you can run it and compare.
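
To make the two suggestions concrete (fetch the configuration once, send the IDs in batches), here is a minimal sketch that does no network access at all. The names batch_ids, filter_dataset and fetch_once are invented for this illustration, the IDs are placeholders rather than real transcript IDs, and it assumes the service accepts a comma-separated list in the filter's value attribute.

```perl
use strict;
use warnings;

# Split a list of IDs into batches of at most $size entries each.
sub batch_ids {
    my ($size, @ids) = @_;
    die "batch size must be positive\n" unless $size > 0;
    my @batches;
    while (@ids) {
        push @batches, [ splice(@ids, 0, $size) ];
    }
    return @batches;
}

# Build the <Dataset> element that carries the filter for one batch.
# ASSUMPTION: a comma-separated list is accepted in the "value" attribute.
sub filter_dataset {
    my (@ids) = @_;
    my $value = join ',', @ids;
    return qq{<Dataset name = "hsapiens_gene_ensembl">\n}
         . qq{    <Filter name = "ensembl_transcript_id" value = "$value"/>\n}
         . qq{</Dataset>\n};
}

# Fetch something once per key and reuse it afterwards, so that metadata
# such as a dataset configuration is not re-requested before every query.
# $fetch is whatever code actually talks to the server (hypothetical here).
my %cache;
sub fetch_once {
    my ($key, $fetch) = @_;
    $cache{$key} = $fetch->($key) unless exists $cache{$key};
    return $cache{$key};
}

# Placeholder IDs for illustration only; not taken from the logs.
my @ids     = map { sprintf 'ID%03d', $_ } 1 .. 300;
my @batches = batch_ids(100, @ids);
printf "%d IDs -> %d requests\n", scalar @ids, scalar @batches;
print filter_dataset(@{ $batches[0] }[0 .. 2]);
```

The batch size of 100 is arbitrary; the point is only that the number of requests grows with the number of batches rather than with the number of IDs.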



any advice on how to optimize this is greatly appreciated
a.




an example of 'harmless' code is below, please run it as follows:

perl webExample.pl SequenceQuery.xml



webExample.pl
-------------------------------

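use strict;
use warnings;
use LWP::UserAgent;

# Read the query XML from the file named on the command line.
open(my $fh, '<', $ARGV[0]) || die("need a Query xml file name");

my $xml;
while (<$fh>) {
    $xml .= $_;
}
close($fh);

# Send the whole query to the web service in a single POST request.
my $path = "http://dev.biomart.org/biomart/martservice?";
my $request = HTTP::Request->new("POST", $path, HTTP::Headers->new(), 'query=' . $xml . "\n");
my $ua = LWP::UserAgent->new;

# Hand the response to the callback in chunks (size hint: 1000 bytes)
# and print each chunk as it arrives.
$ua->request($request,
             sub {
                 my ($data, $response) = @_;
                 if ($response->is_success) {
                     print "$data";
                 }
                 else {
                     warn("Problems with the web server: " . $response->status_line);
                 }
             }, 1000);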





SequenceQuery.xml
----------------------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" count = "0" >
          <Dataset name = "hsapiens_gene_ensembl">
<Filter name = "ensembl_transcript_id" value
          </Dataset>

          <Links source = "hsapiens_gene_ensembl"
                 target = "hsapiens_gene_ensembl_structure"
                 defaultLink = "hsapiens_internal_transcript_id" />

          <Dataset name = "hsapiens_gene_ensembl_structure">
                 <Attribute name = "gene_stable_id"/>
                 <Attribute name = "str_chrom_name"/>
                 <Attribute name = "biotype"/>
          </Dataset>

          <Links source = "hsapiens_gene_ensembl_structure"
                 target = "hsapiens_genomic_sequence"
                 defaultLink = "cdna" />

          <Dataset name = "hsapiens_genomic_sequence">
              <Attribute name = "cdna"/>
          </Dataset>
</Query>










-------------------------------------------------------------------------------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
-------------------------------------------------------------------------------


