On 23 Aug 2006, at 15:48, Tom Oinn wrote:
Damian Smedley wrote:
How about a clear policy as to what forms of access are legal - a
sensible service interface suggests that bulk querying is legitimate
surely?
I don't want to ban people doing bulk querying though putting all the
IDs into one query is obviously much more efficient.
I'm not sure it's always equivalent though? I agree it's a problem,
obviously if the server's going down you need to do something to
resolve that but hopefully we can work out some kind of best practice
/ code change in Taverna to help out as well.
David Withers is our developer for the biomart side of things, I now
know relatively little of how it works internally but I believe he's
on the list as well :)
Cheers,
Tom
ok, some more details about this problem. I hope we can work this out
together, as we do not want to ban anybody from doing anything but simply
to optimize the access so that it works well for Taverna as well as for us.
(apologies for the massive cross-posting, but I am not sure which lists all
the relevant people are subscribed to :))
Please feel free to redirect or narrow down this discussion, or even to
reject it if you do not recognize the Taverna request pattern :)
ok, here it goes:
The BioMart central server went down twice after a series of over 100 000
requests came from a single source over a relatively short period of time.
After analyzing the access logs and contacting the guys who were firing
those requests, it seems that they originated from Taverna workflows.
The requests came in the following pattern:
- - [18/Aug/2006:22:12:33 +0100] "GET /biomart/martservice?type=datasets&mart=sequence HTTP/1.1" 200 1503
- - [18/Aug/2006:22:12:33 +0100] "GET /biomart/martservice?type=datasets&mart=sequence HTTP/1.1" 200 1503
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=snp HTTP/1.1" 200 640
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=snp HTTP/1.1" 200 640
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=vega HTTP/1.1" 200 343
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=vega HTTP/1.1" 200 343
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=uniprot HTTP/1.1" 200 490
- - [18/Aug/2006:22:12:34 +0100] "GET /biomart/martservice?type=datasets&mart=uniprot HTTP/1.1" 200 490
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=msd HTTP/1.1" 200 74
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=msd HTTP/1.1" 200 74
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=wormbase HTTP/1.1" 200 336
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=datasets&mart=wormbase HTTP/1.1" 200 336
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=configuration&dataset=hsapiens_genomic_sequence&virtualschema=default HTTP/1.1" 200 9161
- - [18/Aug/2006:22:12:35 +0100] "GET /biomart/martservice?type=configuration&dataset=hsapiens_genomic_sequence&virtualschema=default HTTP/1.1" 200 9161
- - [18/Aug/2006:22:12:35 +0100] "GET
/biomart/martservice?
query=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF
-8%22%3F%3E%0D%0A%3C%21DOCTYPE+Query%3E%0D%0A%3CQuery+virtualSchemaName%
3D%22default%22+count%3D%220%22%3E%3CDataset+name%3D%22hsapiens_gene_ens
embl_structure%22%3E%3CAttribute+name%3D%223utr_start%22+%2F%3E%3CAttrib
ute+name%3D%223utr_end%22+%2F%3E%3CAttribute+name%3D%22gene_stable_id_v%
22+%2F%3E%3CAttribute+name%3D%22transcript_stable_id%22+%2F%3E%3CAttribu
te+name%3D%22str_chrom_name%22+%2F%3E%3C%2FDataset%3E%3CDataset+name%3D%
22hsapiens_genomic_sequence%22%3E%3CAttribute+name%3D%223utr%22+%2F%3E%3
C%2FDataset%3E%3CDataset+name%3D%22hsapiens_gene_ensembl%22%3E%3CFilter+
name%3D%22ensembl_transcript_id%22+value%3D%22ENST00000358646%22+%2F%3E%
3C%2FDataset%3E%3CLinks+source%3D%22hsapiens_gene_ensembl%22+target%3D%2
2hsapiens_gene_ensembl_structure%22+defaultLink%3D%22hsapiens_internal_t
ranscript_id%22+%2F%3E%3CLinks+source%3D%22hsapiens_gene_ensembl_structu
re%22+target%3D%22hsapiens_genomic_sequence%22+defaultLink%3D%223utr%22+
%2F%3E%3C%2FQuery%3E%0D%0A HTTP/1.1" 200 5
- - [18/Aug/2006:22:12:35 +0100] "GET
/biomart/martservice?
query=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22UTF
-8%22%3F%3E%0D%0A%3C%21DOCTYPE+Query%3E%0D%0A%3CQuery+virtualSchemaName%
3D%22default%22+count%3D%220%22%3E%3CDataset+name%3D%22hsapiens_gene_ens
embl_structure%22%3E%3CAttribute+name%3D%223utr_start%22+%2F%3E%3CAttrib
ute+name%3D%223utr_end%22+%2F%3E%3CAttribute+name%3D%22gene_stable_id_v%
22+%2F%3E%3CAttribute+name%3D%22transcript_stable_id%22+%2F%3E%3CAttribu
te+name%3D%22str_chrom_name%22+%2F%3E%3C%2FDataset%3E%3CDataset+name%3D%
22hsapiens_genomic_sequence%22%3E%3CAttribute+name%3D%223utr%22+%2F%3E%3
C%2FDataset%3E%3CDataset+name%3D%22hsapiens_gene_ensembl%22%3E%3CFilter+
name%3D%22ensembl_transcript_id%22+value%3D%22ENST00000358646%22+%2F%3E%
3C%2FDataset%3E%3CLinks+source%3D%22hsapiens_gene_ensembl%22+target%3D%2
2hsapiens_gene_ensembl_structure%22+defaultLink%3D%22hsapiens_internal_t
ranscript_id%22+%2F%3E%3CLinks+source%3D%22hsapiens_gene_ensembl_structu
re%22+target%3D%22hsapiens_genomic_sequence%22+defaultLink%3D%223utr%22+
%2F%3E%3C%2FQuery%3E%0D%0A HTTP/1.1" 200 5
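For anyone who wants to repeat this kind of log analysis, here is a minimal Python sketch (not the tooling we actually used) that classifies logged request paths by kind and URL-decodes the `query` parameter back into readable XML. The function names are made up for illustration, and the last sample path is a shortened, hypothetical query rather than one copied from the logs.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def classify(request_path):
    """Return 'query' for query requests, else the value of the 'type' parameter."""
    params = parse_qs(urlsplit(request_path).query)
    if "query" in params:
        return "query"
    return params.get("type", ["other"])[0]


def decoded_query(request_path):
    """Return the URL-decoded XML carried in the 'query' parameter, or None."""
    values = parse_qs(urlsplit(request_path).query).get("query")
    return values[0] if values else None


# Sample paths in the shape seen in the access log; the last one is a
# shortened, hypothetical query used only to keep the example small.
paths = [
    "/biomart/martservice?type=datasets&mart=sequence",
    "/biomart/martservice?type=datasets&mart=snp",
    "/biomart/martservice?type=configuration"
    "&dataset=hsapiens_genomic_sequence&virtualschema=default",
    "/biomart/martservice?query=%3CQuery+count%3D%220%22%3E%3C%2FQuery%3E",
]

counts = Counter(classify(p) for p in paths)
print(counts)
print(decoded_query(paths[-1]))
```

Run over a full log, the counter gives a rough idea of how many registry and configuration requests accompany each actual query.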
After further analyzing the logs, it seems that those users wanted
sequences for ~300 Ensembl transcripts. This in itself is a perfectly
valid and sensible use case.
However, what is unclear to me is why it is necessary to request each
sequence individually and, more importantly, why for each query the
software (Taverna?) needs to go through a full configuration (as above).
Surely this could be done once and then be followed either by individual
queries if necessary or, better still, by fewer queries doing the requests
in batches. This is normally a lightweight and sensible request when done
properly. For comparison, I enclose below an example of exactly the same
usage but sent as a single query, and a small Perl script which quickly and
harmlessly retrieves it from our web service, so you can run it and compare.
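To make the suggestion concrete, here is a rough Python sketch of "configure once, then query in batches". It is only an illustration under assumptions: `get_configuration` is a stand-in that records calls instead of contacting the server, and `plan_queries`, `batches` and the batch size are hypothetical names and values, not anything taken from Taverna or BioMart code.

```python
from functools import lru_cache

CONFIG_FETCHES = []  # records each simulated configuration request


@lru_cache(maxsize=None)
def get_configuration(dataset):
    """Stand-in for the type=configuration request; runs once per dataset."""
    CONFIG_FETCHES.append(dataset)
    return {"dataset": dataset}


def batches(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def plan_queries(ids, dataset, batch_size):
    """Return one comma-separated filter value per batch of IDs."""
    plan = []
    for chunk in batches(ids, batch_size):
        get_configuration(dataset)  # served from the cache after the first call
        plan.append(",".join(chunk))
    return plan


ids = ["ENST00000005198", "ENST00000169298", "ENST00000172229",
       "ENST00000195173", "ENST00000216019"]
plan = plan_queries(ids, "hsapiens_gene_ensembl", batch_size=2)
print(len(plan), "queries,", len(CONFIG_FETCHES), "configuration fetch")
```

With five IDs and a batch size of two, this plans three queries and a single configuration fetch, instead of five queries each preceded by its own configuration round.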
any advice on how to optimize this is greatly appreciated
a.
The example of 'harmless' code is below; please run it as follows:
perl webExample.pl SequenceQuery.xml
webExample.pl
-------------------------------
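use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Headers;

# Read the query XML from the file named on the command line.
open(my $fh, '<', $ARGV[0]) || die("need a Query xml file name");
my $xml;
while (<$fh>) {
    $xml .= $_;
}
close($fh);

my $path = "http://dev.biomart.org/biomart/martservice?";

# Send the whole query as the body of a single POST request.
my $request = HTTP::Request->new("POST", $path, HTTP::Headers->new(),
                                 'query=' . $xml . "\n");
my $ua = LWP::UserAgent->new;

# Print the result as it arrives, in chunks of about 1000 bytes.
my $response = $ua->request($request,
    sub {
        my ($data, $chunk_response) = @_;
        print $data if $chunk_response->is_success;
    }, 1000);

# Check the final response as well, in case the callback is not
# invoked for error responses.
warn("Problems with the web server: " . $response->status_line)
    unless $response->is_success;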
SequenceQuery.xml
----------------------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName = "default" count = "0" >
<Dataset name = "hsapiens_gene_ensembl">
<Filter name = "ensembl_transcript_id" value =
"ENST00000005198,ENST00000169298,ENST00000172229,ENST00000195173,ENST000
00216019,ENST00000216445,ENST00000216927,ENST00000217170,ENST00000218032
,ENST00000220325,ENST00000221996,ENST00000225603,ENST00000225719,ENST000
00225726,ENST00000226760,ENST00000227756,ENST00000232744,ENST00000236228
,ENST00000238667,ENST00000238823,ENST00000242455,ENST00000245541,ENST000
00246006,ENST00000251268,ENST00000251343,ENST00000252015,ENST00000254066
,ENST00000254998,ENST00000255017,ENST00000255613,ENST00000256183,ENST000
00256365,ENST00000257536,ENST00000258301,ENST00000259808,ENST00000260058
,ENST00000260357,ENST00000261249,ENST00000261312,ENST00000261834,ENST000
00262419,ENST00000262551,ENST00000263066,ENST00000263762,ENST00000263854
,ENST00000264033,ENST00000264444,ENST00000264959,ENST00000266304,ENST000
00267569,ENST00000269051,ENST00000270201,ENST00000271579,ENST00000272217
,ENST00000272442,ENST00000272444,ENST00000272462,ENST00000272748,ENST000
00273074,ENST00000273342,ENST00000273347,ENST00000274811,ENST00000275184
,ENST00000278916,ENST00000279463,ENST00000280734,ENST00000282990,ENST000
00285518,ENST00000285908,ENST00000287169,ENST00000287675,ENST00000288221
,ENST00000288304,ENST00000290921,ENST00000293686,ENST00000294644,ENST000
00295210,ENST00000295500,ENST00000295878,ENST00000295902,ENST00000296084
,ENST00000296288,ENST00000297922,ENST00000298130,ENST00000298229,ENST000
00298684,ENST00000298687,ENST00000299163,ENST00000299335,ENST00000299367
,ENST00000301633,ENST00000302347,ENST00000303415,ENST00000304330,ENST000
00304401,ENST00000305124,ENST00000307824,ENST00000308521,ENST00000308936
,ENST00000309009,ENST00000309050,ENST00000309117,ENST00000309186,ENST000
00310492,ENST00000310827,ENST00000311086,ENST00000311127,ENST00000311538
,ENST00000312108,ENST00000312438,ENST00000314013,ENST00000314963,ENST000
00315015,ENST00000315032,ENST00000316071,ENST00000318201,ENST00000318345
,ENST00000318950,ENST00000320362,ENST00000320683,ENST00000320912,ENST000
00321351,ENST00000321675,ENST00000323456,ENST00000323944,ENST00000324093
,ENST00000324106,ENST00000325630,ENST00000325811,ENST00000325863,ENST000
00326805,ENST00000327200,ENST00000327956,ENST00000328839,ENST00000328933
,ENST00000329333,ENST00000329454,ENST00000330871,ENST00000331163,ENST000
00331285,ENST00000331396,ENST00000331662,ENST00000331814,ENST00000332362
,ENST00000332503,ENST00000332729,ENST00000332859,ENST00000333039,ENST000
00334456,ENST00000335157,ENST00000335509,ENST00000335585,ENST00000335725
,ENST00000336079,ENST00000336219,ENST00000337190,ENST00000338197,ENST000
00338876,ENST00000339174,ENST00000339700,ENST00000340067,ENST00000340149
,ENST00000340366,ENST00000340539,ENST00000340722,ENST00000341280,ENST000
00341810,ENST00000342745,ENST00000343674,ENST00000343780,ENST00000343999
,ENST00000344021,ENST00000344103,ENST00000344643,ENST00000344973,ENST000
00345097,ENST00000349767,ENST00000350051,ENST00000350792,ENST00000351904
,ENST00000352360,ENST00000352367,ENST00000353194,ENST00000354245,ENST000
00354360,ENST00000354749,ENST00000355139,ENST00000355329,ENST00000355666
,ENST00000355771,ENST00000355905,ENST00000356438,ENST00000356444,ENST000
00356978,ENST00000357012,ENST00000357044,ENST00000357289,ENST00000357382
,ENST00000357591,ENST00000357599,ENST00000358620,ENST00000358646,ENST000
00358870,ENST00000359139,ENST00000359704,ENST00000360108,ENST00000360463
,ENST00000360612,ENST00000361058,ENST00000361204,ENST00000361276,ENST000
00361373,ENST00000361507,ENST00000361834,ENST00000362024,ENST00000362068
,ENST00000367281,ENST00000367558,ENST00000367567,ENST00000367570,ENST000
00367573,ENST00000367580,ENST00000367581,ENST00000368081,ENST00000368089
,ENST00000368414,ENST00000368415,ENST00000368704,ENST00000368705,ENST000
00368706,ENST00000368794,ENST00000369838,ENST00000370176,ENST00000370184
,ENST00000370185,ENST00000370194,ENST00000370195,ENST00000370197,ENST000
00370198,ENST00000370312,ENST00000370313,ENST00000370316,ENST00000370317
,ENST00000370694,ENST00000370695,ENST00000370696,ENST00000370698,ENST000
00370701,ENST00000370728,ENST00000372058,ENST00000372061,ENST00000372308
,ENST00000372315,ENST00000373620,ENST00000373624,ENST00000373960,ENST000
00374478,ENST00000374481,ENST00000374551,ENST00000374673,ENST00000374676
,ENST00000374948,ENST00000374983,ENST00000375261,ENST00000375561,ENST000
00376352,ENST00000376450,ENST00000376504,ENST00000376678,ENST00000376790
,ENST00000376791,ENST00000376792,ENST00000376793,ENST00000376963,ENST000
00377047,ENST00000377091,ENST00000377094,ENST00000377096,ENST00000377100
,ENST00000377103,ENST00000377503,ENST00000377523,ENST00000377663,ENST000
00377669,ENST00000377953,ENST00000378073,ENST00000378124,ENST00000379156
,ENST00000379599,ENST00000379600,ENST00000380409,ENST00000380864,ENST000
00380981,ENST00000381175,ENST00000381178,ENST00000381205,ENST00000381250
,ENST00000381551,ENST00000381580,ENST00000381583,ENST00000381605,ENST000
00382029,ENST00000382041,ENST00000382352,ENST00000382581,ENST00000382597
,ENST00000382599,ENST00000382615,ENST00000382751,ENST00000382868,ENST000
00382872,ENST00000382876,ENST00000382919,ENST00000382945,ENST00000382952
,ENST00000382972,ENST00000383071,ENST00000383072,ENST00000383078,ENST000
00383167,ENST00000383176,ENST00000383177,ENST00000383178,ENST00000383179
,ENST00000383203,ENST00000383204,ENST00000383205,ENST00000383210,ENST000
00383244,ENST00000383338,ENST00000383358,ENST00000383359,ENST00000383360
,ENST00000383362,ENST00000383367,ENST00000383439,ENST00000383458,ENST000
00383462,ENST00000383483,ENST00000383485,ENST00000383487,ENST00000383489
,ENST00000383509,ENST00000383568,ENST00000383601,ENST00000383603,ENST000
00383612,ENST00000383658,ENST00000383665,ENST00000383720,ENST00000383744
,ENST00000383745"/>
</Dataset>
<Links source = "hsapiens_gene_ensembl"
target = "hsapiens_gene_ensembl_structure"
defaultLink = "hsapiens_internal_transcript_id" />
<Dataset name = "hsapiens_gene_ensembl_structure">
<Attribute name = "gene_stable_id"/>
<Attribute name = "str_chrom_name"/>
<Attribute name = "biotype"/>
</Dataset>
<Links source = "hsapiens_gene_ensembl_structure"
target = "hsapiens_genomic_sequence"
defaultLink = "cdna" />
<Dataset name = "hsapiens_genomic_sequence">
<Attribute name = "cdna"/>
</Dataset>
</Query>
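If the ID list comes out of a program rather than a hand-edited file, the same query can be generated. The following Python sketch builds an equivalent document with the standard library; the dataset, filter, link and attribute names are copied from SequenceQuery.xml above, while `build_query` is a hypothetical helper name and the two IDs are just the first two from the list.

```python
import xml.etree.ElementTree as ET


def build_query(transcript_ids):
    """Build a query like SequenceQuery.xml for the given transcript IDs."""
    query = ET.Element("Query", virtualSchemaName="default", count="0")

    gene = ET.SubElement(query, "Dataset", name="hsapiens_gene_ensembl")
    # All IDs go into one comma-separated filter value, so one request suffices.
    ET.SubElement(gene, "Filter", name="ensembl_transcript_id",
                  value=",".join(transcript_ids))

    ET.SubElement(query, "Links", source="hsapiens_gene_ensembl",
                  target="hsapiens_gene_ensembl_structure",
                  defaultLink="hsapiens_internal_transcript_id")

    structure = ET.SubElement(query, "Dataset",
                              name="hsapiens_gene_ensembl_structure")
    for attribute in ("gene_stable_id", "str_chrom_name", "biotype"):
        ET.SubElement(structure, "Attribute", name=attribute)

    ET.SubElement(query, "Links", source="hsapiens_gene_ensembl_structure",
                  target="hsapiens_genomic_sequence", defaultLink="cdna")

    sequence = ET.SubElement(query, "Dataset", name="hsapiens_genomic_sequence")
    ET.SubElement(sequence, "Attribute", name="cdna")

    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n' + body


xml_text = build_query(["ENST00000005198", "ENST00000169298"])
print(xml_text)
```

The resulting text can be saved to a file and passed to webExample.pl in place of SequenceQuery.xml.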
-------------------------------------------------------------------------------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
-------------------------------------------------------------------------------