Arek Kasprzyk wrote:
On 23 Aug 2006, at 15:48, Tom Oinn wrote:

Damian Smedley wrote:

How about a clear policy as to what forms of access are legal? A sensible service interface suggests that bulk querying is legitimate, surely?
I don't want to ban people from doing bulk querying, though putting all the IDs into one query is obviously much more efficient.
I'm not sure it's always equivalent, though. I agree it's a problem; obviously if the server is going down you need to do something to resolve that, but hopefully we can work out some kind of best practice or code change in Taverna to help out as well.

David Withers is our developer for the BioMart side of things; I now know relatively little about how it works internally, but I believe he's on the list as well :)

Cheers,

Tom


ok, some more details about this problem. I hope we can work this out together, as we do not want to ban anybody from doing anything but simply to optimize the access so it works in an optimal way for Taverna as well as for us.
(apologies for the massive cross-posting, but I'm not sure which list all the relevant people are subscribed to :)) Please feel free to redirect, narrow down this discussion, or even reject it if you do not recognize the Taverna request pattern :)

This should be going to taverna-hackers; everyone appropriate is on there, I think. Add mart-dev if there are people at your end who need to see it.


ok, here it goes:
The BioMart central server went down twice after a series of over 100,000 requests coming from a single source over a relatively short period of time. After analyzing the access logs and contacting the people who were firing those requests, it seems that the requests originated from Taverna workflows.

the requests came in the following pattern:

<snip>

After further analyzing the logs, it seems that those users wanted sequences for ~300 Ensembl transcripts. This in itself is a perfectly valid and sensible use case. However, what is unclear to me is why it is necessary to request each sequence individually and, more importantly, why for each query the software (Taverna?) needs to undergo a full configuration (as above). Surely this could be done once and then be followed either by individual queries, if necessary, or better still by fewer queries doing requests in batches. This is normally a lightweight and sensible request when done properly. For comparison, I enclose below an example of exactly the same usage but sent as a single query, and a small perl script which quickly and harmlessly retrieves it from our web service, so you can run and compare.
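
The enclosed example and perl script aren't reproduced here, but roughly speaking a single batched request to the martservice could look like the Python sketch below; the endpoint URL, dataset, filter and attribute names are illustrative guesses rather than the exact ones from the example.

#!/usr/bin/env python3
# Rough sketch only: fetch data for a whole batch of Ensembl transcript IDs
# with a single BioMart martservice request, instead of one configure/query
# cycle per ID. The endpoint URL, dataset, filter and attribute names are
# illustrative assumptions.

import urllib.parse
import urllib.request

MARTSERVICE_URL = "http://www.biomart.org/biomart/martservice"  # assumed endpoint

transcript_ids = ["ENST00000380152", "ENST00000544455", "ENST00000371953"]

# All IDs go into one comma-separated filter value; the transcript ID is also
# requested as an attribute so each output row can be tied back to its input.
query_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="ensembl_transcript_id" value="{ids}"/>
    <Attribute name="ensembl_transcript_id"/>
    <Attribute name="cdna"/>
  </Dataset>
</Query>""".format(ids=",".join(transcript_ids))

data = urllib.parse.urlencode({"query": query_xml}).encode("utf-8")
with urllib.request.urlopen(MARTSERVICE_URL, data=data) as response:
    response_text = response.read().decode("utf-8")

print(response_text)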

In this case that sounds plausible. The case where you want to run one query per identifier is where you're getting more than one result returned per input and want to maintain the mapping from input to output. This could be handled by altering the workflow as well, but the most obvious way to use the BioMart processor within Taverna will tend to make lots of distinct queries.
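
If the batched query also asks for the input identifier as one of its attributes, that mapping can be rebuilt on the client side instead of being preserved by firing one query per ID. A rough sketch, assuming tab-separated output with the queried identifier in the first column:

from collections import defaultdict

def group_by_input_id(tsv_text):
    """Rebuild the input -> output mapping from a batched, tab-separated
    response whose first column is the identifier that was filtered on."""
    mapping = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        ident, _, rest = line.partition("\t")
        mapping[ident].append(rest)
    return mapping

# Usage (hypothetical): rows = group_by_input_id(response_text)
#   rows["ENST00000380152"] -> every row returned for that transcript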

The issue with retrieval of dataset configs has, I think, been fixed in CVS, but David can confirm or deny that. That should massively reduce the number of queries once we deploy the new code.

Cheers,

Tom
