Hey guys,

Thanks, Rafa, for reminding us about the workshop minutes; they are a good 
place to start.

Andy, what I meant by a 'non-overlapping' strategy was not really dividing the 
segment into bins, but using the number of requested bins simply as the maximum 
number of features to return for the segment. The features can be selected at 
random or by size, and a feature is included in the response only if it does 
not overlap a previously selected feature. That is obviously not necessarily 
the most representative set of features, but I think it works as a very simple 
maxbins implementation, although I can see that it doesn't really fit the 
concept of bins.
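
A minimal sketch of that selection, only to make the idea concrete. It assumes 
features are plain (start, end) pairs with inclusive coordinates; the function 
names and the by_size/seed parameters are hypothetical and are not part of 
mydas or the DAS spec.

```python
import random
from typing import List, Optional, Tuple

# Assumed representation: (start, end) with inclusive coordinates.
Feature = Tuple[int, int]


def overlaps(a: Feature, b: Feature) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_non_overlapping(features: List[Feature], maxbins: int,
                           by_size: bool = False,
                           seed: Optional[int] = None) -> List[Feature]:
    """Return at most `maxbins` features, none of which overlap each other."""
    candidates = list(features)
    if by_size:
        # Largest features are considered first.
        candidates.sort(key=lambda f: f[1] - f[0], reverse=True)
    else:
        # Random order; `seed` only makes the choice repeatable.
        random.Random(seed).shuffle(candidates)

    selected: List[Feature] = []
    for feature in candidates:
        if len(selected) >= maxbins:
            break
        # Keep the feature only if it overlaps nothing chosen so far.
        if not any(overlaps(feature, chosen) for chosen in selected):
            selected.append(feature)
    return sorted(selected)
```

Note that maxbins acts here purely as an upper limit: the response can contain 
fewer features than requested, and nothing guarantees that they are spread 
along the segment.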

In the cases where the goal is to represent the score, the way to go IMO is to 
create a feature that represents the bin (avg, max or min). For those cases 
your question 1 is solved: because a feature contributes to every bin it 
covers, we can come up with a formula that considers the number of bases each 
feature contributes to a bin (see below), and in this way give a more 
representative ranking. In the cases where the score is meaningless, the same 
feature can be used in all the bins it covers.

About the rounding issue, I think that is entirely the server's problem, 
meaning that if the client asks for 700 bins, it shouldn't care how they were 
calculated, and the response should be the 700 features it requested.

From the server side we can divide the problem into cases based on continuous 
values, usually the score, and cases with discrete values (frequency of 
types, etc). I won't go into the discrete case in this email because I haven't 
thought about it :-)

**WARNING** Dunno how this is going to look in pure text.
**WARNING** calculation in the example may not be exact.

For the continuous case, even though conceptually it is not possible to speak 
about a fraction of a base, we can know the proportion of its contribution to 
a specific bin. From there we can calculate the score of the compendium 
feature (thinking of an average) that will represent the bin, using the 
formula:

S_b = ∑(score_n * contribution_n) / (length / #bins)

For instance, for a segment request of 7 bases with a maxbins of 3, we can have 
an annotation with per-base scores like this:

|0.5|0.4|0.5|0.4|0.4|0.5|0.6|

and then the score for the 3 bins will be:

S1 = (0.5*1    + 0.4*1 + 0.5*0.33) / 2.33 = 0.457
S2 = (0.5*0.66 + 0.4*1 + 0.4*0.66) / 2.33 = 0.426
S3 = (0.4*0.33 + 0.5*1 + 0.6*1)    / 2.33 = 0.528

So the client receives 3 bins as requested, and it is the client's 
responsibility to decide how to draw them; it asked for this number, so we can 
assume that it knows what it is doing.
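
The calculation above can be sketched in a few lines of Python. This is only 
an illustration under simple assumptions: one score per base, bases indexed 
from 0, and bin_scores as a hypothetical name.

```python
from typing import List


def bin_scores(scores: List[float], maxbins: int) -> List[float]:
    """Average per-base scores into `maxbins` equal-width bins.

    A base that straddles a bin boundary contributes to each bin in
    proportion to the fraction of the base that lies inside it.
    """
    width = len(scores) / maxbins  # kept as a float, never rounded
    result = []
    for b in range(maxbins):
        bin_start, bin_end = b * width, (b + 1) * width
        total = 0.0
        for n, score in enumerate(scores):
            # Base n is treated as the half-open interval [n, n + 1).
            contribution = min(bin_end, n + 1) - max(bin_start, n)
            if contribution > 0:
                total += score * contribution
        result.append(total / width)
    return result


example = bin_scores([0.5, 0.4, 0.5, 0.4, 0.4, 0.5, 0.6], 3)
print([round(s, 3) for s in example])
```

For the seven scores above this gives approximately 0.457, 0.429 and 0.529. 
The small differences from the hand-worked values come from the rounded 0.33, 
0.66 and 2.33.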

In the cases where there are positions with no annotations, or with more than 
one annotation per position, we can generalise the formula by changing the 
divisor to ∑(contribution_n):

S_b = ∑(score_n * contribution_n) / ∑(contribution_n)

I know that won't work for every case, but I think it is one of the strategies 
worth implementing, and the data provider can choose whether to use it.
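
A rough sketch of the generalised version, to show how gaps and overlapping 
annotations fall out of dividing by the summed contributions. It assumes 
annotations are (start, end, score) triples in half-open coordinates, which is 
not the DAS convention, and it returns None for a bin with no annotation at 
all; both choices, like the function name, are assumptions made for the 
example.

```python
from typing import List, Optional, Tuple

# Assumed representation: (start, end, score), half-open interval [start, end).
Annotation = Tuple[float, float, float]


def weighted_bin_scores(annotations: List[Annotation], seg_start: float,
                        seg_end: float, maxbins: int) -> List[Optional[float]]:
    """Contribution-weighted mean score per bin; None where nothing overlaps."""
    width = (seg_end - seg_start) / maxbins  # not rounded
    result: List[Optional[float]] = []
    for b in range(maxbins):
        bin_start = seg_start + b * width
        bin_end = bin_start + width
        weighted_sum = 0.0
        total_contribution = 0.0
        for start, end, score in annotations:
            # Number of bases (possibly fractional) this annotation has in the bin.
            contribution = min(bin_end, end) - max(bin_start, start)
            if contribution > 0:
                weighted_sum += score * contribution
                total_contribution += contribution
        if total_contribution > 0:
            result.append(weighted_sum / total_contribution)
        else:
            result.append(None)  # assumed convention for an empty bin
    return result
```

When every base carries exactly one annotation, the summed contribution of a 
bin equals length / #bins, so this reduces to the first formula.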

What do you guys think?

Cheers,

Gustavo.

On 6 Jul 2011, at 17:54, Andy Jenkinson wrote:

> Hi Gustavo,
> 
> I'm not sure what you mean by "non-overlapping" features exactly. For a 
> maxbins of 700, the thing to avoid IMO is returning the first (or random) 700 
> features because this does not divide the segment up into "bins". If all 700 
> are in the first bin (which might conceivably be represented by a single 
> pixel on the client), the client might still only be able to show one/a few, 
> whilst the rest of the pixels show nothing.
> 
> Each procedure should divide the segment into 700 bins, and sort features 
> into them by position. Some features might fall into only one bin, others 
> might fall into more than one and may or may not be overlapping. A filter 
> should then be applied to choose which features in each bin should be 
> returned. Such a filter might be based on score (e.g. highest, lowest, sum, 
> mean/median/mode averages), or some other factor. You could also create a new 
> feature representing all those in the bin (e.g. a feature count).
> 
> In my mind there are two complicating factors:
> 
> 1. How to decide which features to return if they are of different sizes and 
> cover a different number of bins.
> 
> Perhaps it makes sense if a feature is selected for one bin that it should be 
> selected for all the other bins it covers, for example, but what if the 
> algorithm is using highest score and the feature selected for the first bin 
> has a substantially lower score than another feature in the second bin?
> 
> 2. Rounding. If you're tired or losing interest it's probably best to stop 
> reading now...
> 
> Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. Assuming 
> bins are of equal size, each is 86.421 bases long. Obviously it's not 
> possible to express fractions of a base in DAS, so it is important that the 
> server and the client interpret this in the same way. Firstly it's important 
> not to round the bin size at the beginning, which would create an error in 
> the total length or number of bins. So the first bin is >= 12345 and <= 
> 12431.421, and the second is > 12431.421 and <= 12517.842. Which bin(s) does 
> X:12431 fall into? You might be tempted to say "easy, it's in bin 1". But 
> would your answer change if it was a feature at X:12400,12431 [which really 
> means X:12400,12431.99999], or a feature at X:12431,12500? Basically what I 
> am getting at is, do we count an end position as being 12431.0 or 
> 12431.9999999? I believe Ensembl does the former but am not 100% sure and 
> this is probably not strictly speaking accurate.
> 
> Sorry about the complicated numbers... 
> 
> Cheers,
> Andy
> 
> On 6 Jul 2011, at 14:53, Gustavo Salazar wrote:
> 
>> Hey guys,
>> 
>> One of the topics in the workshop was the idea of having a set of strategies 
>> for maxbins, and we said we would discuss it here... so this is my call to 
>> hear ideas about it. I might have some spare time soon, and if we get a 
>> couple of good strategies I can implement them in mydas as part of its core, 
>> so a datasource provider can choose to use one of the predefined strategies 
>> or to define a particular algorithm if they wish. 
>> 
>> I suppose the easiest maxbins strategy is to return X random 
>> non-overlapping features in the segment. 
>> 
>> Any other ideas?
>> 
>> Regards,
>> 
>> Gustavo.
>> 
>> 
>> _______________________________________________
>> DAS mailing list
>> [email protected]
>> http://lists.open-bio.org/mailman/listinfo/das
> 

