Re: [DAS] maxbins

Andy Jenkinson Wed, 13 Jul 2011 06:28:35 -0700

I think there just needs to be a consensus on how the conversion is done so 
that the server can give coordinates that make sense for how it will be drawn 
by the client.


I am investigating how Ensembl does it, which appears in brief to be:
1. multiply the position by the number of pixels per base pair
2. round down
The way it does it is fairly complicated as I think it uses relative positions 
(i.e. segment start plus feature offset), converts back and forth from start + 
length rather than start + end, and some parts are zero-indexed whereas others 
are 1-indexed. But I *think* in effect it rounds both the start and end 
positions down. I have however come across some things that look strange which 
I am hoping the Ensembl web team will help me look into.

Cheers,
Andy

On 11 Jul 2011, at 12:05, Gustavo Salazar wrote:

> I finally understood what is the issue you were pointing at... i was looking 
> the problem just from the point of the view of calculating the score as a 
> result and not about the coordinates, somehow I always though in this as a 
> set of pseudo-features that are representing the original ones, where each 
> feature represents a bin... but now I see the point that cannot be like that 
> cause the clients are developed by working with feature coordinates and not 
> bin position.
> So any suggestion of how to deal with these?
> 
> 
> 
> On 7 Jul 2011, at 19:09, Andy Jenkinson wrote:
> 
>> On 7 Jul 2011, at 16:45, Gustavo Salazar wrote:
>> 
>>> Hey guys,
>>> 
>>> Thanks Rafa for remind us the minutes of the workshop, it is a good place 
>>> to start.
>>> 
>>> Andy, what I meant as 'non-overlaping' strategy, was no really dividing the 
>>> segment into bins, but using the numbers of requested bins just as the 
>>> maximum number of features to return in the segment, which in that case is 
>>> can be selected in a random way or by size, so the inclusion in the 
>>> response of one feature in conditioned to no overlap with a previous 
>>> selected feature. That obviously is no necessarily the most representative 
>>> set of features but again I think that works in order to having a very 
>>> simple maxbins implementation, although i can see that it doesn't really 
>>> goes with the concept of bins.
>> 
>> What I am trying to get across is that "maxbins=700" is NOT "I want 700 
>> features". It is "I have 700 pixels/subdivisions/something abstract to 
>> render this segment in". It is fine to have a simple binning strategy of 
>> "non-overlapping features" (although I don't like random selection as this 
>> will be inconsistent between page refreshes). But this does not mean that 
>> you will always get 700 features. The point is that the client is telling 
>> the server how much space the features will be rendered in, and it is up to 
>> the server to decide what is an appropriate number and distribution of 
>> features to return. So a more meaningful implementation of "non-overlapping 
>> features" would be to provide ALL non-overlapping features, where overlaps 
>> are BY BIN not by position. I may be missing something, but I can't see a 
>> case for wanting to give the first X features regardless of where they 
>> appear in the query segment.
>> 
>>> In the cases where the goal is to represent the score the way to go IMO is 
>>> to create a feature that represent the bin(avg, max or min). And for those 
>>> cases your question 1 is solved, because the feature is contributing to all 
>>> the bins that is covering, we can think in a formula that consider the 
>>> number of bases that features contribute to a bin(see below), and in this 
>>> way give a more representative ranking. In the cases that the score is 
>>> meaningless, the same feature can be used in all the bins that is covering.
>> 
>> This is one interpretation of a score-based algorithm, assuming that a 
>> feature is simply a score for the region of the genome. This is fine, and 
>> comes under the category of replacing the feature with a new one (although 
>> there is an issue of how notes/types etc are merged). However it is also 
>> quite feasible to have features that _have_ scores, but are not themselves 
>> _purely_ scores. Hypothetically, let's say genes are ranked by score, and as 
>> we zoom out we want to see the highest-ranking genes in the region. The 
>> algorithm would be expected to filter features, not modify/replace them. It 
>> is not an insurmountable problem, it just needs to be recognised that there 
>> is a difference between giving each bin a score, and choosing features in 
>> bins based on score.
>> 
>>> About the rounding issue I think thats a complete problem of the server, 
>>> meaning that if the client is asking for 700 bins, it shouldn't care about 
>>> how it was calculated and the response should be the 700 features that he 
>>> requested.
>> 
>> Again, 700 bins does NOT mean 700 features. Remember also that bins is an 
>> expression of something that is specific to the client. When rendering those 
>> bins, the client must convert between position and bin so it is important 
>> that it does so in the same way that the server did it. The client 
>> internally has one view of the positions that are represented by each bin, 
>> and if the rounding is not the same then it will render the feature "out by 
>> one bin".
>> 
>> Consider this example. My client has divided a segment of 1001 bases into 
>> 1000 bins. Note 1 of these bases has to go into a bin with another base. 
>> After handing that to the server, let's imagine that the server assigns a 
>> score to each bin and returns a feature for each bin like this:
>> feature 1 = start 1, end 2 = score 56
>> feature 2 = start 3, end 3 = score 14
>> feature 3 = start 4, end 4 = score 97
>> ...
>> feature 999 = start 999, end 999 = score 11
>> feature 1000 = start 1000, end 1000 = score 65
>> 
>> The client then takes the start/end positions and converts them into 
>> rendering space:
>> 
>> bin 1 = feature 1 = score 56
>> bin 2 = feature 1 = score 56
>> bin 3 = feature 2 = score 14
>> bin 4 = feature 3 = score 97
>> ...
>> bin 1000 = feature 999 + feature 1000 = score ???
>> 
>> Now what has happened here is that the server and client have not used the 
>> same rounding mechanism for assigning positions into bins, and thus produce 
>> different results.
>> 
>>> From the side of the server we can divide the problem in the ones based on 
>>> continuos values, usually based on the score,  and the ones with discrete 
>>> values(frequency of types, etc). I wont go in this email to the discrete 
>>> case because i haven't though about it :-)
>>> 
>>> **WARNING** Dunno how this is going to look in pure text.
>>> **WARNING** calculation in the example may not be exact.
>>> 
>>> For the continuos case, even that conceptually speaking is not possible to 
>>> speak about a fraction of a base, we can know what is the proportion of its 
>>> contribution for an specific bean. And from there we can calculate the 
>>> score of the compendium feature(thinking in average), which will represent 
>>> the bin by the formulae:
>> 
>> Yes this is a consideration for the algorithm of how to derive a score for a 
>> bin, but it is an advanced implementation detail of the algorithm. It is 
>> largely orthogonal to what I mean by consistency of rounding - that is, 
>> deciding what start/end positions to _return_ in the DAS XML for a 
>> particular bin (which must be expressed as a whole number). Your features 
>> response is stating something like "bases X to Y have a score of Z". Whilst 
>> you can divide bases into multiple bins to derive the score, you still have 
>> to say which _whole_ bases that score applies to.
>> 
>> However what I would say about the below algorithm is that, given that you 
>> have to express a region that the score applies to in whole numbers, 
>> dividing by "contribution" ends up presenting something that is not entirely 
>> true because the score for a feature also includes "a bit of the next base 
>> too". You're also making an assumption that a score for base 12345 is 
>> contributing 0.25 of its score to 12344.75, which is only one way of looking 
>> at it (as I was saying in my last email). In reality a start position of 
>> 12345 usually means "12345.0 and upwards" whereas an end position of 12345 
>> usually means "everything less than 12346.0". For these reasons, I think 
>> weighting by contribution is not something I'd be keen on implementing.
>> 
>>> 
>>> Sb = ∑(scoren * contributionn) / (length/#bins)
>>> 
>>> For instance, for a segment  request of 7 bases with maxbins of 3 we can 
>>> have an annotation with its score per base like this:
>>> 
>>> |0.5|0.4|0.5|0.4|0.4|0.5|0.6|
>>> 
>>> and then the score for the 3 bins will be:
>>> 
>>> S1 = (0.5*1 + 0.4*1 + 0.5*0.33)/2.33        = 0.457
>>> S2 = (0.5*0.66 + 0.4*1 + 0.4*0.66)/2.33     = 0.426
>>> S3 = (0.4*0.33 + 0.5*1 + 0.6*1)/2.33        = 0.528
>>> 
>>> So the client receives 3 bins as requested an is its responsibility how to 
>>> draw those but he asked for this number so we can assume that the client 
>>> knows what it is doing.
>>> 
>>> In the cases where there are positions with non annotations or more than 
>>> one annotation per position, we may generalise the formulae by changing the 
>>> divisor to ∑(contributionn)
>>> 
>>> Sb = ∑(scoren * contributionn) / ∑(contributionn)
>>> 
>>> I know that won't work for all the cases but I think it is one of the 
>>> strategies to implement and that the data provider can choose to use or not.
>>> 
>>> What do you guys think?
>>> 
>>> Cheers,
>>> 
>>> Gustavo.
>>> 
>>> On 6 Jul 2011, at 17:54, Andy Jenkinson wrote:
>>> 
>>>> Hi Gustavo,
>>>> 
>>>> I'm not sure what you mean by "non-overlapping" features exactly. For a 
>>>> maxbins of 700, the thing to avoid IMO is returning the first (or random) 
>>>> 700 features because this does not divide the segment up into "bins". If 
>>>> all 700 are in the first bin (which might conceivably be represented by a 
>>>> single pixel on the client), the client might still only be able to show 
>>>> one/a few, whilst the rest of the pixels show nothing.
>>>> 
>>>> Each procedure should divide the segment into 700 bins, and sort features 
>>>> into them by position. Some features might fall into only one bin, others 
>>>> might fall into more than one and may or may not be overlapping. A filter 
>>>> should then be applied to choose which features in each bin should be 
>>>> returned. Such a filter might be based on score (e.g. highest, lowest, 
>>>> sum, mean/median/mode averages), or some other factor. You could also 
>>>> create a new feature representing all those in the bin (e.g. a feature 
>>>> count).
>>>> 
>>>> In my mind there are two complicating factors:
>>>> 
>>>> 1. How to decide which features to return if they are of different sizes 
>>>> and cover a different number of bins.
>>>> 
>>>> Perhaps it makes sense if a feature is selected for one bin that it should 
>>>> be selected for all the other bins it covers, for example, but what if the 
>>>> algorithm is using highest score and the feature selected for the first 
>>>> bin has a substantially lower score than another feature in the second bin?
>>>> 
>>>> 2. Rounding. If you're tired or losing interest it's probably best to stop 
>>>> reading now...
>>>> 
>>>> Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. 
>>>> Assuming bins are of equal size, each is 86.421 bases long. Obviously it's 
>>>> not possible to express fractions of a base in DAS, so it is important 
>>>> that the server and the client interpret this in the same way. Firstly 
>>>> it's important not to round the bin size at the beginning, which would 
>>>> create an error in the total length or number of bins. So the first bin is 
>>>> >= 12345 and <= 12431.421, and the second is > 2431.421 and <= 12517.842. 
>>>> Which bin(s) does X:12431 fall into? You might be tempted to say "easy, 
>>>> it's in bin 1". But would your answer change if it was a feature at 
>>>> X:12400,12431 [which really means X:12400,12431.99999], or a feature at 
>>>> X:12431,12500? Basically what I am getting at is, do we count an end 
>>>> position as being 12431.0 or 12431.9999999? I believe Ensembl does the 
>>>> former but am not 100% sure and this is probably not strictly speaking 
>>>> accurate.
>>>> 
>>>> Sorry about the complicated numbers... 
>>>> 
>>>> Cheers,
>>>> Andy
>>>> 
>>>> On 6 Jul 2011, at 14:53, Gustavo Salazar wrote:
>>>> 
>>>>> Hey guys,
>>>>> 
>>>>> One of the topics in the workshop was the idea of having a set of 
>>>>> strategies for maxbins, and we said we will discuss it here... so this is 
>>>>> my call to hear ideas about it, i might have a some spare time soon and 
>>>>> if we get a couple of good strategies I can implement them in mydas as 
>>>>> part of its core, so a datasource provider can choose to use one of the 
>>>>> predefined strategies or to define a particular algorithm if is their 
>>>>> wish. 
>>>>> 
>>>>> I suppose the easiest maxbins strategy is to return the X random 
>>>>> non-overlaping features in he segment. 
>>>>> 
>>>>> Any other Ideas?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Gustavo.
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> DAS mailing list
>>>>> [email protected]
>>>>> http://lists.open-bio.org/mailman/listinfo/das
>>>> 
>>> 
>> 
> 


_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das

Re: [DAS] maxbins

Reply via email to