I think there just needs to be a consensus on how the conversion is done so that the server can give coordinates that make sense for how it will be drawn by the client.
I am investigating how Ensembl does it, which appears in brief to be: 1. multiply the position by the number of pixels per base pair 2. round down The way it does it is fairly complicated as I think it uses relative positions (i.e. segment start plus feature offset), converts back and forth from start + length rather than start + end, and some parts are zero-indexed whereas others are 1-indexed. But I *think* in effect it rounds both the start and end positions down. I have however come across some things that look strange which I am hoping the Ensembl web team will help me look into. Cheers, Andy On 11 Jul 2011, at 12:05, Gustavo Salazar wrote: > I finally understood what is the issue you were pointing at... i was looking > the problem just from the point of the view of calculating the score as a > result and not about the coordinates, somehow I always though in this as a > set of pseudo-features that are representing the original ones, where each > feature represents a bin... but now I see the point that cannot be like that > cause the clients are developed by working with feature coordinates and not > bin position. > So any suggestion of how to deal with these? > > > > On 7 Jul 2011, at 19:09, Andy Jenkinson wrote: > >> On 7 Jul 2011, at 16:45, Gustavo Salazar wrote: >> >>> Hey guys, >>> >>> Thanks Rafa for remind us the minutes of the workshop, it is a good place >>> to start. >>> >>> Andy, what I meant as 'non-overlaping' strategy, was no really dividing the >>> segment into bins, but using the numbers of requested bins just as the >>> maximum number of features to return in the segment, which in that case is >>> can be selected in a random way or by size, so the inclusion in the >>> response of one feature in conditioned to no overlap with a previous >>> selected feature. That obviously is no necessarily the most representative >>> set of features but again I think that works in order to having a very >>> simple maxbins implementation, although i can see that it doesn't really >>> goes with the concept of bins. >> >> What I am trying to get across is that "maxbins=700" is NOT "I want 700 >> features". It is "I have 700 pixels/subdivisions/something abstract to >> render this segment in". It is fine to have a simple binning strategy of >> "non-overlapping features" (although I don't like random selection as this >> will be inconsistent between page refreshes). But this does not mean that >> you will always get 700 features. The point is that the client is telling >> the server how much space the features will be rendered in, and it is up to >> the server to decide what is an appropriate number and distribution of >> features to return. So a more meaningful implementation of "non-overlapping >> features" would be to provide ALL non-overlapping features, where overlaps >> are BY BIN not by position. I may be missing something, but I can't see a >> case for wanting to give the first X features regardless of where they >> appear in the query segment. >> >>> In the cases where the goal is to represent the score the way to go IMO is >>> to create a feature that represent the bin(avg, max or min). And for those >>> cases your question 1 is solved, because the feature is contributing to all >>> the bins that is covering, we can think in a formula that consider the >>> number of bases that features contribute to a bin(see below), and in this >>> way give a more representative ranking. In the cases that the score is >>> meaningless, the same feature can be used in all the bins that is covering. >> >> This is one interpretation of a score-based algorithm, assuming that a >> feature is simply a score for the region of the genome. This is fine, and >> comes under the category of replacing the feature with a new one (although >> there is an issue of how notes/types etc are merged). However it is also >> quite feasible to have features that _have_ scores, but are not themselves >> _purely_ scores. Hypothetically, let's say genes are ranked by score, and as >> we zoom out we want to see the highest-ranking genes in the region. The >> algorithm would be expected to filter features, not modify/replace them. It >> is not an insurmountable problem, it just needs to be recognised that there >> is a difference between giving each bin a score, and choosing features in >> bins based on score. >> >>> About the rounding issue I think thats a complete problem of the server, >>> meaning that if the client is asking for 700 bins, it shouldn't care about >>> how it was calculated and the response should be the 700 features that he >>> requested. >> >> Again, 700 bins does NOT mean 700 features. Remember also that bins is an >> expression of something that is specific to the client. When rendering those >> bins, the client must convert between position and bin so it is important >> that it does so in the same way that the server did it. The client >> internally has one view of the positions that are represented by each bin, >> and if the rounding is not the same then it will render the feature "out by >> one bin". >> >> Consider this example. My client has divided a segment of 1001 bases into >> 1000 bins. Note 1 of these bases has to go into a bin with another base. >> After handing that to the server, let's imagine that the server assigns a >> score to each bin and returns a feature for each bin like this: >> feature 1 = start 1, end 2 = score 56 >> feature 2 = start 3, end 3 = score 14 >> feature 3 = start 4, end 4 = score 97 >> ... >> feature 999 = start 999, end 999 = score 11 >> feature 1000 = start 1000, end 1000 = score 65 >> >> The client then takes the start/end positions and converts them into >> rendering space: >> >> bin 1 = feature 1 = score 56 >> bin 2 = feature 1 = score 56 >> bin 3 = feature 2 = score 14 >> bin 4 = feature 3 = score 97 >> ... >> bin 1000 = feature 999 + feature 1000 = score ??? >> >> Now what has happened here is that the server and client have not used the >> same rounding mechanism for assigning positions into bins, and thus produce >> different results. >> >>> From the side of the server we can divide the problem in the ones based on >>> continuos values, usually based on the score, and the ones with discrete >>> values(frequency of types, etc). I wont go in this email to the discrete >>> case because i haven't though about it :-) >>> >>> **WARNING** Dunno how this is going to look in pure text. >>> **WARNING** calculation in the example may not be exact. >>> >>> For the continuos case, even that conceptually speaking is not possible to >>> speak about a fraction of a base, we can know what is the proportion of its >>> contribution for an specific bean. And from there we can calculate the >>> score of the compendium feature(thinking in average), which will represent >>> the bin by the formulae: >> >> Yes this is a consideration for the algorithm of how to derive a score for a >> bin, but it is an advanced implementation detail of the algorithm. It is >> largely orthogonal to what I mean by consistency of rounding - that is, >> deciding what start/end positions to _return_ in the DAS XML for a >> particular bin (which must be expressed as a whole number). Your features >> response is stating something like "bases X to Y have a score of Z". Whilst >> you can divide bases into multiple bins to derive the score, you still have >> to say which _whole_ bases that score applies to. >> >> However what I would say about the below algorithm is that, given that you >> have to express a region that the score applies to in whole numbers, >> dividing by "contribution" ends up presenting something that is not entirely >> true because the score for a feature also includes "a bit of the next base >> too". You're also making an assumption that a score for base 12345 is >> contributing 0.25 of its score to 12344.75, which is only one way of looking >> at it (as I was saying in my last email). In reality a start position of >> 12345 usually means "12345.0 and upwards" whereas an end position of 12345 >> usually means "everything less than 12346.0". For these reasons, I think >> weighting by contribution is not something I'd be keen on implementing. >> >>> >>> Sb = ∑(scoren * contributionn) / (length/#bins) >>> >>> For instance, for a segment request of 7 bases with maxbins of 3 we can >>> have an annotation with its score per base like this: >>> >>> |0.5|0.4|0.5|0.4|0.4|0.5|0.6| >>> >>> and then the score for the 3 bins will be: >>> >>> S1 = (0.5*1 + 0.4*1 + 0.5*0.33)/2.33 = 0.457 >>> S2 = (0.5*0.66 + 0.4*1 + 0.4*0.66)/2.33 = 0.426 >>> S3 = (0.4*0.33 + 0.5*1 + 0.6*1)/2.33 = 0.528 >>> >>> So the client receives 3 bins as requested an is its responsibility how to >>> draw those but he asked for this number so we can assume that the client >>> knows what it is doing. >>> >>> In the cases where there are positions with non annotations or more than >>> one annotation per position, we may generalise the formulae by changing the >>> divisor to ∑(contributionn) >>> >>> Sb = ∑(scoren * contributionn) / ∑(contributionn) >>> >>> I know that won't work for all the cases but I think it is one of the >>> strategies to implement and that the data provider can choose to use or not. >>> >>> What do you guys think? >>> >>> Cheers, >>> >>> Gustavo. >>> >>> On 6 Jul 2011, at 17:54, Andy Jenkinson wrote: >>> >>>> Hi Gustavo, >>>> >>>> I'm not sure what you mean by "non-overlapping" features exactly. For a >>>> maxbins of 700, the thing to avoid IMO is returning the first (or random) >>>> 700 features because this does not divide the segment up into "bins". If >>>> all 700 are in the first bin (which might conceivably be represented by a >>>> single pixel on the client), the client might still only be able to show >>>> one/a few, whilst the rest of the pixels show nothing. >>>> >>>> Each procedure should divide the segment into 700 bins, and sort features >>>> into them by position. Some features might fall into only one bin, others >>>> might fall into more than one and may or may not be overlapping. A filter >>>> should then be applied to choose which features in each bin should be >>>> returned. Such a filter might be based on score (e.g. highest, lowest, >>>> sum, mean/median/mode averages), or some other factor. You could also >>>> create a new feature representing all those in the bin (e.g. a feature >>>> count). >>>> >>>> In my mind there are two complicating factors: >>>> >>>> 1. How to decide which features to return if they are of different sizes >>>> and cover a different number of bins. >>>> >>>> Perhaps it makes sense if a feature is selected for one bin that it should >>>> be selected for all the other bins it covers, for example, but what if the >>>> algorithm is using highest score and the feature selected for the first >>>> bin has a substantially lower score than another feature in the second bin? >>>> >>>> 2. Rounding. If you're tired or losing interest it's probably best to stop >>>> reading now... >>>> >>>> Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. >>>> Assuming bins are of equal size, each is 86.421 bases long. Obviously it's >>>> not possible to express fractions of a base in DAS, so it is important >>>> that the server and the client interpret this in the same way. Firstly >>>> it's important not to round the bin size at the beginning, which would >>>> create an error in the total length or number of bins. So the first bin is >>>> >= 12345 and <= 12431.421, and the second is > 2431.421 and <= 12517.842. >>>> Which bin(s) does X:12431 fall into? You might be tempted to say "easy, >>>> it's in bin 1". But would your answer change if it was a feature at >>>> X:12400,12431 [which really means X:12400,12431.99999], or a feature at >>>> X:12431,12500? Basically what I am getting at is, do we count an end >>>> position as being 12431.0 or 12431.9999999? I believe Ensembl does the >>>> former but am not 100% sure and this is probably not strictly speaking >>>> accurate. >>>> >>>> Sorry about the complicated numbers... >>>> >>>> Cheers, >>>> Andy >>>> >>>> On 6 Jul 2011, at 14:53, Gustavo Salazar wrote: >>>> >>>>> Hey guys, >>>>> >>>>> One of the topics in the workshop was the idea of having a set of >>>>> strategies for maxbins, and we said we will discuss it here... so this is >>>>> my call to hear ideas about it, i might have a some spare time soon and >>>>> if we get a couple of good strategies I can implement them in mydas as >>>>> part of its core, so a datasource provider can choose to use one of the >>>>> predefined strategies or to define a particular algorithm if is their >>>>> wish. >>>>> >>>>> I suppose the easiest maxbins strategy is to return the X random >>>>> non-overlaping features in he segment. >>>>> >>>>> Any other Ideas? >>>>> >>>>> Regards, >>>>> >>>>> Gustavo. >>>>> >>>>> >>>>> _______________________________________________ >>>>> DAS mailing list >>>>> [email protected] >>>>> http://lists.open-bio.org/mailman/listinfo/das >>>> >>> >> > _______________________________________________ DAS mailing list [email protected] http://lists.open-bio.org/mailman/listinfo/das
