Hi Michael,

On 01/15/2015 11:59 AM, Michael Lawrence wrote:
> My concern is mostly in user code not seen in Bioc svn.

I understand but the fate of that code is to get out of sync
sooner or later. And sooner rather than later if it relies on
undocumented behavior.

> But perhaps the partial sorting (by query) is sufficient for many
> of those.

It seems to be sufficient for more than 99.5% of the packages in
BioC svn :-)

Note that keeping Hits objects partially sorted instead of fully
sorted not only speeds up findOverlaps() but also basic operations
on Hits objects like union(), t(), etc...
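
To make the distinction concrete, here is a minimal base R sketch. It
uses plain integer vectors rather than a real Hits object; the names
q, s, o and full are hypothetical and chosen only for illustration. The
values are the hit pairs from the IRanges 2.1.35 output quoted further
down in this thread.

```r
# Hypothetical hit pairs, sorted by query only (values taken from the
# IRanges 2.1.35 output quoted in this thread).
q <- c(1L, 1L, 2L, 4L, 4L, 6L, 6L)   # query indices
s <- c(1L, 4L, 2L, 1L, 4L, 7L, 6L)   # subject indices

# Partial order: the query column alone is non-decreasing.
by_query_only <- !is.unsorted(q)

# Full order: order(q, s) would have to be the identity permutation.
fully_sorted <- !is.unsorted(order(q, s))

# Restoring the full (query, subject) order on the plain vectors.
o <- order(q, s)
full <- cbind(queryHits = q[o], subjectHits = s[o])
print(full)
```

On an actual Hits object the equivalent step is simply sort(), as
described in the quoted message below; the vectors above only stand in
for it.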

While we are at it, I should also mention what is new in BioC 3.1: a
Hits() constructor function, which takes care of partially sorting the
hits; selectHits(), which selects hits in the same way the 'select'
arg of findOverlaps() does; and all the comparison operations (==, <=,
order, sort, rank, etc.; see ?`Hits-comparison` in S4Vectors).

Cheers,
H.


On Thu, Jan 15, 2015 at 11:34 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:

    Hi guys,

    Indeed, the Hits object returned by findOverlaps() is not fully
    sorted anymore. Now it's sorted by query hit *only* and not by query
    hit *and* subject hit. Fully sorting a big Hits object has a high
    cost, both in terms of time and memory footprint. The partial
    sorting is *much* cheaper: it's done using a "tabulated sorting"
    algo implemented in C that works in linear time.
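
    The idea behind a linear-time sort by query can be sketched as a
    stable counting sort. This is only an illustration in base R under
    stated assumptions, not the C code used by findOverlaps(): the
    function sort_hits_by_query() and its arguments q, s and nq are
    hypothetical names, and hits are represented as two plain integer
    vectors.

    ```r
    # Stable counting sort of hit pairs by query index (base R only).
    # q, s: integer vectors of query and subject indices (hypothetical).
    # nq:   number of query ranges, i.e. the largest possible value in q.
    sort_hits_by_query <- function(q, s, nq) {
      counts <- tabulate(q, nbins = nq)           # number of hits per query
      next_slot <- cumsum(counts) - counts + 1L   # first output slot per query
      out_q <- integer(length(q))
      out_s <- integer(length(s))
      for (i in seq_along(q)) {                   # single pass over the hits
        k <- q[i]
        out_q[next_slot[k]] <- k
        out_s[next_slot[k]] <- s[i]               # ties keep their input order
        next_slot[k] <- next_slot[k] + 1L
      }
      list(queryHits = out_q, subjectHits = out_s)
    }

    res <- sort_hits_by_query(q = c(6L, 1L, 4L, 6L, 2L, 1L, 4L),
                              s = c(7L, 1L, 1L, 6L, 2L, 4L, 4L),
                              nq = 6L)
    print(cbind(queryHits = res$queryHits, subjectHits = res$subjectHits))
    ```

    The work is one tabulation plus one pass over the hits, so the cost
    grows linearly with the number of hits and queries. Subject indices
    within a query stay in whatever order they arrived in, which is why
    the result is only partially sorted.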

    The partial sorting is important: it allows a very common
    transformation like as(hits, "List") to be super fast. But the
    full sorting was overkill and generally not needed. Also note that
    the full sorting was never enforced via the validity method for
    Hits objects (and t(hits) was breaking that order in BioC < 3.1).
    Now the validity method for Hits enforces the partial sorting and
    t(hits) preserves it.

    There were only 3 or 4 packages that broke in devel because of
    that change (typically the change broke their unit tests). I fixed
    them (except Repitools, but it's still on my list). The fix is easy:
    if having the hits fully sorted matters, just use sort() on the Hits
    object. The man page for ?findOverlaps will soon be updated to
    reflect these changes.

    Cheers,
    H.



    On 01/15/2015 06:42 AM, Kasper Daniel Hansen wrote:

        Has it ever been documented that the return object is sorted in a
        specific way?  I just want to make sure we think about whether
        that is something we want to enforce, given the possibility of
        using a different algorithm in the future.

        We could also address this by implementing (perhaps it already
        exists) a sort() method for the return object.  That would still
        break existing code though.

        Best,
        Kasper

        On Wed, Jan 14, 2015 at 11:13 PM, Michael Lawrence
        <lawrence.mich...@gene.com> wrote:

            I bet there is a lot of code that depends on having the hits
            (conveniently) ordered by query, subject index, so we should
            try to restore the previous behavior.

            On Wed, Jan 14, 2015 at 8:00 PM, Dario Strbenac
            <dstr7...@uni.sydney.edu.au> wrote:

                Hello,

                For an identical query, the matrix results are in a different
                order. Consider the subject hits of the last two rows:

                mapping   # R Under development (unstable) (2015-01-13 r67453)
                          # and IRanges 2.1.35
                     queryHits subjectHits
                [1,]         1           1
                [2,]         1           4
                [3,]         2           2
                [4,]         4           1
                [5,]         4           4
                [6,]         6           7
                [7,]         6           6

                mapping   # R Under development (unstable) (2015-01-13 r67453)
                          # and IRanges 2.0.1
                     queryHits subjectHits
                [1,]         1           1
                [2,]         1           4
                [3,]         2           2
                [4,]         4           1
                [5,]         4           4
                [6,]         6           6
                [7,]         6           7

                This causes some values to be extracted in a different order
                by our annotationLookup function, and causes an error for the
                development version of Repitools on a test case which uses
                all.equal to compare a list to a correct list, but not for
                the release version which uses the release version of
                IRanges. Should I update the test case to have a new expected
                result, or is this new characteristic of findOverlaps likely
                to revert to the previous output soon?

                The two sets of intervals to produce this result are anno and
                probesGR, defined in the tests.R file in the Repitools
                package.

                --------------------------------------
                Dario Strbenac
                PhD Student
                University of Sydney
                Camperdown NSW 2050
                Australia
                _______________________________________________
                Bioc-devel@r-project.org mailing list
                https://stat.ethz.ch/mailman/listinfo/bioc-devel



    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fredhutch.org
    Phone:  (206) 667-5791
    Fax:    (206) 667-1319


