Re: [Bioc-devel] [devteam-bioc] suggestion about bioconductor package-- BSgenome.Hsapiens.NCBI.GRCh38

Hervé Pagès Wed, 16 Apr 2014 15:16:27 -0700

Hi Michael,

On 04/16/2014 02:33 PM, Michael Lawrence wrote:

Another interesting aspect of 2bit is that it supports simple masking.
I'm all for the 2bit direction. But then BSgenome would depend on
rtracklayer?


Any reason I should not do that? ;-)

Or have you reimplemented it?


Nope.

H.



On Wed, Apr 16, 2014 at 12:59 PM, Hervé Pagès <hpa...@fhcrc.org
<mailto:hpa...@fhcrc.org>> wrote:

    Hi Sean

    [hope you don't mind if I cc Bioc-devel]

    On 04/15/2014 11:47 PM, Maintainer wrote:

        Hi The Bioconductor Dev Team,
        A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been
        available
        for one week. But the single-sequence.fa.gz file in this package
        is too
        big. The sequences of all the chromosomes are put together. It is
        difficult to load.  Why do you remake this
        package as BSgenome.Hsapiens.UCSC.hg19 do? In
        BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
        separated.


    Yeah I know... sigh!!  ;-)

    All the BSgenome data packages use this new on disk storage format, not
    just GRCh38. All the chromosomes are now put together in a single
    FASTA file compressed in the RAZip format (extension is rz, not gz).
    The file is indexed so it makes direct random access faster. So for
    example, extracting only some small portions of the genome with
    getSeq() is much faster than with the previous on disk storage format
    where the chromosomes were serialized into individual files.
    Some people have manifested interest in having getSeq() do fast
    random access to the genome sequences for a while. See for example
    this post from Michael in 2011:

    https://stat.ethz.ch/__pipermail/bioc-devel/2011-May/__002601.html
    <https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html>

    I'm aware that the new on disk storage format slows down operations
    where you actually need to compute on the entire genome, which is
    typically done in a loop where one chromosome is loaded at a time.
    It's hard to beat the speed (and simplicity) of just loading the
    individual serialized DNAString objects in that case. This is
    unfortunate and I was a little bit worried about this when I pushed
    the new BSgenome packages to the public repos because I think this
    is a common use case.

    Note that the switch to this new format happened in devel in January
    and was announced on the Bioc-devel mailing list:

    https://stat.ethz.ch/__pipermail/bioc-devel/2014-__January/005150.html
    <https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html>

    I was hoping people would test this and maybe start a discussion about
    whether this change was worth it or not.

    The good news is that I think we can have the best of both worlds i.e.
    fast direct random access and fast loading of the full sequences.
    I did some testing with the 2bit format and it looks very promising:

       - Random access is much faster than with current RAZip'ed FASTA
         format,

       - Loading a full chromosome into memory is also very fast because
         there isn't the overhead of decompressing the data. It could
         even be that it's faster than loading the chromosome stored in
         a serialized DNAString object, or maybe not, but at least it
         should be very close.

       - The sequences take less space on disk and the resulting BSgenome
         package will also be slightly smaller.

    It's something that I was planning to work on early in this new devel
    cycle. Seems like I should start even earlier and maybe backport to
    BioC release?

    Thanks,
    H.


        Best,
        Sean

        2014-04-16
        
------------------------------__------------------------------__------------


        
____________________________________________________________________________
        devteam-bioc mailing list
        To unsubscribe from this mailing list send a blank email to
        devteam-bioc-leave@lists.__fhcrc.org
        <mailto:devteam-bioc-le...@lists.fhcrc.org>
        You can also unsubscribe or change your personal options at
        https://lists.fhcrc.org/__mailman/listinfo/devteam-bioc
        <https://lists.fhcrc.org/mailman/listinfo/devteam-bioc>


    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>

    _________________________________________________
    Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] [devteam-bioc] suggestion about bioconductor package-- BSgenome.Hsapiens.NCBI.GRCh38

Reply via email to