Hi,

This discussion went off-line and I wanted to give a summary of what we decided to go with.

We'll create a new package, BiocFile, that has a minimal API.

API:
- 'File' class (virtual, reference class) and constructor
- close / open / isOpen
- import / export
- file registry
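
As a rough illustration of the first two API items, here is a minimal sketch of what a virtual 'File' reference class with close / open / isOpen could look like. The TextFile subclass, the field names and the method bodies are assumptions made up for this example, not BiocFile code; import / export and the registry are left out.

```r
library(methods)

# Hypothetical virtual parent class; the field names are assumptions.
setRefClass("File", contains = "VIRTUAL",
    fields = list(path = "character", isopen = "logical"))

# Toy concrete subclass standing in for BamFile, FaFile, etc.
.TextFile <- setRefClass("TextFile", contains = "File")

# Constructor with the same name as the class.
TextFile <- function(path) .TextFile$new(path = path, isopen = FALSE)

# isOpen() is a plain function in base, so promote it to an S4 generic
# and add a method for the virtual parent.
setGeneric("isOpen")
setMethod("isOpen", "File", function(con, rw = "") con$isopen)

# open() and close() are S3 generics in base; reference semantics mean
# the change to the field is visible to the caller.
open.TextFile <- function(con, ...) {
    con$isopen <- TRUE
    invisible(con)
}
close.TextFile <- function(con, ...) {
    con$isopen <- FALSE
    invisible(con)
}
```

A concrete class would do real work in open() and close(), for example acquiring and releasing a file handle; the flag here only shows the state change.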

We won't require existing *File classes to implement yield but would 'recommend' that new *File classes do. By getting this structure in place we can guide future *File developments in a consistent direction even if we can't harmonize all current classes. I'll start work on this after the release.

Thanks again for the input.

Valerie

On 03/11/2014 10:23 PM, Michael Lawrence wrote:



On Tue, Mar 11, 2014 at 3:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:

    On 03/11/2014 02:52 PM, Hervé Pagès wrote:

        On 03/11/2014 09:57 AM, Valerie Obenchain wrote:

            Hi Herve,

            On 03/10/2014 10:31 PM, Hervé Pagès wrote:

                Hi Val,

                I think it would help understand the motivations behind this
                proposal if you could give an example of a method where the
                user cannot supply a file name but has to create a 'File' (or
                'FileList') object first. And how the file registry proposal
                below would help. It looks like you have such an example in
                the GenomicFileViews package. Do you think you could give
                more details?


            The most recent motivating use case was in creating subclasses of
            GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.).
            We wanted to have a general constructor, something like
            GenomicFileViews(), that would create the appropriate subclass.
            However, to create the correct subclass we needed to know if the
            files were bam, bw, fasta, etc. Recognition of the file type by
            extension would allow us to do this with no further input from
            the user.


        That helps, thanks!

        Having this kind of general constructor sounds like it could indeed
        be useful. It would be an opportunity to put all these *File classes
        (the 22 RTLFile subclasses defined in rtracklayer and the 5
        RsamtoolsFile subclasses defined in Rsamtools) under the same
        umbrella (i.e. a parent virtual class) and use the name of this
        virtual class (e.g. File) for the general constructor.

        Allowing a registration mechanism to extend the knowledge of this
        File() constructor is an implementation detail. I don't see a lot of
        benefit to it. Only a package that implements a concrete File
        subclass would actually need to register the new subclass. It sounds
        easy enough to ask whoever has commit access to the File() code to
        modify it. This kind of update might also require adding the name of
        the package where the new File subclass is implemented to the
        Depends/Imports/Suggests of the package where File() lives, which is
        something that cannot be done via a registration mechanism.


    This clean-up of the *File jungle would also be a good opportunity to:

       - Choose what we want to do with reference classes: use them for all
         the *File classes or for none of them. (Right now, those defined
         in Rsamtools are reference classes, and those defined in
         rtracklayer are not.)

       - Move the I/O functionality currently in rtracklayer to a
         separate package. Based on the number of contributed packages I
         reviewed so far that were trying to reinvent the wheel because
         they had no idea that the I/O function they needed was actually
         in rtracklayer, I'd like to advocate for using a package name
         that makes it very clear that it's all about I/O.



I can see some benefit in renaming/reorganizing, but if they weren't
able to perform a simple google search for functionality, I don't think
the name of the package was the problem. "read gff bioconductor" returns
rtracklayer as the top hit.


    H.



        H.



            Val


                Thanks,
                H.


                On 03/10/2014 08:46 PM, Valerie Obenchain wrote:

                    Hi all,

                    I'm soliciting feedback on the idea of a general file
                    'registry' that would identify file types by their
                    extensions. This is similar in spirit to FileForFormat()
                    in rtracklayer but a more general abstraction that could
                    be used across packages. The goal is to allow a user to
                    supply only file name(s) to a method instead of first
                    creating a 'File' class such as BamFile, FaFile,
                    BigWigFile etc.

                    A first attempt at this is in the GenomicFileViews package
                    (https://github.com/Bioconductor/GenomicFileViews).
                    A registry (lookup) is created as an environment at load
                    time:

                    .fileTypeRegistry <- new.env(parent=emptyenv())

                    Files are registered with an information triplet
                    consisting of class, package and regular expression to
                    identify the extension. In GenomicFileViews we register
                    FaFileList, BamFileList and BigWigFileList but any 'File'
                    class can be registered that has a constructor of the
                    same name.

                    .onLoad <- function(libname, pkgname)
                    {
                        registerFileType("FaFileList", "Rsamtools", "\\.fa$")
                        registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
                        registerFileType("BamFileList", "Rsamtools", "\\.bam$")
                        registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
                    }

                    The makeFileType() helper creates the appropriate class.
                    This function is used behind the scenes to do the lookup
                    and coerce to the correct 'File' class.

                      > makeFileType(c("foo.bam", "bar.bam"))
                    BamFileList of length 2
                    names(2): foo.bam bar.bam

                    New types can be added at any time with registerFileType():

                    registerFileType("NewClass", "NewPackage", "\\.NewExtension$")
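
                    To make the lookup concrete, here is a minimal sketch of
                    how such a registry could be implemented. It is not the
                    GenomicFileViews code: the function bodies, the
                    findFileType() helper and the optional 'package' argument
                    are assumptions made up for illustration.

                    ```r
                    # Registry: an environment created once, e.g. at load time.
                    .fileTypeRegistry <- new.env(parent = emptyenv())

                    # Store one (class, package, regex) triplet per entry.
                    registerFileType <- function(class, package, regex) {
                        key <- paste(package, class, regex, sep = "|")
                        entry <- list(class = class, package = package,
                                      regex = regex)
                        assign(key, entry, envir = .fileTypeRegistry)
                        invisible(TRUE)
                    }

                    # Hypothetical helper: find the single registered class
                    # whose regex matches every file name, optionally
                    # restricted to one package.
                    findFileType <- function(files, package = NULL) {
                        keys <- ls(.fileTypeRegistry, all.names = TRUE)
                        entries <- mget(keys, envir = .fileTypeRegistry)
                        if (!is.null(package))
                            entries <- Filter(
                                function(e) e$package == package, entries)
                        hits <- Filter(
                            function(e) all(grepl(e$regex, files)), entries)
                        classes <- unique(
                            vapply(hits, `[[`, character(1), "class"))
                        if (length(classes) != 1L)
                            stop("expected exactly one matching class, found ",
                                 length(classes))
                        hits[[1L]]
                    }

                    # Look up the constructor of the same name in the
                    # registered package and call it on the file names.
                    makeFileType <- function(files, package = NULL) {
                        type <- findFileType(files, package)
                        constructor <- getExportedValue(type$package,
                                                        type$class)
                        constructor(files)
                    }

                    registerFileType("FaFileList", "Rsamtools", "\\.fa$")
                    registerFileType("BamFileList", "Rsamtools", "\\.bam$")
                    registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")

                    findFileType(c("foo.bam", "bar.bam"))$class  # "BamFileList"
                    ```

                    Registration only stores strings, so the package that
                    defines a class is not loaded until makeFileType() calls
                    its constructor.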


                    Thoughts:

                    (1) If this sounds generally useful where should it live?
                    rtracklayer, GenomicFileViews or other? Alternatively it
                    could be its own lightweight package (FileRegister) that
                    creates the registry and provides the helpers. It would
                    be up to the package authors that depend on FileRegister
                    to register their own file types at load time.

                    (2) To avoid potential ambiguities maybe searching should
                    be by regex and package name. Still a work in progress.


                    Valerie

                    _________________________________________________
                    Bioc-devel@r-project.org mailing list
                    https://stat.ethz.ch/mailman/listinfo/bioc-devel






    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fhcrc.org
    Phone: (206) 667-5791
    Fax: (206) 667-1319




--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319
