Hi,
This discussion went off-line and I wanted to give a summary of what we
decided to go with.
We'll create a new package, BiocFile, that has a minimal API.
API:
- 'File' class (virtual, reference class) and constructor
- close / open / isOpen
- import / export
- file registry
We won't require existing *File classes to implement yield but would
'recommend' that new *File classes do. By getting this structure in
place we can guide future *File developments in a consistent direction
even if we can't harmonize all current classes. I'll start work on this
after the release.
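For illustration only, the API outline above could be sketched roughly as follows. This is not the planned BiocFile code: the field names (path, opened), the method bodies, and the DemoFile subclass are assumptions made up for the example, and open / close / isOpen are written as reference class methods purely to keep the sketch self-contained.

```r
library(methods)

# Hypothetical virtual parent class; fields and method bodies are assumptions.
File <- setRefClass("File",
    contains = "VIRTUAL",
    fields = list(path = "character", opened = "logical"),
    methods = list(
        open = function() {
            opened <<- TRUE      # record the state; a real class would open a handle
            invisible(.self)
        },
        close = function() {
            opened <<- FALSE
            invisible(.self)
        },
        isOpen = function() isTRUE(opened)
    )
)

# Hypothetical concrete subclass standing in for a real *File class.
DemoFile <- setRefClass("DemoFile", contains = "File")

f <- DemoFile$new(path = "foo.demo", opened = FALSE)
f$open()
state_after_open <- f$isOpen()
f$close()
state_after_close <- f$isOpen()
```

Because these are reference classes, open() and close() change the object in place, which is the behaviour one would want for something that wraps a file handle.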
Thanks again for the input.
Valerie
On 03/11/2014 10:23 PM, Michael Lawrence wrote:
On Tue, Mar 11, 2014 at 3:33 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
On 03/11/2014 02:52 PM, Hervé Pagès wrote:
On 03/11/2014 09:57 AM, Valerie Obenchain wrote:
Hi Herve,
On 03/10/2014 10:31 PM, Hervé Pagès wrote:
Hi Val,
I think it would help understand the motivations behind this proposal
if you could give an example of a method where the user cannot supply
a file name but has to create a 'File' (or 'FileList') object first.
And how the file registry proposal below would help.
It looks like you have such an example in the GenomicFileViews package.
Do you think you could give more details?
The most recent motivating use case was in creating subclasses of
GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted
to have a general constructor, something like GenomicFileViews(), that
would create the appropriate subclass. However to create the correct
subclass we needed to know if the files were bam, bw, fasta etc.
Recognition of the file type by extension would allow us to do this with
no further input from the user.
That helps, thanks!
Having this kind of general constructor sounds like it could indeed be
useful. It would be an opportunity to put all these *File classes (the 22
RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
subclasses defined in Rsamtools) under the same umbrella (i.e. a parent
virtual class) and use the name of this virtual class (e.g. File) for
the general constructor.
Allowing a registration mechanism to extend the knowledge of this File()
constructor is an implementation detail. I don't see a lot of benefit to
it. Only a package that implements a concrete File subclass would
actually need to register the new subclass. Sounds easy enough to ask
whoever has commit access to the File() code to modify it. This
kind of update might also require adding the name of the package where
the new File subclass is implemented to the Depends/Imports/Suggests
of the package where File() lives, which is something that cannot be
done via a registration mechanism.
This clean-up of the *File jungle would also be a good opportunity to:
- Choose what we want to do with reference classes: use them for all
the *File classes or for none of them. (Right now, those defined
in Rsamtools are reference classes, and those defined in
rtracklayer are not.)
- Move the I/O functionality currently in rtracklayer to a
separate package. Based on the number of contributed packages I
reviewed so far that were trying to reinvent the wheel because
they had no idea that the I/O function they needed was actually
in rtracklayer, I'd like to advocate for using a package name
that makes it very clear that it's all about I/O.
I can see some benefit in renaming/reorganizing, but if they weren't
able to perform a simple google search for functionality, I don't think
the name of the package was the problem. "read gff bioconductor" returns
rtracklayer as the top hit.
H.
H.
Val
Thanks,
H.
On 03/10/2014 08:46 PM, Valerie Obenchain wrote:
Hi all,
I'm soliciting feedback on the idea of a general file 'registry' that
would identify file types by their extensions. This is similar in spirit
to FileForFormat() in rtracklayer but a more general abstraction that
could be used across packages. The goal is to allow a user to supply
only file name(s) to a method instead of first creating a 'File' class
such as BamFile, FaFile, BigWigFile etc.
A first attempt at this is in the GenomicFileViews package
(https://github.com/Bioconductor/GenomicFileViews).
A registry (lookup) is created as an environment at load time:
.fileTypeRegistry <- new.env(parent=emptyenv())
Files are registered with an information triplet consisting of class,
package and regular expression to identify the extension. In
GenomicFileViews we register FaFileList, BamFileList and BigWigFileList
but any 'File' class can be registered that has a constructor of the
same name.
.onLoad <- function(libname, pkgname)
{
    registerFileType("FaFileList", "Rsamtools", "\\.fa$")
    registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
    registerFileType("BamFileList", "Rsamtools", "\\.bam$")
    registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
}
The makeFileType() helper creates the appropriate class. This function
is used behind the scenes to do the lookup and coerce to the correct
'File' class.
> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2
names(2): foo.bam bar.bam
New types can be added at any time with registerFileType():
registerFileType(NewClass, NewPackage, "\\.NewExtension$")
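To make the mechanism concrete, here is a minimal base-R sketch of how an environment-backed registry of this kind could work. It is not the GenomicFileViews implementation: the body of registerFileType() and the findFileType() helper are assumptions written for illustration, and the lookup only returns the registered class/package pair rather than constructing an object, so it runs without Rsamtools or rtracklayer installed.

```r
# Hypothetical re-implementation for illustration; not the package code.
.fileTypeRegistry <- new.env(parent = emptyenv())

registerFileType <- function(class, package, regex) {
    # Key each entry on its regex so one pattern maps to one class/package pair.
    assign(regex, list(class = class, package = package),
           envir = .fileTypeRegistry)
    invisible(regex)
}

# Hypothetical lookup helper: find the single pattern matching all files.
findFileType <- function(files) {
    patterns <- ls(.fileTypeRegistry, all.names = TRUE)
    hits <- vapply(patterns, function(p) all(grepl(p, files)), logical(1))
    if (sum(hits) != 1L)
        stop("expected exactly one matching file type, found ", sum(hits))
    get(patterns[hits], envir = .fileTypeRegistry)
}

registerFileType("FaFileList", "Rsamtools", "\\.fa$")
registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
registerFileType("BamFileList", "Rsamtools", "\\.bam$")
registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")

hit <- findFileType(c("foo.bam", "bar.bam"))
```

A makeFileType()-style helper could then fetch the constructor named hit$class from the package named hit$package and call it on the file names; that step is left out here because it needs the packages to be installed.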
Thoughts:
(1) If this sounds generally useful where should it live? rtracklayer,
GenomicFileViews or other? Alternatively it could be its own lightweight
package (FileRegister) that creates the registry and provides the
helpers. It would be up to the package authors that depend on
FileRegister to register their own file types at load time.
(2) To avoid potential ambiguities maybe searching should be by regex
and package name. Still a work in progress.
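As a rough illustration of point (2), the sketch below shows how two packages registering the same extension would make a regex-only lookup ambiguous, and how adding the package name resolves it. Everything here is hypothetical: the data.frame layout, the lookupFileType() helper, and the "OtherBamList" / "otherPkg" entries are invented for the example.

```r
# Hypothetical registry in which two packages claim the same extension.
registry <- data.frame(
    class = c("BamFileList", "OtherBamList"),
    package = c("Rsamtools", "otherPkg"),
    regex = c("\\.bam$", "\\.bam$"),
    stringsAsFactors = FALSE
)

# Hypothetical lookup: match on regex, optionally narrowed by package name.
lookupFileType <- function(file, package = NULL) {
    keep <- unname(vapply(registry$regex, grepl, logical(1), x = file))
    if (!is.null(package))
        keep <- keep & registry$package == package
    registry$class[keep]
}

ambiguous <- lookupFileType("foo.bam")
resolved <- lookupFileType("foo.bam", package = "Rsamtools")
```

With the regex alone both classes come back; supplying the package name leaves a single candidate.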
Valerie
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: voben...@fhcrc.org
Phone: (206) 667-3158
Fax: (206) 667-1319