On 12/09/2013 10:50 AM, Kasper Daniel Hansen wrote:
I agree with Michael. I don't think we want to deprive ourselves of good approaches by a need for supporting Windows. Especially in a case like this where on-disk representation is optional.
I agree mmap is appealing. I just didn't want to have to depend on it in XVector, which is at the bottom of the package stack. For now my focus/interest is more on the OnDiskVector concept/API. Specific storage back-ends can be implemented as concrete subclasses. There are already 2 of them (DirectRaw and SerializedRaw). Others can be added for mmap and HDF5 for example. They don't necessarily have to be implemented in XVector.

H.
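[Editorial illustration] The idea of one on-disk vector API with interchangeable storage back-ends can be sketched roughly as follows. This is only a minimal sketch in Python, not XVector code: the names OnDiskBytes, SeekBackend and MmapBackend are hypothetical, and the file is assumed to hold the raw bytes directly (no serialization header). Both back-ends expose the same random-access read(), so calling code does not need to know whether mmap is involved.

```python
import mmap
import os
import tempfile
from abc import ABC, abstractmethod


class OnDiskBytes(ABC):
    """Hypothetical interface: a read-only byte vector stored in a file."""

    def __init__(self, path):
        self.path = path

    def __len__(self):
        # Assumption: the file holds the raw bytes and nothing else.
        return os.path.getsize(self.path)

    def _check_range(self, start, length):
        if start < 0 or length < 0 or start + length > len(self):
            raise IndexError("requested range is outside the vector")

    @abstractmethod
    def read(self, start, length):
        """Return bytes [start, start + length) without loading the whole file."""


class SeekBackend(OnDiskBytes):
    """Back-end that uses plain seek + read on the file."""

    def read(self, start, length):
        self._check_range(start, length)
        with open(self.path, "rb") as fh:
            fh.seek(start)
            return fh.read(length)


class MmapBackend(OnDiskBytes):
    """Back-end that maps the file into memory and slices the mapping."""

    def read(self, start, length):
        self._check_range(start, length)
        with open(self.path, "rb") as fh:
            # Length 0 maps the whole file; this fails on an empty file.
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return mm[start:start + length]


def demo():
    """Read the same subsequence through both back-ends."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ACGTACGTTTGACA")
        path = tmp.name
    try:
        return [backend(path).read(4, 6) for backend in (SeekBackend, MmapBackend)]
    finally:
        os.remove(path)
```

Calling demo() reads six bytes starting at offset 4 through each back-end; the two results should be identical, which is the point of keeping the back-end behind a common interface. A further back-end (HDF5, say) would be one more subclass implementing read().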
Kasper

On Mon, Dec 9, 2013 at 1:46 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

On Mon, Dec 9, 2013 at 9:30 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:

> On 12/09/2013 05:39 AM, Michael Lawrence wrote:
>
>> Any thoughts about using mmap(), so that SharedRaw and OnDiskRaw just
>> operate on a pointer as the abstraction?
>
> Martin mentioned mmap to me for this project but I had some concerns
> about Windows compatibility. Are there CRAN or BioC packages that use
> it? Would be interesting to have a look at them.

bigmemory is a CRAN package, and it is extended by bigmemoryExtras in
Bioconductor. No Windows version available, of course. But seriously, who
uses Windows to crunch data? Easy enough to fall back to the in-memory
implementation.

> H.
>
>> Michael
>>
>> On Sun, Dec 8, 2013 at 11:39 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:
>>
>> Hi Michael,
>>
>> The OnDiskXRaw virtual class (if this is what you're referring to)
>> is still a very early work-in-progress. The idea is to experiment
>> with on-disk representation of atomic vectors and direct random access
>> to subsequences of the vector. The exact storage mode is implemented by
>> concrete subclasses (currently only DirectRaw and SerializedRaw).
>> OnDiskXRaw is actually analogous to SharedRaw except that with the latter
>> the "shared" sequence of bytes resides in memory.
>>
>> If we had "on-disk" support for all atomic vectors, it sounds like it
>> would then be easy to support "on-disk" versions of higher-level
>> objects like IRanges or GRanges. They would be defined as their
>> "in-memory" counterpart except that the slots that are atomic vectors
>> in the "in-memory" version would just need to be replaced by "on-disk"
>> atomic vectors. "On-disk" versions of DNAString (and even DNAStringSet)
>> objects could also easily be implemented e.g.
>> by just making the "shared" slot an OnDiskXRaw object instead of a
>> SharedRaw object.
>>
>> Putting SharedRaw and OnDiskXRaw under the same umbrella (i.e. under
>> a virtual class) and using that virtual class to specify the slot of
>> higher-level objects like DNAString is tempting but realistically we
>> don't operate on an on-disk object like we do on an in-memory object.
>>
>> Having an "on-disk" version of DNAString with direct random access was
>> in fact the initial motivation for OnDiskXRaw. The use case for this
>> was to support direct random access in BSgenome objects without having
>> to change the way the chromosomes are stored on disk (they're stored
>> as serialized raw vectors). I've finally implemented this feature (will
>> soon be pushed to BioC devel) but I changed the storage and didn't use
>> OnDiskXRaw in the end.
>>
>> H.
>>
>> On 12/05/2013 06:43 AM, Michael Lawrence wrote:
>>
>> A nice goal for the XVector package would be full implementation of the R
>> vector API on top of the already existing memory-sharing (rather than
>> memory-duplicating) data structures. The actual storage mode of the data
>> should obviously be abstracted, e.g., on-disk should be treated the same
>> as the externalptr representation. Much of the implementation will need to
>> be in C, unless we want to pay the price of extracting things into ordinary
>> R vectors. Should the abstraction be therefore dropped down to the C level,
>> so that the implementations can more easily share from each other? Anything
>> to gain here from the externalVector package?
-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel