Hi everyone—

I’m working with a team that’s generating single-cell ATAC (scATAC) data in 
large volumes, and I’m designing the framework of an S4 object to facilitate 
analyses in R. I have a couple of high-level questions that I wanted to pose 
early, in hopes of getting some community guidance on implementing these 
data structures. 


Question on S4 scATAC Structure--
It’s easy to imagine scATAC data as a matrix where the rows are particular 
peaks and the columns are individual samples. We already have a large enough 
volume of data that, stored as an ordinary dense matrix, we run into ~20 GB 
objects. As these data are very sparse, we store the peak values in a sparse 
matrix (through the Matrix package). I want to collate the peak information 
(probably in a GRanges object) and sample information (in a data frame), as 
well as some potential metadata, in an S4 object.
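
For concreteness, here’s roughly the skeleton I have so far (a minimal 
sketch; the class and slot names are placeholders rather than settled 
choices):

library(Matrix)
library(GenomicRanges)

## Placeholder class: sparse peaks-x-samples counts plus annotation.
setClass("scATAC",
    slots = c(
        counts   = "dgCMatrix",  # sparse matrix from the Matrix package
        peaks    = "GRanges",    # per-row (peak) annotation
        samples  = "data.frame", # per-column (sample) annotation
        metadata = "list"        # experiment-level metadata
    )
)

## Keep the dimensions of the three pieces in sync.
setValidity("scATAC", function(object) {
    if (nrow(object@counts) != length(object@peaks))
        return("nrow(counts) must equal length(peaks)")
    if (ncol(object@counts) != nrow(object@samples))
        return("ncol(counts) must equal nrow(samples)")
    TRUE
})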

Easy enough, sure, but after looking at the scRNA-seq structures (e.g. scater 
<https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette.html>),
I feel like I should be considering how to inherit some of the nice properties 
of the canonical `ExpressionSet` structure. However, since my constraints 
aren’t directly compatible (namely, the featureData slot really needs to be a 
GRanges, and the exprs slot must be an object from Matrix), it wasn’t clear to 
me how to get the most out of inheritance while accommodating those 
constraints. It also wasn’t clear to me whether I could inherit from 
`SummarizedExperiment`, given the different nature of the sparse matrix. Does 
anyone have advice on this structure? 
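
One data point: from a quick experiment, SummarizedExperiment appears to only 
require that its assays be matrix-like (supporting dim() and dimnames()), so 
a dgCMatrix may slot in directly, with the GRanges going into rowRanges. I 
haven’t stress-tested this, so corrections are welcome:

library(Matrix)
library(SummarizedExperiment)

## Fake sparse peaks-x-cells counts and peak coordinates.
counts <- rsparsematrix(nrow = 1000, ncol = 50, density = 0.01)
peaks  <- GRanges("chr1",
                  IRanges(start = seq(1, by = 500, length.out = 1000),
                          width = 250))

se <- SummarizedExperiment(
    assays    = list(counts = counts),
    rowRanges = peaks,                 # GRanges as the feature data
    colData   = DataFrame(sample = paste0("cell", seq_len(50)))
)

class(assay(se))   # still a dgCMatrix?
se[1:10, 1:5]      # subsetting should carry the sparsity along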


Question on reading sparse matrices from disk--
I’m trying to work out the best way to selectively read certain rows and 
columns of a sparse matrix on disk into memory. I anticipate that fairly 
soon, loading our full scATAC data will be untenable even in sparse form. 
The matrix reading/slicing implementations I’ve seen don’t play nicely with 
sparse matrices. So, I hacked together two solutions (sketched below): 1) 
read and subset a gzipped matrix with 3 columns (row index; column index; 
non-zero value) through a system call to awk; 2) convert that same 3-column 
matrix into an SQLite database and send queries to read values based on 
indices. The hiccups are that 1) doesn’t work on non-Unix platforms and 
always scans the full file, while 2) is faster for querying but produces a 
binary file ~7x larger than the gzipped one. I’ve played around with HDF5 as 
well, but it didn’t seem to give me much back in terms of speed or storage 
compared with these. Has anyone found an implementation that achieves both 
decent lookup time and good compression, or do I essentially need to choose 
between the two?
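
In case concrete code helps the critique, here are stripped-down versions of 
both hacks (the file, table, and column names are just what I picked for 
illustration). Approach 1, the awk filter (Unix-only, always a full scan):

## Pull triplets for rows 100-200 and columns 1-50 from the gzipped
## 3-column file; hypothetical file name and index ranges.
cmd <- "zcat triplets.txt.gz | awk '$1 >= 100 && $1 <= 200 && $2 <= 50'"
res <- read.table(text = system(cmd, intern = TRUE),
                  col.names = c("i", "j", "x"))

Approach 2, the SQLite lookup, which rebuilds a sparse submatrix in memory 
from the queried triplets:

library(DBI)
library(RSQLite)
library(Matrix)

con <- dbConnect(SQLite(), "scatac_triplets.sqlite")
## One-time setup after import (this is what buys the lookup speed):
## dbExecute(con, "CREATE INDEX idx_ij ON triplets (i, j)")

read_submatrix <- function(con, rows, cols) {
    res <- dbGetQuery(con, sprintf(
        "SELECT i, j, x FROM triplets WHERE i IN (%s) AND j IN (%s)",
        paste(rows, collapse = ","), paste(cols, collapse = ",")))
    ## Remap full-matrix indices to positions within the requested slice.
    sparseMatrix(i = match(res$i, rows),
                 j = match(res$j, cols),
                 x = res$x,
                 dims = c(length(rows), length(cols)))
}

sub <- read_submatrix(con, rows = c(10, 42, 99), cols = 1:20)
dbDisconnect(con)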



Thanks and have a great weekend!
-Caleb