Maybe off topic, my work environment deals with datasets sized 20~200 Giga Bytes, consisting of small time series arrays mostly, I offload the compression (and dedup) work to ZFS (by deploying a SmartOS storage server managing a dozen of spinning disks with several TB capacity.

Many computing nodes run a local FUSE mount viewing those data files over network, as if being part of a virtual large data file in local filesystem, and access (mostly reads, small fraction of writes) the data via mmap. This way, parallel processes run on multi CPU cores of a single computing node share the OS' kernel page for cache of the dataset, a program just assumes random access to the whole dataset as available at somewhere within its virtual address space.

Giving just enough physical RAM (in order to prevent thrashing) to the storage server and computing nodes (my env currently have a typical size of 128GB per node), this achieves both simplicity of programming and efficient use of processor/memory/storage resources.

This architecture should scale well to datasets of a few TBs.

On 2020/3/16 上午3:46, Olaf Klinke wrote:
By the way, there are tools to retrieve a certain range from compressed data, 
which IMHO is a very cool feature of gzip.

https://www.htslib.org/doc/bgzip.html
https://www.htslib.org/doc/tabix.html

Bioinformaticians use it (among other things) for fast retrieval of genomic 
annotation from data sets with on the order of 10^9 keys (in case of human 
genome). Would be nice if someone wrote a Haskell binding.

Olaf

Hello,

I have a question regarding 'Data.Map' api, filed an issue
https://github.com/haskell/containers/issues/708

And may be I can ask here at the same time?

I'm not sure why|Data.Map|doesn't have a key range based visiting API, I
figured out I can do it this way:

|indexKeyRange :: IndexKey -> IndexKey -> Map IndexKey Object ->
[(IndexKey, Object)] indexKeyRange !minKey !maxKey = toList .
takeWhileAntitone (<= maxKey) . dropWhileAntitone (< minKey) |

But wouldn't it save the computation needed to re-balance the
intermediate tree generated ? Or that re-balancing can be optimized out
actually ?

I am creating an in-memory graph database, using|Data.Map.Strict.Map|as
business object indices with specified object attributes. The typical
scenario will be querying a small number of entries by key range, out of
possibly all business objects of a certain class globally, so the
implementation above would work, but not reasonable by far as it seems.

I think a lazy list returned by mere node visiting (i.e. no new node
creation) would satisfy my needs, or I missed something ?


Thanks,

Compl
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell

Reply via email to