Re: [Biohaskell] [Haskell-cafe] Data.Map - visiting tree nodes within a given key range ?

Compl Yue Mon, 16 Mar 2020 03:17:22 -0700

Maybe off topic: my work environment deals with datasets of 20 to 200 GB, consisting mostly of small time series arrays. I offload the compression (and dedup) work to ZFS, by deploying a SmartOS storage server that manages a dozen spinning disks with several TB capacity.

Many computing nodes each run a local FUSE mount that views those data files over the network, as if they were part of one large virtual data file in the local filesystem, and access the data (mostly reads, a small fraction of writes) via mmap. This way, parallel processes running on the multiple CPU cores of a single computing node share the OS kernel's page cache for the dataset, and a program simply assumes random access to the whole dataset, available somewhere within its virtual address space.

Given just enough physical RAM (to prevent thrashing) on the storage server and the computing nodes (my environment currently has a typical size of 128GB per node), this achieves both simplicity of programming and efficient use of processor/memory/storage resources.


This architecture should scale well to datasets of a few TBs.

On 2020/3/16 3:46 AM, Olaf Klinke wrote:

By the way, there are tools to retrieve a certain range from compressed data, 
which IMHO is a very cool feature of gzip.

https://www.htslib.org/doc/bgzip.html
https://www.htslib.org/doc/tabix.html

Bioinformaticians use it (among other things) for fast retrieval of genomic 
annotation from data sets with on the order of 10^9 keys (in case of human 
genome). Would be nice if someone wrote a Haskell binding.

Olaf

Hello,

I have a question regarding the 'Data.Map' API, and have filed an issue:
https://github.com/haskell/containers/issues/708

Maybe I can ask here at the same time?

I'm not sure why Data.Map doesn't have a key-range-based visiting API. I
figured out I can do it this way:

    indexKeyRange :: IndexKey -> IndexKey -> Map IndexKey Object -> [(IndexKey, Object)]
    indexKeyRange !minKey !maxKey =
      toList . takeWhileAntitone (<= maxKey) . dropWhileAntitone (< minKey)

But wouldn't a dedicated API save the computation needed to re-balance the
intermediate trees this generates? Or can that re-balancing actually be
optimized out?
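
One way to avoid building intermediate trees while staying within the public API is to step through the range with lookupGE and lookupGT. The following is a minimal sketch, assuming a containers version that provides both functions; rangeByLookup and example are hypothetical names used for illustration, not part of containers. Each step is a fresh O(log n) lookup on the original map, so this is aimed at small result sets.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: lazily lists all entries with minKey <= key <= maxKey.
-- No new tree is built; every step looks up the next key in the original map.
rangeByLookup :: Ord k => k -> k -> Map.Map k v -> [(k, v)]
rangeByLookup minKey maxKey m = go (Map.lookupGE minKey m)
  where
    -- Emit the entry while it is still inside the range, then continue
    -- with the smallest key strictly greater than the current one.
    go (Just kv@(k, _))
      | k <= maxKey = kv : go (Map.lookupGT k m)
    go _ = []

-- Illustrative data only.
example :: [(Int, String)]
example = rangeByLookup 2 6 (Map.fromList [(1, "a"), (3, "c"), (5, "e"), (7, "g")])
-- example == [(3,"c"),(5,"e")]
```

Because the result list is produced lazily, a consumer that only needs the first few entries pays only for those lookups.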

I am creating an in-memory graph database, using Data.Map.Strict.Map as
indices of business objects keyed by specified object attributes. The typical
scenario will be querying a small number of entries by key range, out of
possibly all business objects of a certain class globally, so the
implementation above would work, but it seems far from reasonable.

I think a lazy list returned by merely visiting nodes (i.e. without creating
new nodes) would satisfy my needs, or am I missing something?
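
Such a node-visiting traversal can be sketched by pattern matching on the Bin and Tip constructors exported from Data.Map.Internal. This is only an illustration under stated assumptions: Data.Map.Internal is not a stable interface and may change between containers releases, and rangeToList and example are hypothetical names, not an existing API. The traversal prunes subtrees that lie entirely outside the range and allocates only list cells, never map nodes.

```haskell
import Data.Map.Internal (Map (..))
import qualified Data.Map.Strict as Map

-- Hypothetical helper: in-order traversal restricted to lo <= key <= hi.
-- Assumption: the Bin/Tip constructors from the unstable Data.Map.Internal.
rangeToList :: Ord k => k -> k -> Map k v -> [(k, v)]
rangeToList lo hi m = go m []
  where
    go Tip rest = rest
    go (Bin _ k v l r) rest
      | k < lo    = go r rest                     -- left subtree is below the range
      | k > hi    = go l rest                     -- right subtree is above the range
      | otherwise = go l ((k, v) : go r rest)     -- visit left, this node, then right

-- Illustrative data only.
example :: [(Int, String)]
example = rangeToList 2 6 (Map.fromList [(1, "a"), (3, "c"), (5, "e"), (7, "g")])
-- example == [(3,"c"),(5,"e")]
```

The result is a lazy list in ascending key order, so the right-hand part of the range is only visited if the consumer demands it.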


Thanks,

Compl

_______________________________________________
Biohaskell mailing list
[email protected]
http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell
