Thanks for the response. That sounds sequence comparison seems very impressive
On Wed, Oct 6, 2010 at 2:23 PM, Ketil Malde <ke...@malde.org> wrote: > Hemanth Kapila <saihema...@gmail.com> writes: > > > Let us say, we are using a bit-array of size 2^43 (that is, a byte array > of > > size 2^40) to store a bloom filter. And let us further assume that we are > > interested in a false-positive probability of 0.01 > > Since we are just making up numbers, let us instead say we are using a > bit array of size 2^32 - still too large for Int indexing (which was the > issue here) but "only" 500MB in size. > > > That means, I will be able to use this array to represent a set of > > cardinality 9.18e11 ~ 10^12 > > ...bringing it down to less than 10^9, easily reached for building an > indexing of k-words (tuples) in e.g. the human genome (3GB). > > But: I'm about to start analyzing Illumina sequencing data, where we have > sequences from two individuals of the same species. I'm interesting in > the differences between these species, so I might want to index both > data sets and screen the sequences of each against the other to identify > bits that don't appear in the other. Since I'll have easily tens, maybe > a hundred gigabytes of sequence from each, it's not impossible to get > into the territory you describe. > > (In practice, things will be limited by available RAM, sadly still some > orders of magnitude less than 2^40 - although apparently SGI can deliver > it. I guess I just need a bigger budget.) > > > I was curious to know what sort of programs would be dealing with sets of > > 10^12 elements. > > Most likely a program using mutation, and probably not copying, > multi-generation GC. > > Other data sets that have considerable size are acoustics data (I > understand a modern sonar deliver about a Gbit/sec continously), and > seismic data. > > Also, for more moderate bodies of data and more complex analysis, you > might want to use less frugal data structure, like suffix arrays. > > -k > -- > If I haven't seen further, it is by standing in the footprints of giants > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe >
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe