Ah - I see. It's generating the same timestamp several times per millisecond, so there are fewer than 50 million unique strings. Duplicates just require incrementing a counter. Agree it's very cool though!
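To make the arithmetic concrete, here is a minimal sketch of the point about duplicates. It is not the pastebin code: the class name `TimestampKeys`, the start date, and the `recordsPerMilli` parameter are all made up for illustration, and it only counts distinct keys rather than building an FST.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.Set;

class TimestampKeys {
    // Key pattern from the thread: yyyyMMddHHmmssSSS, always 17 characters.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Arbitrary start time, chosen only for illustration.
    static final LocalDateTime START = LocalDateTime.of(2011, 6, 3, 0, 0, 0);

    // Hypothetical generator: recordsPerMilli records share each millisecond.
    static String keyFor(long recordIndex, long recordsPerMilli) {
        long millis = recordIndex / recordsPerMilli;
        return START.plus(millis, ChronoUnit.MILLIS).format(FORMAT);
    }

    // Counts how many distinct key strings the generator produces.
    static int countUniqueKeys(long records, long recordsPerMilli) {
        Set<String> unique = new HashSet<>();
        for (long i = 0; i < records; i++) {
            unique.add(keyFor(i, recordsPerMilli));
        }
        return unique.size();
    }

    // Raw input size in bytes: 17 bytes per key, one key per record.
    static long rawBytes(long records) {
        return 17L * records;
    }

    public static void main(String[] args) {
        System.out.println(rawBytes(50_000_000L));      // 850000000, i.e. 850MB
        System.out.println(countUniqueKeys(1_000, 1));  // 1000: one key per record
        System.out.println(countUniqueKeys(1_000, 10)); // 100: 10 records share each key
    }
}
```

So if several records land on the same millisecond, the number of distinct strings is the record count divided by that factor, which is well under the raw 50 million.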
sent from my phone
On Jun 3, 2011 9:02 PM, "Jason Rutherglen" <jason.rutherg...@gmail.com> wrote:
> Yeah it's truly super wild!  Here's the code: http://pastebin.com/bnB53UQz
>
> You can see the line that's adding the string:
>
> fstBuilder.add(new BytesRef(date), new Long(x));
>
> On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>> Jason - are you feeding it that whole string for each date?  Input data is
>> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes?
>> Is it possible to compress by that much?  Maybe I'm missing something about
>> how the FST works.
>>
>> Matt
>>
>>
>> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>
>>> Also the next thing to measure with the FST is the key lookup speed.
>>> I'm not sure what that'd look like, or how to compare with HBase right
>>> now?
>>>
>>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen
>>> <jason.rutherg...@gmail.com> wrote:
>>> > Here's a nice preliminary number with the FST, 50 million dates of the
>>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond.  The
>>> > FST is 984 bytes, with an incrementing long to point to the presumably
>>> > MMap'd value data.  This's a bit crazy.
>>> >
>>> > Perhaps we should try other increments as well?  Given that HBase keys
>>> > especially are probably close increments of each other, I think the
>>> > FST can always be loaded into RAM with pointers out to the actual
>>> > values.
>>> >
>>>
>>