Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected
On Sat, Apr 11, 2009 at 9:53 PM, Prashant Malik wrote: > The results are a bit counter intuitive here I would have expected it to be > faster with the same FP rate but I am not sure why it is slower if you are > just using a couple of hash functions and using double hashing. Murmur is a higher-quality hash and takes more operations to achieve its better key distribution. But since the new implementation always uses two calls to Murmur no matter how many hashes are needed it is virtually constant time. > I am sorry I haven't looked at the test code but have you tried it with > large strings as keys ? e.g 128 byte keys , also with Longs. The random strings generated are 128 bytes. -Jonathan
Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected
The results are a bit counter intuitive here I would have expected it to be faster with the same FP rate but I am not sure why it is slower if you are just using a couple of hash functions and using double hashing. We had done these experiments a while back and found results matching with the theoretical tables. I am sorry I haven't looked at the test code but have you tried it with large strings as keys ? e.g 128 byte keys , also with Longs. But this is an interesting exercise for sure :). Thanks Prashant On Sat, Apr 11, 2009 at 4:31 PM, Sandeep Tata (JIRA) wrote: > >[ > https://issues.apache.org/jira/browse/CASSANDRA-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698145#action_12698145] > > Sandeep Tata commented on CASSANDRA-68: > --- > > I ran some more tests, here's what I found: > > Old code: > Test MAX_FP Actual FP > FalsePositivesInt/Random 0.10.142 > FalsePositivesInt/Random 0.01 (pass) > FalsePositivesInt/Random 0.001 (pass) > Words 0.10.15 > Words 0.01 (pass) > Words 0.0010.0013 > > New code: > > FalsePositivesInt/Random 0.1 (pass) > FalsePositivesInt/Random 0.01 (pass) > FalsePositivesInt/Random 0.001 (pass) > Words 0.1(pass) > Words 0.01 (pass) > Words 0.001(pass) > > The old bloomfilter certainly reports up to 50% more than expected false > positive rates for some cases. The new bloomfilter is more predictable, it > always passes. > > On my machine, some quick-n-crude tests show that the new bloom-filter is > about 4x slower. (I tested at FP rate = 0.01) . When you take into account > the fact that the penalty for a false positive at least 3 orders of > magnitude more expensive than the actual hash calculation (an FP usually > means you'll end up hitting disk unnecessarily), it makes sense to use it > even when you set the FP rate to 0.001. It is even more useful at higher > rates. > > > Bloom filters have much higher false-positive rate than expected > > > > > > Key: CASSANDRA-68 > > URL: https://issues.apache.org/jira/browse/CASSANDRA-68 > > Project: Cassandra > > Issue Type: Bug > >Reporter: Jonathan Ellis > >Assignee: Jonathan Ellis > > Fix For: 0.3 > > > > Attachments: > 0001-r-m-unused-code-including-entire-CountingBloomFilte.patch, > 0002-replace-JenkinsHash-w-MurmurHash.-its-hash-distrib.patch, > 0003-rename-BloomFilter.fill-add.patch, > 0004-rewrite-bloom-filters-to-use-murmur-hash-and-combina.patch, > 0004-v3.patch, 0004a-tests.patch, 0004b-code.patch, > 0005-switch-back-to-old-hash-generation-code-to-demonstra.patch, > fp_test_for_old_code.patch, fp_test_for_old_code_v2.patch, words.gz > > > > > > Gory details: > http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > >
Re: could cassandra be split into two parts?
We have talked about this possibility for a while of plugging in various stores into cassandra ,it should definitely be possible. The top layer replication semantics would be handled by cassandra but components like conflict resolution etc might have to be built for each storage engine. Overall this would be a good split. - Prashant On Sat, Apr 11, 2009 at 6:58 PM, Ian Holsman wrote: > > On 12/04/2009, at 11:44 AM, Sandeep Tata wrote: > > Depends on what exactly you have in mind ... >> >> Almost all of the storage engine logic is in the db package. I don't >> think it would be too hard to make this pluggable so you could slide >> in your own DB, say based on Derby/MySQL/BDB etc... I can see how >> specialized implementations of the database part could be useful for >> different apps. >> >> Do you expect that the API will still be the same put/get style thrift >> API ? Or are you hoping to expose the additional abilities of the >> underlying db through the thrift API ? That makes the question more >> interesting (and complicated). >> > > initially it could be done via the put/get api, as most things would work > that way (that I envisage). > but it would be nice to be able to be able to have custom API's implemented > via thrift, and having cassandra > just route the api to their required server and just pass it through. kind > of like a proxy. > > eg. > execSQL("cat", "select foo from bar where id='cat'") would use "cat" as the > key and route that to the appropriate mysql engine. > > I would hope cassandra could handle the replication component of it, not > mysql. > > > > >> On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman wrote: >> >>> hey. >>> >>> I was wondering how feasible it would be to de-couple the P2P layer of >>> cassandra from the storage engine. >>> I'd like to be able to plug in a non-column DB underneath, and use the >>> DHT >>> layer of cassandra. >>> >>> Is this something anyone else has considered doing? >>> -- >>> Ian Holsman >>> i...@holsman.net >>> >>> >>> >>> >>> > -- > Ian Holsman > i...@holsman.net > > > >
Re: could cassandra be split into two parts?
On 12/04/2009, at 11:44 AM, Sandeep Tata wrote: Depends on what exactly you have in mind ... Almost all of the storage engine logic is in the db package. I don't think it would be too hard to make this pluggable so you could slide in your own DB, say based on Derby/MySQL/BDB etc... I can see how specialized implementations of the database part could be useful for different apps. Do you expect that the API will still be the same put/get style thrift API ? Or are you hoping to expose the additional abilities of the underlying db through the thrift API ? That makes the question more interesting (and complicated). initially it could be done via the put/get api, as most things would work that way (that I envisage). but it would be nice to be able to be able to have custom API's implemented via thrift, and having cassandra just route the api to their required server and just pass it through. kind of like a proxy. eg. execSQL("cat", "select foo from bar where id='cat'") would use "cat" as the key and route that to the appropriate mysql engine. I would hope cassandra could handle the replication component of it, not mysql. On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman wrote: hey. I was wondering how feasible it would be to de-couple the P2P layer of cassandra from the storage engine. I'd like to be able to plug in a non-column DB underneath, and use the DHT layer of cassandra. Is this something anyone else has considered doing? -- Ian Holsman i...@holsman.net -- Ian Holsman i...@holsman.net
Re: could cassandra be split into two parts?
I implemented the pluggable storage layer into Dynamo. It really depends on what you want to establish. But theoritically it should be possible. Avinash On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman wrote: > hey. > > I was wondering how feasible it would be to de-couple the P2P layer of > cassandra from the storage engine. > I'd like to be able to plug in a non-column DB underneath, and use the DHT > layer of cassandra. > > Is this something anyone else has considered doing? > -- > Ian Holsman > i...@holsman.net > > > >
Re: could cassandra be split into two parts?
Depends on what exactly you have in mind ... Almost all of the storage engine logic is in the db package. I don't think it would be too hard to make this pluggable so you could slide in your own DB, say based on Derby/MySQL/BDB etc... I can see how specialized implementations of the database part could be useful for different apps. Do you expect that the API will still be the same put/get style thrift API ? Or are you hoping to expose the additional abilities of the underlying db through the thrift API ? That makes the question more interesting (and complicated). On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman wrote: > hey. > > I was wondering how feasible it would be to de-couple the P2P layer of > cassandra from the storage engine. > I'd like to be able to plug in a non-column DB underneath, and use the DHT > layer of cassandra. > > Is this something anyone else has considered doing? > -- > Ian Holsman > i...@holsman.net > > > >
could cassandra be split into two parts?
hey. I was wondering how feasible it would be to de-couple the P2P layer of cassandra from the storage engine. I'd like to be able to plug in a non-column DB underneath, and use the DHT layer of cassandra. Is this something anyone else has considered doing? -- Ian Holsman i...@holsman.net
Hudson build is back to normal: Cassandra #29
See http://hudson.zones.apache.org/hudson/job/Cassandra/29/changes