Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected

2009-04-11 Thread Jonathan Ellis
On Sat, Apr 11, 2009 at 9:53 PM, Prashant Malik  wrote:
> The results are a bit counter intuitive here I would have expected it to be
> faster with the same FP rate but   I am not sure why it is slower if you are
> just using a couple of hash functions and using double hashing.

Murmur is a higher-quality hash and takes more operations to achieve
its better key distribution.  But since the new implementation always
uses two calls to Murmur no matter how many hashes are needed it is
virtually constant time.

> I am sorry I haven't looked at the test code but have you tried it with
> large strings as keys ? e.g 128 byte keys , also with Longs.

The random strings generated are 128 bytes.

-Jonathan


Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected

2009-04-11 Thread Prashant Malik
The results are a bit counter intuitive here I would have expected it to be
faster with the same FP rate but   I am not sure why it is slower if you are
just using a couple of hash functions and using double hashing.

We had done these experiments a while back and found results matching with
the theoretical tables.

I am sorry I haven't looked at the test code but have you tried it with
large strings as keys ? e.g 128 byte keys , also with Longs.

But this is an interesting exercise for sure :).

Thanks
Prashant

On Sat, Apr 11, 2009 at 4:31 PM, Sandeep Tata (JIRA) wrote:

>
>[
> https://issues.apache.org/jira/browse/CASSANDRA-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698145#action_12698145]
>
> Sandeep Tata commented on CASSANDRA-68:
> ---
>
> I ran some more tests, here's what I found:
>
> Old code:
> Test MAX_FP Actual FP
> FalsePositivesInt/Random 0.10.142
> FalsePositivesInt/Random 0.01 (pass)
> FalsePositivesInt/Random 0.001   (pass)
> Words 0.10.15
> Words 0.01  (pass)
> Words 0.0010.0013
>
> New code:
>
> FalsePositivesInt/Random 0.1   (pass)
> FalsePositivesInt/Random 0.01 (pass)
> FalsePositivesInt/Random 0.001   (pass)
> Words 0.1(pass)
> Words 0.01  (pass)
> Words 0.001(pass)
>
> The old bloomfilter certainly reports up to 50% more than expected false
> positive rates for some cases. The new bloomfilter is more predictable, it
> always passes.
>
> On my machine, some quick-n-crude tests show that the new bloom-filter is
> about 4x slower. (I tested at FP rate = 0.01) . When you take into account
> the fact that the penalty for a false positive at least 3 orders of
> magnitude more expensive than the actual hash calculation (an FP usually
> means you'll end up hitting disk unnecessarily), it makes sense to use it
> even when you set the FP rate to 0.001. It is even more useful at higher
> rates.
>
> > Bloom filters have much higher false-positive rate than expected
> > 
> >
> > Key: CASSANDRA-68
> > URL: https://issues.apache.org/jira/browse/CASSANDRA-68
> > Project: Cassandra
> >  Issue Type: Bug
> >Reporter: Jonathan Ellis
> >Assignee: Jonathan Ellis
> > Fix For: 0.3
> >
> > Attachments:
> 0001-r-m-unused-code-including-entire-CountingBloomFilte.patch,
> 0002-replace-JenkinsHash-w-MurmurHash.-its-hash-distrib.patch,
> 0003-rename-BloomFilter.fill-add.patch,
> 0004-rewrite-bloom-filters-to-use-murmur-hash-and-combina.patch,
> 0004-v3.patch, 0004a-tests.patch, 0004b-code.patch,
> 0005-switch-back-to-old-hash-generation-code-to-demonstra.patch,
> fp_test_for_old_code.patch, fp_test_for_old_code_v2.patch, words.gz
> >
> >
> > Gory details:
> http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


Re: could cassandra be split into two parts?

2009-04-11 Thread Prashant Malik
We have talked about this possibility for a while of plugging in various
stores into cassandra ,it should definitely be possible.
The top layer replication semantics would be handled by cassandra but
components like conflict resolution etc might have to be built for each
storage engine.

Overall this would be a good split.

- Prashant

On Sat, Apr 11, 2009 at 6:58 PM, Ian Holsman  wrote:

>
> On 12/04/2009, at 11:44 AM, Sandeep Tata wrote:
>
>  Depends on what exactly you have in mind ...
>>
>> Almost all of the storage engine logic is in the db package. I don't
>> think it would be too hard to make this pluggable so you could slide
>> in your own DB, say based on Derby/MySQL/BDB etc... I can see how
>> specialized implementations of the database part could be useful for
>> different apps.
>>
>> Do you expect that the API will still be the same put/get style thrift
>> API ? Or are you hoping to expose the additional abilities of the
>> underlying db through the thrift API ? That makes the question more
>> interesting (and complicated).
>>
>
> initially it could be done via the put/get api, as most things would work
> that way (that I envisage).
> but it would be nice to be able to be able to have custom API's implemented
> via thrift, and having cassandra
> just route the api to their required server and just pass it through. kind
> of like a proxy.
>
> eg.
> execSQL("cat", "select foo from bar where id='cat'") would use "cat" as the
> key and route that to the appropriate mysql engine.
>
> I would hope cassandra could handle the replication component of it, not
> mysql.
>
>
>
>
>> On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman  wrote:
>>
>>> hey.
>>>
>>> I was wondering how feasible it would be to de-couple the P2P layer of
>>> cassandra from the storage engine.
>>> I'd like to be able to plug in a non-column DB underneath, and use the
>>> DHT
>>> layer of cassandra.
>>>
>>> Is this something anyone else has considered doing?
>>> --
>>> Ian Holsman
>>> i...@holsman.net
>>>
>>>
>>>
>>>
>>>
> --
> Ian Holsman
> i...@holsman.net
>
>
>
>


Re: could cassandra be split into two parts?

2009-04-11 Thread Ian Holsman


On 12/04/2009, at 11:44 AM, Sandeep Tata wrote:


Depends on what exactly you have in mind ...

Almost all of the storage engine logic is in the db package. I don't
think it would be too hard to make this pluggable so you could slide
in your own DB, say based on Derby/MySQL/BDB etc... I can see how
specialized implementations of the database part could be useful for
different apps.

Do you expect that the API will still be the same put/get style thrift
API ? Or are you hoping to expose the additional abilities of the
underlying db through the thrift API ? That makes the question more
interesting (and complicated).


initially it could be done via the put/get api, as most things would  
work that way (that I envisage).
but it would be nice to be able to be able to have custom API's  
implemented via thrift, and having cassandra
just route the api to their required server and just pass it through.  
kind of like a proxy.


eg.
execSQL("cat", "select foo from bar where id='cat'") would use "cat"  
as the key and route that to the appropriate mysql engine.


I would hope cassandra could handle the replication component of it,  
not mysql.





On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman  wrote:

hey.

I was wondering how feasible it would be to de-couple the P2P layer  
of

cassandra from the storage engine.
I'd like to be able to plug in a non-column DB underneath, and use  
the DHT

layer of cassandra.

Is this something anyone else has considered doing?
--
Ian Holsman
i...@holsman.net






--
Ian Holsman
i...@holsman.net





Re: could cassandra be split into two parts?

2009-04-11 Thread Avinash Lakshman
I implemented the pluggable storage layer into Dynamo. It really depends on
what you want to establish. But theoritically it should be possible.
Avinash

On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman  wrote:

> hey.
>
> I was wondering how feasible it would be to de-couple the P2P layer of
> cassandra from the storage engine.
> I'd like to be able to plug in a non-column DB underneath, and use the DHT
> layer of cassandra.
>
> Is this something anyone else has considered doing?
> --
> Ian Holsman
> i...@holsman.net
>
>
>
>


Re: could cassandra be split into two parts?

2009-04-11 Thread Sandeep Tata
Depends on what exactly you have in mind ...

Almost all of the storage engine logic is in the db package. I don't
think it would be too hard to make this pluggable so you could slide
in your own DB, say based on Derby/MySQL/BDB etc... I can see how
specialized implementations of the database part could be useful for
different apps.

Do you expect that the API will still be the same put/get style thrift
API ? Or are you hoping to expose the additional abilities of the
underlying db through the thrift API ? That makes the question more
interesting (and complicated).


On Sat, Apr 11, 2009 at 6:33 PM, Ian Holsman  wrote:
> hey.
>
> I was wondering how feasible it would be to de-couple the P2P layer of
> cassandra from the storage engine.
> I'd like to be able to plug in a non-column DB underneath, and use the DHT
> layer of cassandra.
>
> Is this something anyone else has considered doing?
> --
> Ian Holsman
> i...@holsman.net
>
>
>
>


could cassandra be split into two parts?

2009-04-11 Thread Ian Holsman

hey.

I was wondering how feasible it would be to de-couple the P2P layer of  
cassandra from the storage engine.
I'd like to be able to plug in a non-column DB underneath, and use the  
DHT layer of cassandra.


Is this something anyone else has considered doing?
--
Ian Holsman
i...@holsman.net





Hudson build is back to normal: Cassandra #29

2009-04-11 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Cassandra/29/changes