[jira] [Commented] (CASSANDRA-4258) Are we sorting the bloom filters in memory to increase the probability of getting proper result instead of just avoiding the false positive?
[ https://issues.apache.org/jira/browse/CASSANDRA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278856#comment-13278856 ]

Brandon Williams commented on CASSANDRA-4258:
---------------------------------------------

bq. 1) We can have some sorting of bloom filters based on logic like: the bloom filter of the sstable that successfully served the read request gets higher priority than the other bloom filters. In other words, we go to the bloom filter of the sstable that was most recently accessed and successfully returned the requested columns (an MRU approach, since the probability of getting the result from the MRU sstable is greater). This way we can reduce disk access.

Row-level BFs are in memory, so there is no disk access except in the case of false positives. It doesn't sound like your access pattern involves asking for non-existent keys, however.

bq. 2) The point is we should have some sort of logic for sorting the bloom filters to boost read performance in the case where sstables are not yet compacted.

I don't see any logical way of sorting BFs, but as I said, there is no disk access.

Are we sorting the bloom filters in memory to increase the probability of getting a proper result instead of just avoiding false positives?
-------------------------------------------------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-4258
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4258
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 1.1.1
            Reporter: Samarth Gahire
            Assignee: Jonathan Ellis
            Priority: Minor
              Labels: bloom-filter, read
             Fix For: 1.1.1

   Original Estimate: 336h
  Remaining Estimate: 336h

I was just wondering whether there is any logic for deciding which bloom filter should be checked first, to increase the probability of getting the result and not just to minimize the probability of a false positive.
( *Note:* I have checked the code, and I am not talking about *getting the BloomFilter with the lowest practical false positive probability* OR *getting the smallest BloomFilter that can provide the given false positive rate for the given number of elements*. )

*Consider the following scenario:*

1) In our Cassandra cluster we insert 130 million rows per day into a single column family, and in practice we cannot keep this data compacted at all times. (Loading already takes a long time, and compaction could take so long that it would affect the schedule for loading the next day's data.)

2) We insert the same row keys (the set of row keys is the same for all 130 million rows) every day, with a different supercolumn:

{code}
For date 20120101 we have

super_CF = {
  row_1: { super_column_20120101: { col1: val1,  col2: val2  } }
  row_2: { super_column_20120101: { col1: val3,  col2: val4  } }
  row_3: { super_column_20120101: { col1: val5,  col2: val6  } }
}

and for date 20120102 it will be

super_CF = {
  row_1: { super_column_20120102: { col1: val7,  col2: val8  } }
  row_2: { super_column_20120102: { col1: val9,  col2: val10 } }
  row_3: { super_column_20120102: { col1: val11, col2: val12 } }
}

Note that the set of row keys is the same for all days; only the supercolumn changes.
{code}

3) So if we do not compact the data for, say, 30 days, each row key is present in 30 different sstables.

4) In the worst case, even with a false positive probability of 0, there could be 30 unnecessary disk accesses.

5) Because of this scenario we are experiencing extremely degraded read performance.

*Proposed solution:*

1) We can have some sorting of bloom filters based on logic like: the bloom filter of the sstable that successfully served the read request gets higher priority than the other bloom filters.
In other words, we go to the bloom filter of the sstable that was most recently accessed and successfully returned the requested columns (an MRU approach, since the probability of getting the result from the MRU sstable is greater). This way we can reduce disk access.

2) The point is we should have some sort of logic for sorting the bloom filters to boost read performance in the case where sstables are not yet compacted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
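The MRU ordering proposed above can be sketched roughly as follows. This is a toy illustration only, not Cassandra code; the `BloomFilter`, `SSTable`, and `read` names are hypothetical stand-ins for the real read path:

```python
from hashlib import md5

class BloomFilter:
    """Toy bloom filter keyed on row keys; a simplified stand-in."""
    def __init__(self, size=1024, hashes=3):
        self.bits = [False] * size
        self.size = size
        self.hashes = hashes

    def _positions(self, key):
        # Derive several bit positions per key from salted md5 digests.
        for i in range(self.hashes):
            digest = md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(key))

class SSTable:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows                  # {row_key: {supercolumn: {...}}}
        self.bf = BloomFilter()
        for key in rows:
            self.bf.add(key)

def read(sstables, row_key, supercolumn):
    """Check bloom filters in list order; promote the sstable that actually
    served the request to the front (the proposed MRU ordering)."""
    for i, table in enumerate(sstables):
        if not table.bf.might_contain(row_key):
            continue                      # filter proves the key is absent
        columns = table.rows.get(row_key, {}).get(supercolumn)  # "disk" access
        if columns is not None:
            sstables.insert(0, sstables.pop(i))   # move to MRU position
            return columns
    return None
```

Note the limitation Brandon's reply points at: since every sstable in this scenario shares the same row keys, every bloom filter answers "maybe present", so the promotion only saves lookups when the same sstable keeps serving subsequent reads.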
[ https://issues.apache.org/jira/browse/CASSANDRA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278965#comment-13278965 ]

Samarth Gahire commented on CASSANDRA-4258:
-------------------------------------------

Please try to understand the scenario I am talking about. Even when it is not a false positive, in this scenario there will still be a disk access: the row key is in the sstable, but the supercolumn we are looking for is not. So this is an unnecessary disk access, e.g. one row having content in all the sstables, whether as supercolumns or as simple columns. In such cases, deciding on the correct bloom filter first would certainly save some disk accesses and improve read performance.
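The point being made here can be illustrated with a small sketch (hypothetical names, mirroring the 30-day scenario from the issue description): because the bloom filter is keyed on the row key alone, it answers "maybe present" for every sstable that contains the row, even though only one sstable holds the supercolumn for the requested date.

```python
# 30 days of flushes: every sstable contains row_1, each with a
# different day's supercolumn, as in the issue description.
sstables = [
    {"row_1": {f"super_column_201201{day:02d}": {"col1": f"val{day}"}}}
    for day in range(1, 31)
]

def disk_accesses(sstables, row_key, supercolumn):
    """Count sstable reads when the row-key bloom filter passes a table.
    A bloom-filter 'hit' on the key still costs a read even when the
    requested supercolumn is not in that sstable."""
    reads = 0
    for table in sstables:
        if row_key in table:        # bloom filter: "maybe present" (no FP here)
            reads += 1              # disk access to fetch the row
            if supercolumn in table[row_key]:
                break               # found the requested supercolumn
    return reads

# Worst case: the supercolumn we want lives in the last sstable checked,
# so all 30 reads happen even with a false-positive rate of zero.
worst = disk_accesses(sstables, "row_1", "super_column_20120130")
```

Checking oldest-first, a query for day 30's supercolumn touches all 30 sstables; a query for day 1's touches only one, which is why the check order matters in this workload.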
[ https://issues.apache.org/jira/browse/CASSANDRA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13278971#comment-13278971 ]

Jonathan Ellis commented on CASSANDRA-4258:
-------------------------------------------

Please read the issue I linked.