[ https://issues.apache.org/jira/browse/CASSANDRA-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-6446:
----------------------------------------

    Attachment: 0002-6446-Read-patch-v2.txt
                0001-6446-write-path-v2.txt

bq.  All we know at the moment is that having it at 10000 does not degrade 
performance for small memtables and solves write timeout problems with large 
ones

Fair enough, let's leave it at that for now.

bq.  If we make 2 calls to the existing searchInternal, we could have a situation 
where both start and end do not fit into any range

Ok, but that just means we need to slightly modify searchInternal to return the 
"insertion point". My point was that the new searchInternal was very close to 
being 2 copies of the existing one, and surely we can do better.
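
To make the "insertion point" idea concrete, here is a minimal sketch in the 
spirit of Collections.binarySearch; the method name and the bare long[] of range 
starts are illustrative, not the actual searchInternal code:

{code:java}
public final class InsertionPointSketch
{
    /**
     * starts must be sorted. Returns the index of the matching start when there
     * is one, otherwise -(insertionPoint) - 1, encoded the same way
     * Collections.binarySearch does it.
     */
    static int searchWithInsertionPoint(long[] starts, long bound)
    {
        int low = 0, high = starts.length - 1;
        while (low <= high)
        {
            int mid = (low + high) >>> 1;
            if (starts[mid] < bound)      low = mid + 1;
            else if (starts[mid] > bound) high = mid - 1;
            else                          return mid;
        }
        return -(low + 1);
    }

    public static void main(String[] args)
    {
        long[] starts = { 0, 100, 500 };
        // 250 falls in a "hole": the caller decodes the insertion point and can
        // decide whether the previous range still covers it or start from there.
        int idx = searchWithInsertionPoint(starts, 250);
        System.out.println(idx >= 0 ? "exact match at " + idx : "insertion point " + (-idx - 1));
    }
}
{code}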

bq. The proposed way, the code would be more complicated and probably error prone.

Pretending it complicates things in a meaningful way is a bit disingenuous imo.  
Iterating over names and checking if a range covers them is hardly "more 
complicated and error prone" than iterating over ranges and checking if a name 
is covered.

bq. While multiple Name/Slice queries are rare

Well, for names, this is exactly what I'm saying: since the number of names 
will likely be small, there really is no point in iterating over all ranges 
like your patch does, especially if we optimize for wide partitions with lots 
of range tombstones (which is exactly what this ticket is about).
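
For what it's worth, a rough sketch of that iteration order (Range and its 
fields are illustrative, not the real RangeTombstoneList internals): with 
sorted, non-overlapping ranges each name costs one binary search, so the cost is 
O(names * log(ranges)) rather than a scan of every range tombstone in the 
partition.

{code:java}
import java.util.Arrays;
import java.util.List;

public final class NamesLookupSketch
{
    // Illustrative stand-in for a range tombstone, inclusive bounds for simplicity.
    static final class Range
    {
        final long start, end;
        Range(long start, long end) { this.start = start; this.end = end; }
    }

    /** Binary search: is 'name' covered by one of the sorted, non-overlapping ranges? */
    static boolean isCovered(List<Range> sortedRanges, long name)
    {
        int low = 0, high = sortedRanges.size() - 1;
        while (low <= high)
        {
            int mid = (low + high) >>> 1;
            Range r = sortedRanges.get(mid);
            if (name < r.start)    high = mid - 1;
            else if (name > r.end) low = mid + 1;
            else                   return true; // r.start <= name <= r.end
        }
        return false;
    }

    public static void main(String[] args)
    {
        List<Range> ranges = Arrays.asList(new Range(0, 9), new Range(100, 199));
        // The name count is small, so we loop over names, not over all ranges.
        for (long name : new long[]{ 5, 50, 150 })
            System.out.println(name + " covered: " + isCovered(ranges, name));
    }
}
{code}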

For slices, CQL3 does generate queries with multiple slices and in that case 
the slices may very well have large "holes" between what they select. And given 
that iterating over slices doesn't seem all that complicated to me, I'd rather 
not leave this case poorly optimized for no good reason.
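
And a similar sketch for slices (again with illustrative types, not the real 
slice or tombstone classes): for each slice we jump to the first tombstone that 
can intersect it and stop as soon as tombstones start past the slice end, so the 
holes between slices are never scanned.

{code:java}
import java.util.Arrays;
import java.util.List;

public final class SlicesLookupSketch
{
    static final class Range
    {
        final long start, end;
        Range(long start, long end) { this.start = start; this.end = end; }
    }

    /** Index of the first range whose end is >= bound, i.e. the first candidate for a slice starting at 'bound'. */
    static int firstCandidate(List<Range> sortedRanges, long bound)
    {
        int low = 0, high = sortedRanges.size() - 1;
        while (low <= high)
        {
            int mid = (low + high) >>> 1;
            if (sortedRanges.get(mid).end < bound) low = mid + 1;
            else                                   high = mid - 1;
        }
        return low;
    }

    public static void main(String[] args)
    {
        List<Range> tombstones = Arrays.asList(new Range(0, 9), new Range(100, 199), new Range(500, 599));
        List<Range> slices = Arrays.asList(new Range(5, 20), new Range(450, 700)); // large "hole" in between

        for (Range slice : slices)
        {
            // Start at the first tombstone that can intersect this slice and stop
            // once tombstones begin past the slice end.
            for (int i = firstCandidate(tombstones, slice.start);
                 i < tombstones.size() && tombstones.get(i).start <= slice.end;
                 i++)
                System.out.println("slice [" + slice.start + "," + slice.end + "] intersects tombstone ["
                                   + tombstones.get(i).start + "," + tombstones.get(i).end + "]");
        }
    }
}
{code}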

So anyway, attaching a v2 of the read patch that fixes the code style and 
incorporates the modifications I think we should make. Attaching a rebase of the 
write patch against trunk too.


> Faster range tombstones on wide partitions
> ------------------------------------------
>
>                 Key: CASSANDRA-6446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6446
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Oleg Anastasyev
>            Assignee: Oleg Anastasyev
>             Fix For: 2.1
>
>         Attachments: 0001-6446-write-path-v2.txt, 
> 0002-6446-Read-patch-v2.txt, RangeTombstonesReadOptimization.diff, 
> RangeTombstonesWriteOptimization.diff
>
>
> With wide CQL rows (~1M in a single partition), after deleting some of them we 
> found inefficiencies in the handling of range tombstones on both the write and 
> read paths.
> I attached 2 patches here, one for the write path 
> (RangeTombstonesWriteOptimization.diff) and another for the read path 
> (RangeTombstonesReadOptimization.diff).
> On the write path, when you delete CQL rows by primary key, each deletion is 
> represented by a range tombstone. When this tombstone is put into the memtable, 
> the original code takes all columns of the partition from the memtable and 
> checks DeletionInfo.isDeleted with a brute-force for loop to decide whether 
> each column should stay in the memtable or was deleted by the new tombstone. 
> Needless to say, the more columns you have in the partition, the slower 
> deletions become, heating your CPU with brute-force range tombstone checks. 
> For partitions with more than 10000 columns, the 
> RangeTombstonesWriteOptimization.diff patch loops over the tombstones instead 
> and checks the existence of columns for each of them. It also copies the whole 
> memtable range tombstone list only if there are changes to be made there (the 
> original code copies the range tombstone list on every write).
> On the read path, the original code scans the whole range tombstone list of a 
> partition to match sstable columns to their range tombstones. The 
> RangeTombstonesReadOptimization.diff patch scans only the necessary range of 
> tombstones, according to the filter used for the read.
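
For readers following the description above, a very rough sketch of the 
write-path threshold decision it describes; apart from the 10000 constant, all 
names and types here (the skip-list map standing in for the partition's sorted 
columns, the Range tombstone shape) are illustrative, not the actual Memtable 
code:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public final class WritePathSketch
{
    static final class Range
    {
        final long start, end;
        Range(long start, long end) { this.start = start; this.end = end; }
    }

    static final int THRESHOLD = 10000; // the per-partition column count discussed above

    static void applyTombstones(ConcurrentSkipListMap<Long, byte[]> columns, List<Range> tombstones)
    {
        if (columns.size() <= THRESHOLD)
        {
            // Small partition: the original approach, scan the columns and drop the deleted ones.
            columns.keySet().removeIf(name -> isDeleted(tombstones, name));
        }
        else
        {
            // Wide partition: loop over the (few) tombstones and only touch the
            // columns each one actually covers, via the sorted map's subMap view.
            for (Range rt : tombstones)
                columns.subMap(rt.start, true, rt.end, true).clear();
        }
    }

    static boolean isDeleted(List<Range> tombstones, long name)
    {
        for (Range rt : tombstones)
            if (name >= rt.start && name <= rt.end)
                return true;
        return false;
    }

    public static void main(String[] args)
    {
        ConcurrentSkipListMap<Long, byte[]> columns = new ConcurrentSkipListMap<>();
        for (long i = 0; i < 20; i++)
            columns.put(i, new byte[0]);
        applyTombstones(columns, Arrays.asList(new Range(5, 9)));
        System.out.println("columns left: " + columns.size()); // 15
    }
}
{code}

The point of the second branch is that on a wide partition the work done per 
deletion is proportional to what the tombstones actually cover, not to the total 
column count of the partition.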



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
