[jira] [Comment Edited] (CASSANDRA-16776) modify SecondaryIndexManager#indexPartition() to retrieve only columns for which indexes are actually being built

Caleb Rackliffe (Jira) Thu, 05 Aug 2021 10:05:07 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376010#comment-17376010 ]


Caleb Rackliffe edited comment on CASSANDRA-16776 at 8/5/21, 5:04 PM:
----------------------------------------------------------------------

[trunk|https://github.com/apache/cassandra/pull/1098]
 
[CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra?branch=CASSANDRA-16776]

To see how the patch reduces allocations, enable compaction profiling in 
{{CompactionAllocationTest}} and run the test 
{{widePartitionsSingleIndexedColumn}}. (This test indexes only one of 4 normal 
columns.) You should get about a 13% improvement in bytes allocated for index 
builds.

ex.
{noformat}
INFO  [main] 2021-07-06 15:16:01,720 CompactionAllocationTest.java:466 - *** widePartitionsSingleIndexedColumn compaction summary
INFO  [main] 2021-07-06 15:16:01,720 CompactionAllocationTest.java:467 - 463337000 bytes, 13099437 objects, 2145078 /partition, 2145 /row, 0 cpu
{noformat}
...then with the patch...
{noformat}
INFO  [main] 2021-07-06 15:11:51,958 CompactionAllocationTest.java:466 - *** widePartitionsSingleIndexedColumn compaction summary
INFO  [main] 2021-07-06 15:11:51,958 CompactionAllocationTest.java:467 - 402830648 bytes, 11802336 objects, 1864956 /partition, 1864 /row, 0 cpu
{noformat}
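
As a quick check on the 13% figure, the two summaries above can be compared directly. This is only a minimal sketch: the class and method names are made up for illustration, and the numbers are copied from the log lines above.

```java
public class AllocationDelta {
    /** Percentage reduction going from {@code before} to {@code after}. */
    public static double percentReduction(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Figures taken from the two compaction summaries (without and with the patch).
        long bytesBefore = 463_337_000L;
        long bytesAfter = 402_830_648L;
        long objectsBefore = 13_099_437L;
        long objectsAfter = 11_802_336L;

        System.out.printf("bytes allocated: %.1f%% fewer%n",
                          percentReduction(bytesBefore, bytesAfter));
        System.out.printf("objects allocated: %.1f%% fewer%n",
                          percentReduction(objectsBefore, objectsAfter));
    }
}
```

This works out to roughly 13% fewer bytes and just under 10% fewer objects for this run.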


was (Author: maedhroz):
[trunk|https://github.com/apache/cassandra/pull/1098]
[CircleCI J8|https://app.circleci.com/pipelines/github/maedhroz/cassandra/284/workflows/d095f8c2-d17d-4f9f-b5dd-0ab50f98901f]
[CircleCI J11|https://app.circleci.com/pipelines/github/maedhroz/cassandra/284/workflows/4664cf14-7e0c-4680-8154-0a4fd340770a]

To see how the patch reduces allocations, enable compaction profiling in 
{{CompactionAllocationTest}} and run the test 
{{widePartitionsSingleIndexedColumn}}. (This test indexes only one of 4 normal 
columns.) You should get about a 13% improvement in bytes allocated for index 
builds.

ex.
{noformat}
INFO  [main] 2021-07-06 15:16:01,720 CompactionAllocationTest.java:466 - *** widePartitionsSingleIndexedColumn compaction summary
INFO  [main] 2021-07-06 15:16:01,720 CompactionAllocationTest.java:467 - 463337000 bytes, 13099437 objects, 2145078 /partition, 2145 /row, 0 cpu
{noformat}
...then with the patch...
{noformat}
INFO  [main] 2021-07-06 15:11:51,958 CompactionAllocationTest.java:466 - *** widePartitionsSingleIndexedColumn compaction summary
INFO  [main] 2021-07-06 15:11:51,958 CompactionAllocationTest.java:467 - 402830648 bytes, 11802336 objects, 1864956 /partition, 1864 /row, 0 cpu
{noformat}

> modify SecondaryIndexManager#indexPartition() to retrieve only columns for 
> which indexes are actually being built
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16776
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16776
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.x
>
>         Attachments: index1.png, index2.png
>
>
> Secondary indexes are (for the moment) built as special compaction tasks via 
> {{SecondaryIndexBuilder}}. From a profiling perspective, the fun begins in 
> {{SecondaryIndexManager.indexPartition()}}. The work above it in 
> {{SecondaryIndexBuilder}} is just key iteration.
>  !index1.png! 
> Two basic things happen in {{indexPartition()}}. First, we read a single 
> partition in its entirety, and then we send individual rows to the 
> {{Indexer}}. When we read these partitions, we use {{ColumnFilter.all()}}, 
> which ends up materializing full rows, even when we’re indexing a single 
> column (or at least fewer columns than we need for all the indexes 
> participating in the build). If we narrowed this to fetch only the necessary 
> columns, we might be able to create less garbage in 
> {{AbstractBTreePartition#searchIterator()}} when we create a copy of the 
> underlying full row from disk.
> In some initial testing, I’ve been using a simple schema with fairly narrow 
> rows.
> {noformat}
> CREATE TABLE tlp_stress.allow_filtering (
>     partition_id text,
>     row_id int,
>     payload text,
>     value int,
>     PRIMARY KEY (partition_id, row_id)
> ) WITH CLUSTERING ORDER BY (row_id ASC)
> {noformat}
> The price of deserializing these rows is still visible, however, in the 
> results of some basic sampling profiling.
>  !index2.png! 
> The possible optimization above to avoid unnecessary copying of a row’s 
> columns would also narrow cell deserialization only to indexed cells, which 
> would probably be very beneficial for index builds with very wide rows. One 
> minor wrinkle in all of this is that since 3.0, it has been possible to 
> create indexes on entire rows, rather than single columns, so we’d have to 
> keep that case in mind.
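
The narrowing idea in the description can be sketched independently of Cassandra's own classes. What follows is a toy model only: {{IndexSpec}} and {{columnsToFetch}} are hypothetical names, not Cassandra APIs, and using a null target to mean "this index covers entire rows" is an assumption made for the example. It takes the union of the columns targeted by the indexes being built, and falls back to a full read if any of them is a row-level index.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class IndexBuildColumns {
    /** Hypothetical stand-in for an index definition; not a Cassandra class. */
    public static final class IndexSpec {
        final String name;
        final String targetColumn; // null means the index covers entire rows (assumption)

        public IndexSpec(String name, String targetColumn) {
            this.name = name;
            this.targetColumn = targetColumn;
        }
    }

    /**
     * Returns the set of columns a build of the given indexes needs to read,
     * or Optional.empty() if any index needs whole rows and the read must
     * fetch every column.
     */
    public static Optional<Set<String>> columnsToFetch(List<IndexSpec> building) {
        Set<String> needed = new LinkedHashSet<>();
        for (IndexSpec index : building) {
            if (index.targetColumn == null)
                return Optional.empty(); // row-level index: fall back to a full read
            needed.add(index.targetColumn);
        }
        return Optional.of(needed);
    }

    public static void main(String[] args) {
        // Column names come from the example schema; index names are invented.
        List<IndexSpec> singleColumn = Arrays.asList(new IndexSpec("value_idx", "value"));
        List<IndexSpec> withRowIndex = Arrays.asList(new IndexSpec("value_idx", "value"),
                                                     new IndexSpec("row_idx", null));

        System.out.println(columnsToFetch(singleColumn)); // Optional[[value]]
        System.out.println(columnsToFetch(withRowIndex)); // Optional.empty
    }
}
```

With the example schema, a build of a single index on {{value}} would then skip {{payload}} entirely, which is where the savings for wide rows would come from.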



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

