[jira] [Commented] (CASSANDRA-16776) modify SecondaryIndexManager#indexPartition() to retrieve only columns for which indexes are actually being built

Caleb Rackliffe (Jira) Fri, 09 Jul 2021 16:18:06 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378348#comment-17378348
 ]


Caleb Rackliffe commented on CASSANDRA-16776:
---------------------------------------------

Thanks [~azotcsit]. Definitely appreciated!

> modify SecondaryIndexManager#indexPartition() to retrieve only columns for 
> which indexes are actually being built
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16776
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16776
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.x
>
>         Attachments: index1.png, index2.png
>
>
> Secondary indexes are (for the moment) built as special compaction tasks via 
> {{SecondaryIndexBuilder}}. From a profiling perspective, the fun begins in 
> {{SecondaryIndexManager.indexPartition()}}. The work above it in 
> {{SecondaryIndexBuilder}} is just key iteration.
>  !index1.png! 
> Two basic things happen in {{indexPartition()}}. First, we read a single 
> partition in its entirety, and then we send individual rows to the 
> {{Indexer}}. When we read these partitions, we use {{ColumnFilter.all()}}, 
> which ends up materializing full rows, even when we’re indexing a single 
> column (or at least fewer columns than we need for all the indexes 
> participating in the build). If we narrowed this to fetch only the necessary 
> columns, we might be able to create less garbage in 
> {{AbstractBTreePartition#searchIterator()}} when we create a copy of the 
> underlying full row from disk.
> In some initial testing, I’ve been using a simple schema with fairly narrow 
> rows.
> {noformat}
> CREATE TABLE tlp_stress.allow_filtering (
>     partition_id text,
>     row_id int,
>     payload text,
>     value int,
>     PRIMARY KEY (partition_id, row_id)
> ) WITH CLUSTERING ORDER BY (row_id ASC)
> {noformat}
> The price of deserializing these rows is still visible, however, in the 
> results of some basic sampling profiling.
>  !index2.png! 
> The possible optimization above to avoid unnecessary copying of a row’s 
> columns would also narrow cell deserialization only to indexed cells, which 
> would probably be very beneficial for index builds with very wide rows. One 
> minor wrinkle in all of this is that since 3.0, it has been possible to 
> create indexes one entire rows, rather than single columns, so we’d have to 
> keep that case in mind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-16776) modify SecondaryIndexManager#indexPartition() to retrieve only columns for which indexes are actually being built

Reply via email to