[
https://issues.apache.org/jira/browse/CASSANDRA-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378348#comment-17378348
]
Caleb Rackliffe commented on CASSANDRA-16776:
---------------------------------------------
Thanks [~azotcsit]. Definitely appreciated!
> modify SecondaryIndexManager#indexPartition() to retrieve only columns for
> which indexes are actually being built
> -----------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16776
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16776
> Project: Cassandra
> Issue Type: Improvement
> Components: Feature/2i Index
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Fix For: 4.x
>
> Attachments: index1.png, index2.png
>
>
> Secondary indexes are (for the moment) built as special compaction tasks via
> {{SecondaryIndexBuilder}}. From a profiling perspective, the fun begins in
> {{SecondaryIndexManager.indexPartition()}}. The work above it in
> {{SecondaryIndexBuilder}} is just key iteration.
> !index1.png!
> Two basic things happen in {{indexPartition()}}. First, we read a single
> partition in its entirety, and then we send individual rows to the
> {{Indexer}}. When we read these partitions, we use {{ColumnFilter.all()}},
> which ends up materializing full rows, even when we’re indexing a single
> column (or, at most, only the columns needed by the indexes
> participating in the build). If we narrowed this to fetch only the necessary
> columns, we might be able to create less garbage in
> {{AbstractBTreePartition#searchIterator()}} when we create a copy of the
> underlying full row from disk.
> In some initial testing, I’ve been using a simple schema with fairly narrow
> rows.
> {noformat}
> CREATE TABLE tlp_stress.allow_filtering (
> partition_id text,
> row_id int,
> payload text,
> value int,
> PRIMARY KEY (partition_id, row_id)
> ) WITH CLUSTERING ORDER BY (row_id ASC)
> {noformat}
> The price of deserializing these rows is still visible, however, in the
> results of some basic sampling profiling.
> !index2.png!
> The possible optimization above to avoid unnecessary copying of a row’s
> columns would also narrow cell deserialization only to indexed cells, which
> would probably be very beneficial for index builds with very wide rows. One
> minor wrinkle in all of this is that since 3.0, it has been possible to
> create indexes on entire rows, rather than single columns, so we’d have to
> keep that case in mind.
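
The column-narrowing idea in the description above, including the whole-row wrinkle, can be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use Cassandra's actual ColumnFilter or Index classes, and the names IndexBuildColumns, IndexSpec, and columnsToFetch are hypothetical. The build would fetch the union of the columns the participating indexes need, and fall back to every column as soon as one of them indexes entire rows.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IndexBuildColumns {

    /** Hypothetical stand-in for an index taking part in a build (not a Cassandra class). */
    static final class IndexSpec {
        final String name;
        final Set<String> columns;      // columns this index reads
        final boolean indexesWholeRow;  // true for indexes defined on entire rows

        IndexSpec(String name, Set<String> columns, boolean indexesWholeRow) {
            this.name = name;
            this.columns = columns;
            this.indexesWholeRow = indexesWholeRow;
        }
    }

    /**
     * Returns the columns a build would need to fetch: the union of the columns
     * of all participating indexes, or every table column if any of those
     * indexes covers whole rows.
     */
    static Set<String> columnsToFetch(List<String> tableColumns, List<IndexSpec> building) {
        Set<String> needed = new LinkedHashSet<>();
        for (IndexSpec index : building) {
            if (index.indexesWholeRow) {
                // A whole-row index needs everything, so narrowing is not possible.
                return new LinkedHashSet<>(tableColumns);
            }
            needed.addAll(index.columns);
        }
        return needed;
    }

    public static void main(String[] args) {
        // Column names taken from the tlp_stress.allow_filtering schema above.
        List<String> table = List.of("partition_id", "row_id", "payload", "value");
        IndexSpec valueIdx = new IndexSpec("value_idx", Set.of("value"), false);
        IndexSpec rowIdx = new IndexSpec("row_idx", Set.of(), true);

        System.out.println(columnsToFetch(table, List.of(valueIdx)));
        System.out.println(columnsToFetch(table, List.of(valueIdx, rowIdx)));
    }
}
```

In the real code path, the resulting set would presumably be turned into a narrower ColumnFilter in place of ColumnFilter.all(); how that interacts with primary key columns and static columns is left out of this sketch.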