[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756105#comment-17756105 ]

Caleb Rackliffe commented on CASSANDRA-18673:

An additional repeat run for {{StorageAttachedIndexDDLTest}} looks green, and all other failures are existing/unrelated. Moving to commit...

> Reduce size of per-SSTable index components
> ---
>
> Key: CASSANDRA-18673
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18673
> Project: Cassandra
> Issue Type: Improvement
> Components: Feature/SAI
> Reporter: Mike Adamson
> Assignee: Mike Adamson
> Priority: Urgent
> Fix For: 5.0.x, 5.x
>
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> The current per-SSTable index components are large because the primary keys
> that are stored in them include the token as part of the byte comparable. The
> byte comparable puts the token first meaning that we get very little prefix
> compression from either the trie or the sorted terms store.
> We can fix this by removing the token from the primary key serialization.
> This would allow us to get the prefix compression from the trie and the
> sorted terms store.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752874#comment-17752874 ]

Caleb Rackliffe commented on CASSANDRA-18673:

+1

I'm guessing the 5.0 and trunk patches will be identical, since we just created {{cassandra-5.0}}...
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752608#comment-17752608 ]

Caleb Rackliffe commented on CASSANDRA-18673:

Finished w/ my pass at review, and left my comments in the PR.
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17749857#comment-17749857 ]

Mike Adamson commented on CASSANDRA-18673:

[~maedhroz] I have attached a new PR to this ticket. This patch does the following:
* Removes the primary key trie on-disk component
* Adds a partition sizes on-disk component
* Adds a partitionedSeekToTerm to SortedTermsReader.Cursor
* Creates separate SkinnyRowAwarePrimaryKeyMap and WideRowAwarePrimaryKeyMap components
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747173#comment-17747173 ]

Caleb Rackliffe commented on CASSANDRA-18673:

For anyone watching, there are still some issues w/ how we handle/compress more complex primary keys. Once we've addressed those, this will move back into review...
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17745342#comment-17745342 ]

Caleb Rackliffe commented on CASSANDRA-18673:

Made a first pass at this and left some comments. Overall, things are looking pretty good and CI is clean...
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17745247#comment-17745247 ]

Caleb Rackliffe commented on CASSANDRA-18673:

[~mike_tr_adamson] Reviewing now, but want to make sure we don't forget to throw up a Phase 2 Jira for removing the sorted terms entirely in favor of {{row ID -> trie node ID}} map + collecting the PK from the trie itself...
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17745152#comment-17745152 ]

Mike Adamson commented on CASSANDRA-18673:

I have completed some performance runs against this branch and the current CEP branch. This loaded 1B rows with the following schema:
{noformat}
create table if not exists TEMPLATE(keyspace,test).TEMPLATE(table,sai) (
    id bigint,
    time timestamp,
    value int,
    lc int,
    tag text,
    PRIMARY KEY (id)
);
CREATE CUSTOM INDEX IF NOT EXISTS ON TEMPLATE(keyspace:test).TEMPLATE(table:sai) (time) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS ON TEMPLATE(keyspace:test).TEMPLATE(table:sai) (value) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS ON TEMPLATE(keyspace:test).TEMPLATE(table:sai) (lc) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS ON TEMPLATE(keyspace:test).TEMPLATE(table:sai) (tag) USING 'StorageAttachedIndex';
{noformat}
Data was loaded into the time, value & tag columns.

||Branch||SSTable Size GB||Per-SSTable Index Components GB||Tag Index GB||Time Index GB||Value Index GB||SAI Total GB||
|CEP|48|70|2|7|7|87|
|CASSANDRA-18673|48|13|2|7|7|29|
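To put the size table from this comment in relative terms, here is a small worked calculation in Python. It only uses the reported GB figures; the dictionary layout and the helper name `reduction_pct` are illustrative and not part of the patch or the benchmark tooling.

```python
# Sizes in GB, copied from the table in the comment above.
sizes_gb = {
    "CEP": {"sstable": 48, "per_sstable_index": 70, "sai_total": 87},
    "CASSANDRA-18673": {"sstable": 48, "per_sstable_index": 13, "sai_total": 29},
}


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


cep = sizes_gb["CEP"]
patched = sizes_gb["CASSANDRA-18673"]

# Shrinkage of the per-SSTable index components and of SAI overall.
per_sstable_cut = reduction_pct(cep["per_sstable_index"], patched["per_sstable_index"])
sai_total_cut = reduction_pct(cep["sai_total"], patched["sai_total"])

# Total SAI footprint expressed as a multiple of the SSTable data size.
overhead = {name: row["sai_total"] / row["sstable"] for name, row in sizes_gb.items()}

print(f"per-SSTable index components: {per_sstable_cut:.1f}% smaller")
print(f"SAI total: {sai_total_cut:.1f}% smaller")
for name, ratio in overhead.items():
    print(f"{name}: SAI is {ratio:.2f}x the SSTable size")
```

On these figures the per-SSTable components shrink by roughly 81% and the SAI total by roughly 67%, which takes the total index footprint from about 1.8x the SSTable size to about 0.6x.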
[jira] [Commented] (CASSANDRA-18673) Reduce size of per-SSTable index components
[ https://issues.apache.org/jira/browse/CASSANDRA-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744687#comment-17744687 ]

Mike Adamson commented on CASSANDRA-18673:

This patch introduces the following changes:
* The token is no longer included in the primary key data stored in the sorted terms and primary key trie. This allows the sorted terms and the primary key trie to correctly prefix compress the primary keys. This was not possible with the token at the start of the stored data.
* To cater for the primary keys no longer being in lexicographic order, the primary key trie is now segmented to allow the keys to be sorted in memory first.
* The NamedMemoryLimiter has been renamed the SegmentMemoryLimiter and simplified in its usage. This allows it to more easily be used by the SegmentBuilder for per-column indexes and by the primary key trie.
* The LongArray can now search for rowIds by token making it bidirectional.
* The primary key trie is only written for wide tables. If the table has no clustering then the rowId can be read from the token LongArray making the trie redundant.
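The effect of a leading token on prefix compression can be illustrated with a toy model. The Python sketch below is not the SAI on-disk format: it assumes an 8-byte big-endian `bigint` partition key, uses a BLAKE2 hash as a stand-in for the partitioner token, and measures simple front coding (each sorted key stores only the bytes that differ from its predecessor) as a rough proxy for the prefix sharing a trie would find. The helpers `toy_token`, `shared_prefix_len` and `front_coded_size` are hypothetical names introduced for this illustration.

```python
import hashlib
import struct


def toy_token(partition_key: bytes) -> bytes:
    """Stand-in for a partitioner token: 8 pseudo-random bytes derived from the key."""
    return hashlib.blake2b(partition_key, digest_size=8).digest()


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_coded_size(keys) -> int:
    """Bytes left once every sorted key drops the prefix shared with its predecessor."""
    total = 0
    prev = b""
    for key in sorted(keys):
        total += len(key) - shared_prefix_len(prev, key)
        prev = key
    return total


# 10,000 sequential bigint ids, encoded big-endian so byte order matches numeric order.
ids = [struct.pack(">q", i) for i in range(10_000)]

with_token = [toy_token(pk) + pk for pk in ids]  # token first, then the key
without_token = ids                              # key only

raw_with = sum(len(k) for k in with_token)
size_with = front_coded_size(with_token)
size_without = front_coded_size(without_token)

print(f"token first: {raw_with} raw bytes -> {size_with} after front coding")
print(f"token removed: {size_without} after front coding")
```

In this model the hashed token makes neighbouring keys diverge within the first byte or two, so almost nothing is shared, while the token-free keys share most of their leading bytes. Real savings depend on the key types, the partitioner and the actual trie and sorted terms encodings.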