[
https://issues.apache.org/jira/browse/CASSANDRA-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202876#comment-15202876
]
Jack Krupansky commented on CASSANDRA-11383:
--------------------------------------------
The int field could easily be made a text field if that would make SASI work
better (you could even do a prefix query by year then).
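For illustration only, a minimal CQL sketch of that idea; the table, column,
and index names (resource_usage, end_of_month, resource_eom_idx) are
hypothetical, not from the reporter's schema:

    CREATE TABLE resource_usage (
        resource_id text PRIMARY KEY,
        end_of_month text,  -- e.g. '2016-03-31' stored as text instead of an int
        usage bigint
    );

    CREATE CUSTOM INDEX resource_eom_idx ON resource_usage (end_of_month)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = { 'mode': 'PREFIX' };

    -- prefix query by year
    SELECT * FROM resource_usage WHERE end_of_month LIKE '2016%';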
Point 1 is precisely what SASI SPARSE is designed for. It is also what
Materialized Views (formerly Global Indexes) are for, and an MV is even better
here because it eliminates the need to scan multiple nodes: the rows are
collected under the new partition key, which can include the indexed data
value.
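Again just a sketch of the MV alternative, reusing the hypothetical
resource_usage table above (the view name is made up):

    CREATE MATERIALIZED VIEW resource_usage_by_eom AS
        SELECT * FROM resource_usage
        WHERE end_of_month IS NOT NULL AND resource_id IS NOT NULL
        PRIMARY KEY (end_of_month, resource_id);

    -- all rows for a given month are collected under one partition,
    -- so this is a single-partition read rather than a multi-node scan
    SELECT * FROM resource_usage_by_eom WHERE end_of_month = '2016-03-31';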
You're using cardinality backwards - it is supposed to be a measure of the
number of distinct values in a column, not the number of rows containing each
value. See: https://en.wikipedia.org/wiki/Cardinality_%28SQL_statements%29.
Granted, in ERD terms cardinality is the count of rows in a second table for
each column value in a given table (one to n, n to one, etc.), but in the
context of an index only one table is involved; you could consider the index
itself to be a table, but that would be a little odd. In any case, it is best
to stick with the standard SQL meaning: the cardinality of the data values in
a column.
So, to be clear, an email address is high cardinality and gender is low
cardinality. And the end-of-month int field is low cardinality, or not dense
in the original SASI doc terminology.
> SASI index build leads to massive OOM
> -------------------------------------
>
> Key: CASSANDRA-11383
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11383
> Project: Cassandra
> Issue Type: Bug
> Components: CQL
> Environment: C* 3.4
> Reporter: DOAN DuyHai
> Attachments: CASSANDRA-11383.patch, new_system_log_CMS_8GB_OOM.log,
> system.log_sasi_build_oom
>
>
> 13 bare metal machines
> - 6 cores CPU (12 HT)
> - 64Gb RAM
> - 4 SSD in RAID0
> JVM settings:
> - G1 GC
> - Xms32G, Xmx32G
> Data set:
> - ≈ 100Gb per node
> - 1.3 Tb cluster-wide
> - ≈ 20Gb for all SASI indices
> C* settings:
> - concurrent_compactors: 1
> - compaction_throughput_mb_per_sec: 256
> - memtable_heap_space_in_mb: 2048
> - memtable_offheap_space_in_mb: 2048
> I created 9 SASI indices
> - 8 indices with text field, NonTokenizingAnalyzer, PREFIX mode,
> case-insensitive
> - 1 index with numeric field, SPARSE mode
> After a while, the nodes just went OOM.
> I have attached the log files. You can see a lot of GC happening while index
> segments are flushed to disk. At some point the node OOMs ...
> /cc [~xedin]