Re: [PR] CASSANDRA-21078 move training params to CQL [cassandra]

via GitHub Sun, 21 Dec 2025 23:00:00 -0800


yifan-c commented on code in PR #4523:
URL: https://github.com/apache/cassandra/pull/4523#discussion_r2638816697



##########
src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java:
##########
@@ -68,18 +75,40 @@ public static class TrainDictionary extends AbstractCommand
         @Option(names = { "-f", "--force" }, description = "Force the dictionary training even if there are not enough samples")
         private boolean force = false;
 
+        @Option(names = {"--max-dict-size"}, description = "Maximum size of a trained compression dictionary. " +
+                                                           "Larger dictionaries may provide better compression but use more memory. When not set, " +
+                                                           "the value from compression configuration from CQL for a given table is used. " +
+                                                           "The default value is " + DEFAULT_TRAINING_MAX_DICTIONARY_SIZE_PARAMETER_VALUE + '.')
+        private String trainingMaxDictionarySize;
+
+        @Option(names = "--max-total-sample-size", description = "Maximum total size of sample data to collect for dictionary training. " +
+                                                                 "More sample data generally produces better dictionaries but takes longer to train. " +
+                                                                 "The recommended sample size is 100x the dictionary size. When not set, " +
+                                                                 "the value from compression configuration from CQL for a give table is used. " +
+                                                                 "The default value is " + DEFAULT_TRAINING_MAX_TOTAL_SAMPLE_SIZE_PARAMETER_VALUE + '.')
+        private String trainingMaxTotalSampleSize;
+
         @Override
         public void execute(NodeProbe probe)
         {
             PrintStream out = probe.output().out;
             PrintStream err = probe.output().err;
 
+            validateParameters(err, trainingMaxDictionarySize, trainingMaxTotalSampleSize);
+
             try
             {
                 out.printf("Starting compression dictionary training for %s.%s...%n", keyspace, table);
                 out.printf("Training from existing SSTables (flushing first if needed)%n");
 
-                probe.trainCompressionDictionary(keyspace, table, force);
+                Map<String, String> parameters = new HashMap<>();
+                if (trainingMaxTotalSampleSize != null)
+                    parameters.put(TRAINING_MAX_DICTIONARY_SIZE_PARAMETER_NAME, trainingMaxDictionarySize);

Review Comment:
   The condition looks wrong: it checks `trainingMaxTotalSampleSize` but then puts `trainingMaxDictionarySize` into the map. It should be 
   ```
                   if (trainingMaxDictionarySize != null)
   ```
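
For illustration, here is a minimal, self-contained sketch of how the parameter map could be built so that each option is guarded by its own null check. The class name, the `buildParameters` helper, the `TRAINING_MAX_TOTAL_SAMPLE_SIZE_PARAMETER_NAME` constant, and both constant values are assumptions made for this example; only `TRAINING_MAX_DICTIONARY_SIZE_PARAMETER_NAME` and the two field names appear in the hunk above, and the actual code in the PR may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class TrainingParametersSketch
{
    // Assumed values; the real constants are defined elsewhere in the PR.
    static final String TRAINING_MAX_DICTIONARY_SIZE_PARAMETER_NAME = "training_max_dictionary_size";
    static final String TRAINING_MAX_TOTAL_SAMPLE_SIZE_PARAMETER_NAME = "training_max_total_sample_size";

    // Hypothetical helper: adds an entry only for options the user actually set,
    // so that unset options can fall back to the table's CQL configuration.
    static Map<String, String> buildParameters(String trainingMaxDictionarySize,
                                               String trainingMaxTotalSampleSize)
    {
        Map<String, String> parameters = new HashMap<>();

        // Each value is guarded by a null check on the same variable that is put into the map.
        if (trainingMaxDictionarySize != null)
            parameters.put(TRAINING_MAX_DICTIONARY_SIZE_PARAMETER_NAME, trainingMaxDictionarySize);

        if (trainingMaxTotalSampleSize != null)
            parameters.put(TRAINING_MAX_TOTAL_SAMPLE_SIZE_PARAMETER_NAME, trainingMaxTotalSampleSize);

        return parameters;
    }
}
```

With the condition as written in the hunk, passing only `--max-total-sample-size` would store a `null` value under the dictionary-size key, and passing only `--max-dict-size` would drop that option; the per-variable checks in the sketch avoid both cases.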



##########
doc/modules/cassandra/pages/managing/operating/compression.adoc:
##########
@@ -323,11 +317,34 @@ compression_dictionary_cache_expire: 3600
 
 # Automatic training
 compression_dictionary_training_auto_train_enabled: false
-compression_dictionary_training_sampling_rate: 100
-compression_dictionary_training_max_dictionary_size: 65536
-compression_dictionary_training_max_total_sample_size: 10485760
+compression_dictionary_training_sampling_rate: 0.01
 ----
 
+=== CQL training parameters:
+
+These parameters are meant to be configured via CQL for each respective table if defaults are not appropriate.
+
+* `training_max_total_sample_size` (default: `10MiB`): Maximum total size of sample data to collect for training, approximately 10MB. This is a parameter of `ZstdDictionaryCompressor`
+of a table, in `compression` section.

Review Comment:
   nit:
   
   ```suggestion
   * `training_max_total_sample_size` (default: `10MiB`): Maximum total size of sample data to collect for training, approximately 10MB. This parameter is configured in the
   table's compression options for `ZstdDictionaryCompressor`.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

