maedhroz commented on code in PR #2267:
URL: https://github.com/apache/cassandra/pull/2267#discussion_r1162025086
##########
conf/cassandra.yaml:
##########
@@ -1924,8 +1924,14 @@ drop_compact_storage_enabled: false
# which is used in some serialization to denote the format type in a compact
way (such as local key cache); and 'name'
# which will be used to recognize the format type - in particular that name
will be used in sstable file names and in
# stream headers so the name has to be the same for the same format across all
the nodes in the cluster.
+# The first entry in this list is the format that will be used for
newly-created SSTables. The other formats given
+# will be used to read any SSTables present in the data directories or
streamed.
sstable_formats:
- class_name: org.apache.cassandra.io.sstable.format.big.BigFormat
parameters:
id: 0
name: big
+ - class_name: org.apache.cassandra.io.sstable.format.bti.BtiFormat
+ parameters:
+ id: 1
+ name: bti
Review Comment:
I wasn't able to look at the original SSTable format API configuration work,
but there are a few things that worry me a bit:
1.) Hard to know without actually writing the code, but it feels like some
things would have been a bit simpler if we avoided `ParameterizedClass` and
just used a `EncryptionOptions`-eqsue configuration object to contain the
structure and validation logic for the formats.
2.) There are a couple things here that feel very dangerous to expose to an
operator via local configuration, with both local and cluster-wide implications.
The first is having the `id` concept specified in config. If we start a
node, write some SSTables, bring the node down, change the id of the primary
format, then start the node again, I think we can break things like
`KeyCacheSerializer#deserialize()`, which have expectations around previously
written `id`s and how they map to formats. The ID concept feels like it should
be statically and globally defined by the format itself. If you create a custom
format, it should simply avoid conflicting w/ built-in `id`s/`name`s. (This may
become a distributed problem around streaming too through a local configuration
change on another node?)
Second, while specifying a primary format (to write new SSTables) is
necessary, allowing that to be determined by physical order in the YAML should
IMO be avoided. It isn't a catastrophic risk, but one where you could silently
write the wrong format for an accidental ordering mistake. The selection of a
primary format should be explicit/not be a mystery in the absence of inline
comments.
Last (and I feel like we had this discussion around the Memtable API a while
back), having to specify anything about the **built-in** formats in YAML space
feels unnecessary. We'd still need something like this to specify custom
formats, etc.
I'm not pushing for a [Lucene-style SPI
approach](https://www.elastic.co/blog/what-is-an-apache-lucene-codec) or
anything in particular, but we need to at least discuss the above.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]