[
https://issues.apache.org/jira/browse/CASSANDRA-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524978#comment-17524978
]
Stefan Miklosovic commented on CASSANDRA-17180:
-----------------------------------------------
After spending more time on this, I identified an issue I am not sure how to
solve. My unit tests did not catch it because I was, more or less, mocking this
part, but once I actually tried the check on a running node, to my surprise it
did not detect the tables which should have caused violations.
The reason is that, when I am about to read the gc grace period of each table,
I first iterate over all keyspaces like this:
{code}
Schema.instance.getUserKeyspaces()
{code}
and then, for each returned keyspace, I iterate over its tables and read the gc
grace period parameter.
The problem with this approach is that Schema.instance.getUserKeyspaces()
returns no keyspaces. This happens because startup checks are executed too
early, before the schema is fully populated.
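To make the failure mode concrete, here is a minimal, self-contained sketch of the check as described above. It is not the actual patch: the class name, the nested map standing in for keyspace and table metadata, and the heartbeat timestamp are all hypothetical stand-ins. It only illustrates that an empty keyspace collection makes the check pass silently instead of reporting violations.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the startup check; no Cassandra classes are used.
public class GcGraceCheckSketch
{
    /**
     * keyspaces maps keyspace name -> (table name -> gc grace period in seconds).
     * Returns "keyspace.table" for every table whose gc grace period is shorter
     * than the time elapsed since the last recorded heartbeat.
     */
    static List<String> findViolations(Map<String, Map<String, Integer>> keyspaces,
                                       Instant lastHeartbeat,
                                       Instant now)
    {
        long downtimeSeconds = Duration.between(lastHeartbeat, now).getSeconds();
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> ks : keyspaces.entrySet())
            for (Map.Entry<String, Integer> table : ks.getValue().entrySet())
                if (table.getValue() < downtimeSeconds)
                    violations.add(ks.getKey() + '.' + table.getKey());
        return violations;
    }

    public static void main(String[] args)
    {
        Instant now = Instant.parse("2022-04-20T00:00:00Z");
        Instant lastHeartbeat = now.minus(Duration.ofDays(11)); // node was down for 11 days

        Map<String, Map<String, Integer>> populated = new LinkedHashMap<>();
        populated.put("ks1", Map.of("t1", 864000)); // gc grace of 10 days

        // With metadata available, the violation is reported.
        System.out.println(findViolations(populated, lastHeartbeat, now)); // [ks1.t1]

        // With no keyspaces (the situation during startup checks), nothing is reported.
        System.out.println(findViolations(Map.of(), lastHeartbeat, now));  // []
    }
}
```

The second call mirrors what happens on a real node at this point in startup: the loop body never runs, so the check cannot fail.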
Interestingly enough, there is another check later, called
"checkSystemKeyspaceState", which uses the following call, and that call does
not return an empty result:
"Schema.instance.getTablesAndViews(SchemaConstants.SYSTEM_KEYSPACE_NAME)"
This internally translates to:
{code}
public Iterable<TableMetadata> getTablesAndViews(String keyspaceName)
{
    Preconditions.checkNotNull(keyspaceName);
    KeyspaceMetadata ksm = ObjectUtils.getFirstNonNull(() -> distributedKeyspaces.getNullable(keyspaceName),
                                                       () -> localKeyspaces.getNullable(keyspaceName));
    Preconditions.checkNotNull(ksm, "Keyspace %s not found", keyspaceName);
    return ksm.tablesAndViews();
}
{code}
So here, localKeyspaces is populated, but distributedKeyspaces is not. The
difference is that local keyspaces are populated directly in the constructor of
Schema, like this:
{code}
private Schema()
{
    this.online = isDaemonInitialized();
    this.localKeyspaces = (FORCE_LOAD_LOCAL_KEYSPACES || isDaemonInitialized() || isToolInitialized())
                          ? Keyspaces.of(SchemaKeyspace.metadata(), SystemKeyspace.metadata())
                          : Keyspaces.none();
    ....
{code}
On the other hand, the distributed keyspaces I am interested in seem to be
populated as a result of schema migrations, which happen later, after all
checks have run.
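The lookup order discussed above can be illustrated with a small, self-contained sketch. It does not use Cassandra classes: KeyspaceLookupSketch, firstNonNull and the two plain maps are hypothetical stand-ins for Schema, ObjectUtils.getFirstNonNull and the distributedKeyspaces / localKeyspaces fields. It only shows why a system keyspace lookup succeeds early while a user keyspace lookup finds nothing until the distributed side has been populated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of the two keyspace collections; values are plain strings
// standing in for keyspace metadata.
public class KeyspaceLookupSketch
{
    final Map<String, String> distributedKeyspaces = new HashMap<>(); // filled later, after checks
    final Map<String, String> localKeyspaces = new HashMap<>();       // filled at construction time

    /** Evaluates the suppliers in order and returns the first non-null result, or null. */
    @SafeVarargs
    static <T> T firstNonNull(Supplier<T>... suppliers)
    {
        for (Supplier<T> supplier : suppliers)
        {
            T value = supplier.get();
            if (value != null)
                return value;
        }
        return null;
    }

    /** Looks in the distributed keyspaces first and falls back to the local ones. */
    String lookup(String keyspaceName)
    {
        return firstNonNull(() -> distributedKeyspaces.get(keyspaceName),
                            () -> localKeyspaces.get(keyspaceName));
    }

    public static void main(String[] args)
    {
        KeyspaceLookupSketch schema = new KeyspaceLookupSketch();
        schema.localKeyspaces.put("system", "system-metadata");

        // During startup checks: the local keyspace is found, the user keyspace is not.
        System.out.println(schema.lookup("system"));
        System.out.println(schema.lookup("user_ks"));

        // After the distributed keyspaces have been loaded, the user keyspace is visible.
        schema.distributedKeyspaces.put("user_ks", "user-metadata");
        System.out.println(schema.lookup("user_ks"));
    }
}
```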
Other approaches, like calling "Keyspace.nonSystem" or similar, run internally
into the same "empty distributedKeyspaces" problem. I could manually parse
what is on disk, but then I am not sure how to get table metadata from it,
and this approach does not seem robust enough anyway.
[~brandon.williams] [~paulo] do you have any idea how to get metadata
information this early in the boot sequence? If that turns out to be
impossible, I think we have to abandon the startup check approach completely
and do the check after the schema is fully initialised. On the other hand, we
cannot do it too late, because we still need to be able to prevent the node
from running.
> Implement startup check to prevent Cassandra start to spread zombie data
> ------------------------------------------------------------------------
>
> Key: CASSANDRA-17180
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17180
> Project: Cassandra
> Issue Type: New Feature
> Components: Legacy/Observability
> Reporter: Stefan Miklosovic
> Assignee: Stefan Miklosovic
> Priority: Normal
> Time Spent: 9.5h
> Remaining Estimate: 0h
>
> As already discussed on the ML, it would be nice to have a service which would
> periodically write a timestamp to a file, signalling that the node is up and running.
> Then, on startup, we would read this file and determine whether there
> is any table whose gc grace period has elapsed since that timestamp; if so, we would
> fail the start to prevent zombie data from being spread around the cluster.
> https://lists.apache.org/thread/w4w5t2hlcrvqhgdwww61hgg58qz13glw