Tódor Dávid Nyeste created SOLR-16888:
-----------------------------------------
Summary: Data corruption prevents snapshots with previously used
names from being created or deleted
Key: SOLR-16888
URL: https://issues.apache.org/jira/browse/SOLR-16888
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 8.4.1
Reporter: Tódor Dávid Nyeste
Solr snapshot metadata is stored on the filesystem, and ZooKeeper stores information
about those snapshots as well. If ZooKeeper's data gets corrupted for some reason, the
snapshot falls out of sync. No further snapshots can then be created with the
same name, because
[CreateSnapshotCmd|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.4.1/solr/core/src/java/org/apache/solr/cloud/api/collections/CreateSnapshotCmd.java]
begins creating a new ZooKeeper entry but then fails in
[SolrSnapshotMetaDataManager|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.4.1/solr/core/src/java/org/apache/solr/core/snapshots/SolrSnapshotMetaDataManager.java#L178]'s
snapshot() function: the filesystem already contains a snapshot with this name,
and the nameToDetailsMapping map obtains its data from there. Deleting this
snapshot with
[DeleteSnapshotCmd|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.4.1/solr/core/src/java/org/apache/solr/cloud/api/collections/DeleteSnapshotCmd.java]
successfully removes the ZooKeeper entry. However, since it is a failed snapshot
with no cores assigned, the command cannot remove the metadata on the filesystem
(yet it still returns a misleading "successfully deleted" message).
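To make the failure sequence concrete, here is a minimal in-memory model of the desync, assuming a simplified flow in which the ZooKeeper entry is written first and each core's metadata is checked afterwards. This is not Solr code: `SnapshotDesyncModel`, `zkSnapshots`, `coreMetadata`, `create` and `delete` are hypothetical names standing in for the ZooKeeper entries, the per-core on-disk metadata (nameToDetailsMapping), CreateSnapshotCmd and DeleteSnapshotCmd respectively.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical, simplified model of the snapshot desync (not Solr code).
 * zkSnapshots stands in for ZooKeeper; coreMetadata for each core's fs metadata.
 */
class SnapshotDesyncModel {
    // ZooKeeper side: snapshot name -> cores the snapshot was taken on
    final Map<String, Set<String>> zkSnapshots = new HashMap<>();
    // Filesystem side: core name -> snapshot names in that core's metadata file
    final Map<String, Set<String>> coreMetadata = new HashMap<>();

    SnapshotDesyncModel(List<String> cores) {
        for (String core : cores) {
            coreMetadata.put(core, new HashSet<>());
        }
    }

    /** Roughly models the create flow: ZK entry first, then the per-core step. */
    boolean create(String name) {
        zkSnapshots.put(name, new HashSet<>());
        for (Set<String> names : coreMetadata.values()) {
            if (names.contains(name)) {
                // Name already in fs metadata: the per-core step fails and the
                // ZK entry is left as a failed snapshot with no cores assigned.
                return false;
            }
        }
        for (Map.Entry<String, Set<String>> e : coreMetadata.entrySet()) {
            e.getValue().add(name);
            zkSnapshots.get(name).add(e.getKey());
        }
        return true;
    }

    /** Roughly models the delete flow: only cores recorded in ZK are cleaned up. */
    String delete(String name) {
        Set<String> cores = zkSnapshots.remove(name);
        if (cores != null) {
            for (String core : cores) {
                coreMetadata.get(core).remove(name);
            }
        }
        // Reported even when no cores were recorded, so nothing on the fs changed
        return "successfully deleted";
    }
}
```

Seeding one core's metadata with a name that ZooKeeper no longer knows about reproduces the reported behaviour in this model: `create` fails, `delete` reports success while the filesystem entry survives, and the name still cannot be reused.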
A possible workaround is to manually delete the metadata on the filesystem, but if
there is more than one snapshot for the collection, editing the binary
file to remove only the corrupt entry becomes complicated. Would it be feasible to
add a "force delete" command that scans all the cores assigned to the collection
for snapshot metadata with the given name and deletes it? Alternatively, could
the logging be improved to reflect the situation more accurately? For example,
deletion could log a warning that the snapshot is a failed one with no cores
assigned to it, and therefore no filesystem metadata will be deleted.
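Both proposals can be sketched against a simplified in-memory model. This is an illustration of the idea, not a patch: `ForceDeleteSketch`, `zkSnapshots`, `coreMetadata`, `warnings`, `delete` and `forceDelete` are all hypothetical names, and a plain list of strings stands in for Solr's logging. The regular `delete` emits the suggested warning when ZooKeeper records no cores, while `forceDelete` ignores the recorded core list and scans every core instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a warning on delete plus a "force delete" (not a Solr API). */
class ForceDeleteSketch {
    // ZooKeeper side: snapshot name -> cores recorded for it
    final Map<String, Set<String>> zkSnapshots = new HashMap<>();
    // Filesystem side: core name -> snapshot names in that core's metadata
    final Map<String, Set<String>> coreMetadata = new HashMap<>();
    // Stand-in for log output
    final List<String> warnings = new ArrayList<>();

    /** Regular delete, extended with the warning suggested in the issue. */
    void delete(String name) {
        Set<String> cores = zkSnapshots.remove(name);
        if (cores == null) {
            return;
        }
        if (cores.isEmpty()) {
            warnings.add("Snapshot '" + name + "' is a failed snapshot with no cores"
                + " assigned; no filesystem metadata will be deleted");
        }
        for (String core : cores) {
            Set<String> meta = coreMetadata.get(core);
            if (meta != null) {
                meta.remove(name);
            }
        }
    }

    /** Force delete: scan every core for the name, then drop the ZK entry. */
    List<String> forceDelete(String name) {
        List<String> cleaned = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : coreMetadata.entrySet()) {
            // Set.remove returns true only if the name was actually present
            if (e.getValue().remove(name)) {
                cleaned.add(e.getKey());
            }
        }
        zkSnapshots.remove(name);
        return cleaned;
    }
}
```

Returning the list of cleaned cores gives the caller something accurate to report, instead of an unconditional "successfully deleted". Other snapshot names stored on the same core are left untouched.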
--
This message was sent by Atlassian Jira
(v8.20.10#820010)