[
https://issues.apache.org/jira/browse/SOLR-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892798#comment-17892798
]
Sanjay Dutt commented on SOLR-17497:
------------------------------------
Yeah, you are right. I have to look into this subject (execute vs submit) more,
and into how this whole thing works.
+1 to this.
{quote}For the case here in IndexFetcher, as long as the exception that is
thrown is logged, I think we should suppress its propagation further.
{quote}
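For context on the execute-vs-submit difference, here is a minimal standalone sketch (not Solr code; the class name and messages are made up). {{submit}} wraps the task in a {{FutureTask}}, so an exception thrown by the task is captured in the {{Future}} and never reaches the thread's uncaught-exception handler; {{execute}} lets it propagate to the handler, which is how the randomized-test framework ends up catching the exception from the fsync task.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ExecuteVsSubmit {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();

    // submit: the exception is captured inside the returned Future and
    // surfaces only if somebody calls get(); nothing reaches the
    // thread's uncaught-exception handler.
    Future<?> f = pool.submit((Runnable) () -> { throw new RuntimeException("boom"); });
    try {
      f.get();
      throw new AssertionError("expected the Future to capture the exception");
    } catch (ExecutionException e) {
      System.out.println("submit -> captured in Future: " + e.getCause().getMessage());
    }

    // execute: the exception escapes the worker thread and is delivered
    // to the uncaught-exception handler (in the Solr tests, the
    // randomized-testing handler, which fails the test).
    CountDownLatch handlerCalled = new CountDownLatch(1);
    Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
      System.out.println("execute -> uncaught: " + e.getMessage());
      handlerCalled.countDown();
    });
    pool.execute(() -> { throw new RuntimeException("boom"); });
    if (!handlerCalled.await(5, TimeUnit.SECONDS)) {
      throw new AssertionError("uncaught-exception handler was never invoked");
    }
    pool.shutdown();
  }
}
{code}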
I was also looking into why we are getting the "User aborted replication"
messages. RecoveryStrategy, in the case of PULL replicas, cancels the
replication. Here is the explanation from the old JIRA.
https://issues.apache.org/jira/browse/SOLR-10233
{quote}
h3. Passive replica dies (or is unreachable)
Replica won’t be query-able. On restart, replica will recover from the leader,
following the same flow as _realtime_ replicas: set state to DOWN, then
RECOVERING, and finally ACTIVE. _Passive_ replicas will use a different
{{RecoveryStrategy}} implementation, that omits *preparerecovery,* and peer
sync attempt, it will jump to replication . If the leader didn't change, or if
the other replicas are of type “append”, replication should be incremental.
Once the first replication is done, passive replica will declare itself active
and start serving traffic.
{quote}
*RecoveryStrategy.java*
{noformat}
log.info("Stopping background replicate from leader process");
zkController.stopReplicationFromLeader(coreName);
replicate(zkController.getNodeName(), core, leaderprops);{noformat}
My own theory:
# RecoveryStrategy cancels the replication.
# FileFetcher#fetchPackets throws ReplicationHandlerException
{code:java}
if (stop) {
  stop = false;
  aborted = true;
  throw new ReplicationHandlerException("User aborted replication");
}{code}
# FileFetcher#fetch runs its finally block, where the sync is executed asynchronously
{code:java}
fsyncService.submit(() -> {
  try {
    file.sync();
  } catch (IOException e) {
    fsyncException = e;
  } catch (InterruptedException e) {
    throw new RuntimeException(e);
  }
});{code}
# At the same time, control returns to fetchLatestIndex, which performs
cleanup and closes the directory
{code:java}
finally {
  if (!cleanupDone) {
    cleanup(solrCore, tmpIndexDir, indexDir, deleteTmpIdxDir, tmpTlogDir, successfulInstall);
  }
}{code}
So I believe there is basically a race condition between steps 3 and 4. I
have not been able to reproduce it on my system yet.
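To make the suspected interleaving concrete, here is a toy standalone sketch of steps 3 and 4 (the names {{ToyDirectory}} and {{FsyncRace}} are made up; this is not Solr's API). The queued sync task runs only after the caller's cleanup has closed the directory, mirroring the AlreadyClosedException in the stack trace below; a latch forces the losing interleaving that a real run would only hit intermittently.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class FsyncRace {
  // Stand-in for the Lucene Directory behind DirectoryFile.sync().
  static class ToyDirectory {
    private final AtomicBoolean open = new AtomicBoolean(true);
    void close() { open.set(false); }
    void sync() {
      if (!open.get()) {
        throw new IllegalStateException("this Directory is closed");
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService fsyncService = Executors.newSingleThreadExecutor();
    ToyDirectory dir = new ToyDirectory();
    CountDownLatch cleanupDone = new CountDownLatch(1);

    // Step 3: FileFetcher#fetch's finally block queues the async sync.
    Future<?> sync = fsyncService.submit(() -> {
      try {
        cleanupDone.await(); // force the losing interleaving for the demo
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
      dir.sync(); // by now the directory has been closed
    });

    // Step 4: fetchLatestIndex's finally block performs cleanup concurrently.
    dir.close();
    cleanupDone.countDown();

    boolean failed = false;
    try {
      sync.get();
    } catch (ExecutionException e) {
      failed = true;
      System.out.println("sync failed: " + e.getCause().getMessage());
    }
    fsyncService.shutdown();
    if (!failed) {
      throw new AssertionError("expected the sync to fail on the closed directory");
    }
  }
}
{code}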
> Pull replicas throws AlreadyClosedException
> ---------------------------------------------
>
> Key: SOLR-17497
> URL: https://issues.apache.org/jira/browse/SOLR-17497
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Sanjay Dutt
> Priority: Major
> Attachments: Screenshot 2024-10-23 at 6.01.02 PM.png
>
>
> Recently, a common exception (org.apache.lucene.store.AlreadyClosedException:
> this Directory is closed) has been seen in multiple failed test cases.
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
> FAILED:
> org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull
> FAILED: org.apache.solr.cloud.TestPullReplica.testAddDocs
>
>
> {code:java}
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10271, name=fsyncService-6341-thread-1, state=RUNNABLE, group=TGRP-SplitShardWithNodeRoleTest]
>   at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
> Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is closed
>   at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
>   at app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
>   at app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
>   at app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
>   at app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
>   at app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
>   at app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
>   at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at [email protected]/java.lang.Thread.run(Thread.java:829)
> {code}
>
> The interesting thing about these test cases is that they all share the same
> kind of setup: each has one shard and two replicas – one NRT and the other PULL.
>
> Walking through one of the test case executions.
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
>
> Test flow
> 1. Create a collection with 1 NRT and 1 PULL replica
> 2. waitForState
> 3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
> 4. Index another document.
> 5. waitForNumDocsInAllActiveReplicas(1);
> 6. Stop Pull replica
> 7. Index another document
> 8. waitForNumDocsInAllActiveReplicas(2);
> 9. Start Pull Replica
> 10. waitForState
> 11. waitForNumDocsInAllActiveReplicas(2);
>
> As per the logs, the whole sequence executed successfully. Here is the link to
> the logs:
> [https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
> (link may stop working in the future)
>
> The last step, which verifies that every active replica has two documents,
> logged an INFO message, which is further proof that it completed
> successfully.
>
> {code:java}
> 616575 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 (https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/) has all 2 docs
> 616606 INFO (qtp1091538342-13057-null-11348) [n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0
> 616607 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 (https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/) has all 2 docs{code}
>
> *Where is the issue then?*
> In the logs, it has been observed that after the PULL replica was restarted,
> the recovery process started, and after fetching all the file info from the
> NRT replica, the replication aborted and logged "User aborted replication".
>
> {code:java}
> o.a.s.h.IndexFetcher User aborted Replication =>
>   org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
>     at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
> org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication{code}
>
> Inside IndexFetcher, once the replication is aborted, it performs the
> cleanup() operation, which closes the file and deletes it only if the
> downloaded bytes do not equal the expected size.
> {code:java}
> private void cleanup() {
>   try {
>     file.close();
>   } catch (Exception e) {
>     /* no-op */
>     log.error("Error closing file: {}", this.saveAs, e);
>   }
>   if (bytesDownloaded != size) {
>     // if the download is not complete then
>     // delete the file being downloaded
>     try {
>       file.delete();
>     } catch (Exception e) {
>       log.error("Error deleting file: {}", this.saveAs, e);
>     }
>     // if the failure is due to a user abort it is returned normally else an exception is thrown
>     if (!aborted)
>       throw new SolrException(
>           SolrException.ErrorCode.SERVER_ERROR,
>           "Unable to download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + size);
>   }
> }{code}
> After that, a sync operation is performed on a separate thread, and that is
> where it fails.
> {code:java}
> fsyncService.execute(() -> {
>   try {
>     file.sync();
>   } catch (IOException e) {
>     fsyncException = e;
>   }
> });{code}
> Now, two things:
> 1. Why would the replication be aborted in the first place? And who triggers it?
> 2. Should the sync be skipped when the replication has been aborted?
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)