[
https://issues.apache.org/jira/browse/SOLR-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sanjay Dutt updated SOLR-17497:
-------------------------------
Description:
Recently, a common exception (org.apache.lucene.store.AlreadyClosedException:
this Directory is closed) has been seen in multiple failed test cases:
FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
FAILED:
org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull
FAILED: org.apache.solr.cloud.TestPullReplica.testAddDocs
{code:java}
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10271, name=fsyncService-6341-thread-1, state=RUNNABLE, group=TGRP-SplitShardWithNodeRoleTest]
  at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is closed
  at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
  at app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
  at app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
  at app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
  at app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
  at app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
  at app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
  at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at [email protected]/java.lang.Thread.run(Thread.java:829)
{code}
The interesting thing about these test cases is that they all share the same kind of setup: one shard with two replicas, one NRT and one PULL.
Let's walk through the execution of one of them:
FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
Test flow (a rough code sketch follows the list):
1. Create a collection with 1 NRT and 1 PULL replica
2. waitForState
3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
4. Index a document.
5. waitForNumDocsInAllActiveReplicas(1);
6. Stop Pull replica
7. Index another document
8. waitForNumDocsInAllActiveReplicas(2);
9. Start Pull Replica
10. waitForState
11. waitForNumDocsInAllActiveReplicas(2);
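For orientation only, here is a rough sketch of that flow in SolrCloudTestCase-style test code. It is not the actual test source; waitForNumDocsInAllActiveReplicas is the test's own helper and is assumed here, as is the "conf" configset name.
{code:java}
// Hypothetical sketch of the test flow, assuming it runs inside a SolrCloudTestCase subclass.
String collectionName = "pull_replica_test_kill_pull_replica";

// 1-3: one shard, 1 NRT replica, 0 TLOG replicas, 1 PULL replica
CollectionAdminRequest.createCollection(collectionName, "conf", 1, 1, 0, 1)
    .process(cluster.getSolrClient());
cluster.waitForActiveCollection(collectionName, 1, 2);   // waitForState
waitForNumDocsInAllActiveReplicas(0);                    // test helper

// 4-5: index the first document and wait for the PULL replica to see it
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "1"));
cluster.getSolrClient().commit(collectionName);
waitForNumDocsInAllActiveReplicas(1);

// 6-8: stop the node hosting the PULL replica, then index a second document
Replica pullReplica =
    getCollectionState(collectionName).getReplicas(EnumSet.of(Replica.Type.PULL)).get(0);
JettySolrRunner pullJetty = cluster.getReplicaJetty(pullReplica);
pullJetty.stop();
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "2"));
cluster.getSolrClient().commit(collectionName);
waitForNumDocsInAllActiveReplicas(2);

// 9-11: restart the PULL replica; it recovers via IndexFetcher and should end up with 2 docs
pullJetty.start();
cluster.waitForActiveCollection(collectionName, 1, 2);   // waitForState
waitForNumDocsInAllActiveReplicas(2);
{code}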
As per the logs, the whole sequence executed successfully. Here is the link to
the logs:
[https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
(link may stop working in the future)
The last step, which verifies that every active replica has both documents, logged the INFO messages below, which is further proof that the sequence completed successfully.
{code:java}
616575 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 (https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/) has all 2 docs
616606 INFO (qtp1091538342-13057-null-11348) [n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0
616607 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 (https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/) has all 2 docs
{code}
*Where is the issue then?*
In the logs, it can be observed that after the PULL replica was restarted, the recovery process started, and after fetching the file list from the NRT replica, the replication aborted and logged "User aborted Replication":
{code:java}
o.a.s.h.IndexFetcher User aborted Replication => org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
  at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
{code}
Inside IndexFetcher, once the fetch is aborted, a cleanup() operation is performed, which closes the file and deletes it only if the number of bytes downloaded does not equal the expected size:
{code:java}
private void cleanup() {
  try {
    file.close();
  } catch (Exception e) {
    /* no-op */
    log.error("Error closing file: {}", this.saveAs, e);
  }
  if (bytesDownloaded != size) {
    // if the download is not complete then
    // delete the file being downloaded
    try {
      file.delete();
    } catch (Exception e) {
      log.error("Error deleting file: {}", this.saveAs, e);
    }
    // if the failure is due to a user abort it is returned normally else an exception is thrown
    if (!aborted)
      throw new SolrException(
          SolrException.ErrorCode.SERVER_ERROR,
          "Unable to download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + size);
  }
}
{code}
After that, a sync operation is submitted to the fsyncService thread pool, and that is where it fails:
{code:java}
fsyncService.execute(
    () -> {
      try {
        file.sync();
      } catch (IOException e) {
        fsyncException = e;
      }
    });
{code}
Now, two questions:
1. Why would the replication be aborted in the first place, and who triggers the abort?
2. Should the sync be skipped when the replication has been aborted? (see the sketch below)
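For illustration only, here is a minimal sketch of the kind of guard question 2 is hinting at, assuming the abort flag and download counters are visible where the fsync task is scheduled. This is not the actual IndexFetcher code, just one possible shape of the check.
{code:java}
// Hypothetical guard: only schedule the fsync if the fetch was not aborted
// and the file was downloaded completely (so cleanup() did not delete it).
// 'aborted', 'bytesDownloaded', 'size', 'file', 'fsyncService' and 'fsyncException'
// are the same fields used in the snippets above.
if (!aborted && bytesDownloaded == size) {
  fsyncService.execute(
      () -> {
        try {
          file.sync();
        } catch (IOException e) {
          fsyncException = e;
        }
      });
}
{code}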
> Pull replicas throws AlreadyClosedException
> ---------------------------------------------
>
> Key: SOLR-17497
> URL: https://issues.apache.org/jira/browse/SOLR-17497
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Sanjay Dutt
> Priority: Major
>