[
https://issues.apache.org/jira/browse/SOLR-13470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839807#comment-16839807
]
Hoss Man edited comment on SOLR-13470 at 5/14/19 9:25 PM:
----------------------------------------------------------
Here's an example of how things can go wrong with the attached patch that
hardens up the asserts regarding the expected exception...
{noformat}
...
[junit4] 2> 12780 WARN (explicit-fetchindex-cmd) [ ]
o.a.s.h.IndexFetcher File segments_2 did not match. expected checksum is
1298216336 and actual is checksum 1205232915. expected length is 252 and actual
length is 252
[junit4] 2> 12780 INFO (explicit-fetchindex-cmd) [ ]
o.a.s.h.IndexFetcher This disk does not have enough space to download the index
from leader/master. So cleaning up the local index. This may lead to loss of
data/or node if index replication fails in between
[junit4] 2> 12785 INFO (Thread-19) [ ] o.a.s.c.s.i.HttpSolrClient
nocommit: remote ex (403) w/o error? ...
{responseHeader={status=0,QTime=39},response={numFound=1200,start=0,docs=[]}}
[junit4] 2> 12794 ERROR (Thread-19) [ ]
o.a.s.h.TestReplicationHandlerDiskOverFlow Query Thread Failure
[junit4] 2> =>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:46668/solr/collection1: Forbidden
[junit4] 2>
[junit4] 2>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:46668/solr/collection1: Forbidden
[junit4] 2>
[junit4] 2> request:
http://127.0.0.1:46668/solr/collection1/select?q=*:*&rows=0&wt=javabin&version=2
[junit4] 2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:650)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:207)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:987) ~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1002) ~[java/:?]
[junit4] 2> at
org.apache.solr.handler.TestReplicationHandlerDiskOverFlow.lambda$testDiskOverFlow$1(TestReplicationHandlerDiskOverFlow.java:154)
~[test/:?]
[junit4] 2> at java.lang.Thread.run(Thread.java:834) [?:?]
...
[junit4] HEARTBEAT J0 PID(26623@tray): 2019-05-14T14:01:33, stalled for
69.1s at: TestReplicationHandlerDiskOverFlow.testDiskOverFlow
[junit4] 2> 72012 INFO (indexFetcher-27-thread-1) [ ]
o.a.s.h.ReplicationHandler Poll disabled
[junit4] 2> 72780 ERROR (explicit-fetchindex-cmd) [ ]
o.a.s.h.TestReplicationHandlerDiskOverFlow IndexFetcher Thread Failure
[junit4] 2> => java.util.concurrent.TimeoutException
[junit4] 2> at
java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:258)
[junit4] 2> java.util.concurrent.TimeoutException: null
[junit4] 2> at
java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:258) ~[?:?]
[junit4] 2> at
java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:436) ~[?:?]
[junit4] 2> at
org.apache.solr.handler.TestReplicationHandlerDiskOverFlow$2.getAsBoolean(TestReplicationHandlerDiskOverFlow.java:137)
~[test/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.deleteFilesInAdvance(IndexFetcher.java:1133)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1033)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:575)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:351)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:422)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.ReplicationHandler.lambda$fetchIndex$0(ReplicationHandler.java:345)
~[java/:?]
[junit4] 2> at java.lang.Thread.run(Thread.java:834) [?:?]
...
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestReplicationHandlerDiskOverFlow -Dtests.method=testDiskOverFlow
-Dtests.seed=7A25E0BCF2F3D8D4 -Dtests.slow=true -Dtests.badapples=true
-Dtests.locale=kok -Dtests.timezone=MET -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
[junit4] FAILURE 70.4s | TestReplicationHandlerDiskOverFlow.testDiskOverFlow
<<<
[junit4] > Throwable #1: java.lang.AssertionError: threads encountered
failures (see logs for when) expected:<[]> but
was:<[org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://127.0.0.1:46668/solr/collection1: Forbidden
[junit4] > request:
http://127.0.0.1:46668/solr/collection1/select?q=*:*&rows=0&wt=javabin&version=2,
java.util.concurrent.TimeoutException]>
[junit4] > at
__randomizedtesting.SeedInfo.seed([7A25E0BCF2F3D8D4:B017476C918DD2BB]:0)
[junit4] > at
org.apache.solr.handler.TestReplicationHandlerDiskOverFlow.testDiskOverFlow(TestReplicationHandlerDiskOverFlow.java:189)
[junit4] > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit4] > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit4] > at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit4] > at
java.base/java.lang.reflect.Method.invoke(Method.java:566)
[junit4] > at java.base/java.lang.Thread.run(Thread.java:834)
[junit4] 2> NOTE: leaving temporary files on disk at:
/home/hossman/lucene/dev/solr/build/solr-core/test/J0/temp/solr.handler.TestReplicationHandlerDiskOverFlow_7A25E0BCF2F3D8D4-001
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene80):
{name=BlockTreeOrds(blocksize=128), id=BlockTreeOrds(blocksize=128)},
docValues:{}, maxPointsInLeafNode=1340, maxMBSortInHeap=5.7037203451015515,
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@57ee8eda),
locale=kok, timezone=MET
[junit4] 2> NOTE: Linux 3.19.0-84-generic amd64/Oracle Corporation 11.0.2
(64-bit)/cpus=4,threads=2,free=205025280,total=260046848
[junit4] 2> NOTE: All tests run in this JVM:
[TestReplicationHandlerDiskOverFlow]
[junit4] Completed [1/1 (1!)] in 72.25s, 1 test, 1 failure <<< FAILURES!
{noformat}
...note that the added logging in HttpSolrClient shows that we're evidently
getting a valid search result w/a 403 response code (did the query somehow
already run successfully before the server realized it should return 403?)
But re-running this same seed multiple times will frequently pass (ie: the
exception message makes it all the way to the client)
(Jira comment edited to fix mangled whitespace)
was (Author: hossman):
Here's an example of how things can go wrong with the attached patch that
hardens up the asserts regarding the expected exception...
{noformat}
...
[junit4] 2> 12780 WARN (explicit-fetchindex-cmd) [ ] o.a.s.h.IndexFetcher File
segments_2 did not match. expected checksum is 1298216336 and actual is
checksum 1205232915. expected length is 252 and actual length is 252
[junit4] 2> 12780 INFO (explicit-fetchindex-cmd) [ ] o.a.s.h.IndexFetcher This
disk does not have enough space to download the index from leader/master. So
cleaning up the local index. This may lead to loss of data/or node if index
replication fails in between
[junit4] 2> 12785 INFO (Thread-19) [ ] o.a.s.c.s.i.HttpSolrClient nocommit:
remote ex (403) w/o error? ...
{responseHeader={status=0,QTime=39},response={numFound=1200,start=0,docs=[]}}
[junit4] 2> 12794 ERROR (Thread-19) [ ]
o.a.s.h.TestReplicationHandlerDiskOverFlow Query Thread Failure
[junit4] 2> =>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:46668/solr/collection1: Forbidden
[junit4] 2>
[junit4] 2>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:46668/solr/collection1: Forbidden
[junit4] 2>
[junit4] 2> request:
http://127.0.0.1:46668/solr/collection1/select?q=*:*&rows=0&wt=javabin&version=2
[junit4] 2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:650)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:207)
~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:987) ~[java/:?]
[junit4] 2> at
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1002) ~[java/:?]
[junit4] 2> at
org.apache.solr.handler.TestReplicationHandlerDiskOverFlow.lambda$testDiskOverFlow$1(TestReplicationHandlerDiskOverFlow.java:154)
~[test/:?]
[junit4] 2> at java.lang.Thread.run(Thread.java:834) [?:?]
...
[junit4] HEARTBEAT J0 PID(26623@tray): 2019-05-14T14:01:33, stalled for 69.1s
at: TestReplicationHandlerDiskOverFlow.testDiskOverFlow
[junit4] 2> 72012 INFO (indexFetcher-27-thread-1) [ ]
o.a.s.h.ReplicationHandler Poll disabled
[junit4] 2> 72780 ERROR (explicit-fetchindex-cmd) [ ]
o.a.s.h.TestReplicationHandlerDiskOverFlow IndexFetcher Thread Failure
[junit4] 2> => java.util.concurrent.TimeoutException
[junit4] 2> at
java.base/java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:258)
[junit4] 2> java.util.concurrent.TimeoutException: null
[junit4] 2> at
java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:258) ~[?:?]
[junit4] 2> at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:436)
~[?:?]
[junit4] 2> at
org.apache.solr.handler.TestReplicationHandlerDiskOverFlow$2.getAsBoolean(TestReplicationHandlerDiskOverFlow.java:137)
~[test/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.deleteFilesInAdvance(IndexFetcher.java:1133)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1033)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:575)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:351)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:422)
~[java/:?]
[junit4] 2> at
org.apache.solr.handler.ReplicationHandler.lambda$fetchIndex$0(ReplicationHandler.java:345)
~[java/:?]
[junit4] 2> at java.lang.Thread.run(Thread.java:834) [?:?]
...
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestReplicationHandlerDiskOverFlow -Dtests.method=testDiskOverFlow
-Dtests.seed=7A25E0BCF2F3D8D4 -Dtests.slow=true -Dtests.badapples=true
-Dtests.locale=kok -Dtests.timezone=MET -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
[junit4] FAILURE 70.4s | TestReplicationHandlerDiskOverFlow.testDiskOverFlow <<<
[junit4] > Throwable #1: java.lang.AssertionError: threads encountered failures
(see logs for when) expected:<[]> but
was:<[org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://127.0.0.1:46668/solr/collection1: Forbidden
[junit4] > request:
http://127.0.0.1:46668/solr/collection1/select?q=*:*&rows=0&wt=javabin&version=2,
java.util.concurrent.TimeoutException]>
[junit4] > at
__randomizedtesting.SeedInfo.seed([7A25E0BCF2F3D8D4:B017476C918DD2BB]:0)
[junit4] > at
org.apache.solr.handler.TestReplicationHandlerDiskOverFlow.testDiskOverFlow(TestReplicationHandlerDiskOverFlow.java:189)
[junit4] > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit4] > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit4] > at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit4] > at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[junit4] > at java.base/java.lang.Thread.run(Thread.java:834)
[junit4] 2> NOTE: leaving temporary files on disk at:
/home/hossman/lucene/dev/solr/build/solr-core/test/J0/temp/solr.handler.TestReplicationHandlerDiskOverFlow_7A25E0BCF2F3D8D4-001
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene80):
{name=BlockTreeOrds(blocksize=128), id=BlockTreeOrds(blocksize=128)},
docValues:{}, maxPointsInLeafNode=1340, maxMBSortInHeap=5.7037203451015515,
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@57ee8eda),
locale=kok, timezone=MET
[junit4] 2> NOTE: Linux 3.19.0-84-generic amd64/Oracle Corporation 11.0.2
(64-bit)/cpus=4,threads=2,free=205025280,total=260046848
[junit4] 2> NOTE: All tests run in this JVM:
[TestReplicationHandlerDiskOverFlow]
[junit4] Completed [1/1 (1!)] in 72.25s, 1 test, 1 failure <<< FAILURES!
{noformat}
...note that the added logging in HttpSolrClient shows that we're evidently
getting a valid search result w/a 403 response code (did the query somehow
already run successfully before the server realized it should return 403?)
But re-running this same seed multiple times will frequently pass (ie: the
exception message makes it all the way to the client)
> SolrException msg not always propogated to HttpClient (may be specific to
> SOLR-12999 ?)
> ---------------------------------------------------------------------------------------
>
> Key: SOLR-13470
> URL: https://issues.apache.org/jira/browse/SOLR-13470
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: SOLR-13470.patch
>
>
> While working on some test hardening for SOLR-12999, I discovered a strange
> bug related to how SolrExceptions are propogated to HttpClients -- sometimes
> the message set by the server side code when throwing the SolrException is
> set in the remote exception recieved by the HttpSolrClient, other times it is
> not.
> it's not clear to me if this is specific to the IndexFetcher related code
> (added in SOLR-12999) that throw SolrExceptions when the index is in the
> middle of a full copy, or if it's a general problem that can happen with any
> SolrException->HTTP->RemoteSolrException via HttpSolrClient that only happens
> to manifests because of some quirk in the threading of
> TestReplicationHandlerDiskOverFlow.
> (perhaps because we don't have a lot of HTTP level tests checking the
> exception message?)
> At the moment, TestReplicationHandlerDiskOverFlow works around this issue by
> only comparing the HTTP Staus code to ensure it's what's expected, w/o
> checking the getMessage() ... I'll attach a patch that demonstrates how
> including a getMessage() assertion can (sporadically) fail, and includes some
> nocommit debugging code i added to HttpSolrClient to try and make sense of
> what's happening...
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]