[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-06 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785941#comment-16785941
 ] 

Hoss Man commented on LUCENE-8692:
--

{quote}I think we need to be careful here. From my perspective there are 3 
types of exceptions here:
{quote}
Right  ok. This is definitely more nuanced then I originally realized, I 
was lumping #2 and #3 into the same bucket in my head, and hadn't even 
remembered/considered the "rollback" situation.

FWIW: My "mental model" when I first raised this issue was:
 * We've got this MockDirectoryWrapper.corruptFiles helper method that corrupts 
files under the covers in ways that are unrecoverable from a client perspective
 * Presumably then the exceptions a client gets as a result should be "tragic"
 * When a test uses corruptFiles, here are the code paths that can result in 
exceptions bubbling up to a client that are not marked tragic

But your (IIUC) point is that in the general case, there is no reason some of 
these code paths (like maybeMerge & startCommit) can/should assume that _any_ 
exceptions bubbling up are in fact unrecoverable / traggic – in this test it 
may be, but the code doesn't know that: it's very possible that a rollback will 
work.

{quote}Now we can debate if we want to change this and we can, in-fact I am all 
for making it even more strict especially since it's inconsistent with what we 
do if addDocument fails with an aborting exception.
{quote}
It definitely seems like there should be _something_ we can/should do to better 
recognize situations like this as "unrecoverable" and be more strict in dealing 
with low level exceptions during things like commit – but I'm out definitely 
out of my depth in understanding/suggesting what that might look like.

Clearly the current patch is being too aggressive in what it treats as "tragic".

> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-06 Thread Simon Willnauer (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785625#comment-16785625
 ] 

Simon Willnauer commented on LUCENE-8692:
-

{quote}
For now I've updated the patch to take the simplest possible approach to 
checking for MergeAbortedException
{quote}


+1

{quote}
Well, to flip your question around: is there an example of a Throwable you can 
think of bubbling up out of IndexWriter.startCommit() that should NOT be 
considered fatal?
{quote}
I think we need to be careful here. From my perspective there are 3 types of 
exceptions here:
 * unrecoverable exceptions aka. VirtualMachineErrors
 * exceptions that happen during indexing and are not recoverable (these are 
handled in DocumentsWriter)
 * exceptions that cause dataloss or inconsistencies (we didn't handle those as 
fatal yet at least not consistently) but we only catch VirtualMachineError.

Those are in particular:

 * getReader()
 * deleteAll()
 * addIndexes()
 * flushNextBuffer()
 * prepareCommitInternal() 
 * doFlush()
 * startCommit()

Those methods might cause documents go missing etc. but we treated them not as 
fatal or tragic events since a user could always call rollback() to go back the 
the last known safe-point / previous commit. Now we can debate if we want to 
change this and we can, in-fact I am all for making it even more strict 
especially since it's inconsistent with what we do if addDocument fails with an 
aborting exception. 
If we do that we need to see if rollback still has a purpose and maybe remove 
it?

now speaking of maybeMerge I don't see why we need to close the index writer 
with a tragic event, there is no dataloss nor an inconsistency? From that logic 
I don't think we need to handle these exceptions in such a drastic way?

{quote}
I don't use github for lucene development – I track all contributions as 
patches in the official issue tracker for the project as recommended by our 
official guidelines : )  ... but i'll go ahead and create a jira/LUCENE-8692 
branch if that will help you review.
{quote}

Bummer, I am not sure branches help. Working like it's still 1999 is a pain we 
should fix our guidelines.



> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785006#comment-16785006
 ] 

ASF subversion and git services commented on LUCENE-8692:
-

Commit 4bd7d6810dbf50ae8f07c1051378f482ff953073 in lucene-solr's branch 
refs/heads/jira/LUCENE-8692 from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4bd7d68 ]

LUCENE-8692: initial attempt at fixing bug


> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785005#comment-16785005
 ] 

ASF subversion and git services commented on LUCENE-8692:
-

Commit 2b28d9d9d0cdc35dfdc202f23a9ef1d27b2a6083 in lucene-solr's branch 
refs/heads/jira/LUCENE-8692 from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2b28d9d ]

LUCENE-8692: add additional logging to Solr test that first surfaced bug


> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785004#comment-16785004
 ] 

ASF subversion and git services commented on LUCENE-8692:
-

Commit af9da98e87ee2a44532cd9baf2a1bc6e64865b0d in lucene-solr's branch 
refs/heads/jira/LUCENE-8692 from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=af9da98 ]

LUCENE-8692: updates to TestStressIndexing2 demonstrating bug


> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785007#comment-16785007
 ] 

ASF subversion and git services commented on LUCENE-8692:
-

Commit e93313d42c4e5642e58707b74ed119776086b590 in lucene-solr's branch 
refs/heads/jira/LUCENE-8692 from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e93313d ]

LUCENE-8692: fix file leak in MockRandomPostingsFormat that surfaces when 
corrupting files


> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-05 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784997#comment-16784997
 ] 

Hoss Man commented on LUCENE-8692:
--

bq. (see also my nocommit comments about the existing tragicEvent() call in 
prepareCommitInternal() ... but that hadn't triggered any failures in the test 
so I hadn't touched it)

I spoke too soon -- beasting just turned up this interesting little situation...

{noformat}
hossman@tray:~/lucene/dev/lucene/core [master] $ ant beast -Dbeast.iters=100 
-Dtests.iters=100 -Dtestcase=TestStressIndexing2 
-Dtests.method=testRandomCorruptionIsTragic\*
...
 [beaster] Beast round 34 results: 
/home/hossman/lucene/dev/lucene/build/core/test/34
  [beaster] The following error occurred while executing this line:
  [beaster] /home/hossman/lucene/dev/lucene/common-build.xml:1572: The 
following error occurred while executing this line:
  [beaster] /home/hossman/lucene/dev/lucene/common-build.xml:1099: There were 
test failures: 1 suite, 100 tests, 1 failure [seed: CABE666E4674CFB2]
  [beaster] Executing 1 suite with 1 JVM.
  [beaster] 
  [beaster] Started J0 PID(10111@localhost).
  [beaster]   2> NOTE: reproduce with: ant test  -Dtestcase=TestStressIndexing2 
-Dtests.method=testRandomCorruptionIsTragic -Dtests.seed=CABE666E4674CFB2 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=cs 
-Dtests.timezone=America/Nipigon -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
  [beaster] [15:50:16.736] FAILURE 0.02s | 
TestStressIndexing2.testRandomCorruptionIsTragic 
{seed=[CABE666E4674CFB2:682DC0F2BA2A235F]} <<<
  [beaster]> Throwable #1: java.lang.AssertionError: index update 
encountered throwable, but no tragic event recorded: java.lang.AssertionError
  [beaster]>at 
__randomizedtesting.SeedInfo.seed([CABE666E4674CFB2:682DC0F2BA2A235F]:0)
  [beaster]>at org.junit.Assert.fail(Assert.java:88)
  [beaster]>at org.junit.Assert.assertTrue(Assert.java:41)
  [beaster]>at org.junit.Assert.assertNotNull(Assert.java:712)
  [beaster]>at 
org.apache.lucene.index.TestStressIndexing2$CorruptibleIndexingThread.run(TestStressIndexing2.java:1019)
  [beaster]>at 
org.apache.lucene.index.TestStressIndexing2.testRandomCorruptionIsTragic(TestStressIndexing2.java:144)
  [beaster]>at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown 
Source)
  [beaster]>at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  [beaster]>at java.lang.reflect.Method.invoke(Method.java:498)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  [beaster]>at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  [beaster]>at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  [beaster]>at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  [beaster]>at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  [beaster]>at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  [beaster]>at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]>at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  [beaster]>at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  [beaster]>at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  [beaster]>at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  [beaster]>at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  [beaster]>at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  [beaster]>at 

[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-05 Thread Simon Willnauer (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784438#comment-16784438
 ] 

Simon Willnauer commented on LUCENE-8692:
-



{noformat}
I think there is an issue with the patch with MergeAbortedExeption indeed given 
that registerMerge might throw such an exception. Maybe we should move this try 
block to registerMerge instead where we know which OneMerge is being registered 
(and is also where the exception is thrown when estimating the size of the 
merge).
{noformat}

+1

{code:java}
-} catch (VirtualMachineError tragedy) {
+} catch (Throwable tragedy) {
   tragicEvent(tragedy, "startCommit");
{code}

I am not sure why we need to treat every exception as fatal in this case?

I also wonder if we could move this to a PR on github, iterations would be 
simpler and comments too. I can't tell which patch is relevant which one isn't.

> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-03-04 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783151#comment-16783151
 ] 

Adrien Grand commented on LUCENE-8692:
--

bq. the new test seems to have also uncovered some issues with IndexWriter 
leaking files in some tragic event situations

This one is a bug in {{MockRandomPostingsFormat#fieldsProducer}}. It should use 
try-with-resources when opening {{in}} to make sure it eventually gets closed.

I think there is an issue with the patch with MergeAbortedExeption indeed given 
that {{registerMerge}} might throw such an exception. Maybe we should move this 
try block to {{registerMerge}} instead where we know which OneMerge is being 
registered (and is also where the exception is thrown when estimating the size 
of the merge).

[~mikemccand] [~rcmuir] [~simonw] You might want to have a look at this patch 
since you are more familiar with tragedy handling than I am.




> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> -
>
> Key: LUCENE-8692
> URL: https://issues.apache.org/jira/browse/LUCENE-8692
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> 
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() nay not reflect all corrupting exceptions (notably: NoSuchFileException)

2019-02-12 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766039#comment-16766039
 ] 

Adrien Grand commented on LUCENE-8692:
--

I had a quick look and every time the test failed for me it was in the 
maybeMerge call that is triggered by the commit() call, eg.

{noformat}
java.nio.file.NoSuchFileException: _0.fdx
at 
org.apache.lucene.store.ByteBuffersDirectory.fileLength(ByteBuffersDirectory.java:155)
at 
org.apache.lucene.store.MockDirectoryWrapper.fileLength(MockDirectoryWrapper.java:1009)
at 
org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:217)
at 
org.apache.lucene.index.IndexWriter.registerMerge(IndexWriter.java:4176)
at 
org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2197)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2154)
at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3455)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3407)
at 
org.apache.lucene.index.TestIndexWriterExceptions.testRandomCorruptionIsTragic(TestIndexWriterExceptions.java:2136)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at 
org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.lang.Thread.run(Thread.java:748)
{noformat}