[jira] [Reopened] (SOLR-13622) Add FileStream Streaming Expression

2019-09-13 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-13622:
-

Uwe's jenkins servers weren't being included in my reports for over a month --
that's why you didn't see any StreamExpressionTest failures in my reports when
you looked last month.

In reality, Uwe's windows builds have started picking up a new type of failure:
file handles are being leaked (and thus the test framework can't remove the temp
files that are still open)...

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=StreamExpressionTest -Dtests.seed=607225F2726A5625 -Dtests.slow=true 
-Dtests.locale=ar-PS -Dtests.timezone=Kwajalein -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.00s J1 | StreamExpressionTest (suite) <<<
   [junit4]> Throwable #1: java.io.IOException: Could not remove the 
following files (in the order of attempts):
   [junit4]>
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1\secondLevel2.txt:
 java.nio.file.FileSystemException: 
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1\secondLevel2.txt:
 The process cannot access the file because it is being used by another process.
   [junit4]>
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1:
 java.nio.file.DirectoryNotEmptyException: 
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles\directory1
   [junit4]>
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles:
 java.nio.file.DirectoryNotEmptyException: 
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2\userfiles
   [junit4]>
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2:
 java.nio.file.DirectoryNotEmptyException: 
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001\node2
   [junit4]>
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001:
 java.nio.file.DirectoryNotEmptyException: 
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001\tempDir-001
   [junit4]>
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001:
 java.nio.file.DirectoryNotEmptyException: 
C:\Users\jenkins\workspace\Lucene-Solr-master-Windows\solr\build\solr-solrj\test\J1\temp\solr.client.solrj.io.stream.StreamExpressionTest_607225F2726A5625-001
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([607225F2726A5625]:0)
   [junit4]>at org.apache.lucene.util.IOUtils.rm(IOUtils.java:319)
   [junit4]>at java.base/java.lang.Thread.run(Thread.java:835)
{noformat}

I've only ever seen {{secondLevel2.txt}} show up as the problem -- based
on how the test works, that suggests _either_ that the multi-file usage (ie:
{{cat("topLevel1.txt,directory1\secondLevel2.txt")}}) causes its _second_
arg to be leaked, _or_ that the single-arg directory usage (ie:
{{cat("directory1")}}) causes the last file in the directory to be leaked (*or
both*).

skimming the code in CatStream, this doesn't seem too surprising -- AFAICT the
only time {{currentFileLines}} gets closed is when {{maxLines}} gets
exceeded, or when {{allFilesToCrawl.hasNext()}} is true ... if there are no
more files to crawl, or a file is 0 bytes (ie: {{currentFileLines.hasNext()}}
never returns true), then the current / "last" file will never be closed.
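
To make the suspected control flow concrete, here's a paraphrased, self-contained sketch -- *not* the actual CatStream source (every name below is made up) -- showing that the only paths that close the current reader are the maxLines check and the "advance to the next file" branch:

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

// Paraphrased sketch of the control flow described above -- NOT the real CatStream
// source; every name here is made up for illustration.
class CatStreamLeakSketch {
  private final Iterator<Path> allFilesToCrawl;
  private final int maxLines;              // <= 0 means "unlimited"
  private BufferedReader currentFileLines; // reader over the file currently being cat'ed
  private int linesReturned;

  CatStreamLeakSketch(Iterator<Path> filesToCrawl, int maxLines) {
    this.allFilesToCrawl = filesToCrawl;
    this.maxLines = maxLines;
  }

  /** returns the next line across all crawled files, or null at EOF */
  String read() throws IOException {
    while (true) {
      if (maxLines > 0 && linesReturned >= maxLines) {
        closeCurrentFile();                // closed when maxLines is exceeded ...
        return null;
      }
      String line;
      if (currentFileLines != null && (line = currentFileLines.readLine()) != null) {
        linesReturned++;
        return line;
      }
      if (allFilesToCrawl.hasNext()) {
        closeCurrentFile();                // ... or when advancing to the next file ...
        currentFileLines = Files.newBufferedReader(allFilesToCrawl.next());
      } else {
        // ... but nothing closes the reader on this path, so the "last" file (or a
        // 0-byte file) keeps an open handle -- which Windows then refuses to delete.
        return null;
      }
    }
  }

  private void closeCurrentFile() throws IOException {
    if (currentFileLines != null) {
      currentFileLines.close();
      currentFileLines = null;
    }
  }
}
{code}

If that reading is right, the fix is probably just to also close the current reader on the "no more files" path (and/or unconditionally in the stream's close()).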


> Add FileStream Streaming Expression
> ---
>
> Key: SOLR-13622
> URL: https://issues.apache.org/jira/browse/SOLR-13622
> Project: Solr
>  Issue Type: New Feature
>  Components: streaming expressions
>Reporter: Joel Bernstein
>

[jira] [Commented] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)

2019-09-10 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926763#comment-16926763
 ] 

Hoss Man commented on SOLR-13746:
-

thanks steve.

> Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
> --
>
> Key: SOLR-13746
> URL: https://issues.apache.org/jira/browse/SOLR-13746
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
>
> I just realized that back in June, there was a miscommunication between 
> myself & Uwe (and a lack of double checking on my part!) regarding upgrading 
> the JVM versions on our jenkins machines...
>  * 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e]
>  * 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E]
> ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM 
> used on the _*apache*_  jenkins nodes is still (as of 2019-09-06)  
> "11.0.1+13-LTS" ...
> [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText]
> {noformat}
> ...
> [java-info] java version "11.0.1"
> [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle 
> Corporation)
> [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle 
> Corporation)
> ...
> {noformat}
> This means that even after the changes made in SOLR-12988 to re-enable SSL 
> testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the 
> yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use 
> SSL (either explicitly, or due to randomization) because of the logic in 
> SSLTestConfig that detects bad JVM versions and prevents confusion/spurious 
> failures.
> We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 
> ASAP.






[jira] [Commented] (SOLR-9658) Caches should have an optional way to clean if idle for 'x' mins

2019-09-09 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926186#comment-16926186
 ] 

Hoss Man commented on SOLR-9658:


{quote}refactored cache impls to allow inserting synthetic entries, and changed 
the unit tests to use these methods. It turned out that the management of 
oldestEntry needs to be improved in all caches when we allow the creation time 
in more recently added entries to go back...
{quote}
Ah interesting ... IIUC the existing code (in ConcurrentLFUCache for example)
just tracks "lastAccessed" for each cache entry (and "oldest" for the cache as
a whole) via an incremented counter across all entries -- but now you're using
actual NANO_SECOND timestamps. This seems like an "ok" change (the API has
never exposed these "lastAccessed" values, correct?) but I just want to double
check since you've looked at this & thought about it more than me: do you see
any risk here? (ie: please don't let me talk you into an Impl change that's "a
bad idea" just because it makes the kind of test I was advocating easier to
write)
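
(To make sure I'm picturing the change correctly -- a hypothetical sketch of what I assume the timestamp-based idle check boils down to, *not* the actual patch; all names below are made up:)

{code:java}
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a nanoTime-based idle sweep -- NOT the actual
// ConcurrentLFUCache/ConcurrentLRUCache patch; names here are made up.
class IdleSweepSketch<K, V> {
  static final class CacheEntry<T> {
    T value;
    volatile long lastAccessedNs; // a real timestamp now, instead of an incremented counter
  }

  /** evict anything not accessed within maxIdleSec; runs independently of any size-based sweep */
  void sweepIdle(Map<K, CacheEntry<V>> map, long maxIdleSec) {
    final long cutoff = System.nanoTime() - TimeUnit.SECONDS.toNanos(maxIdleSec);
    // overflow-safe nanoTime comparison: "lastAccessedNs is older than cutoff"
    map.entrySet().removeIf(e -> e.getValue().lastAccessedNs - cutoff < 0);
  }
}
{code}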

Feedback on other aspects of the patch (all minor and/or nitpicks -- in
general this all seems solid) ...
 * AFAICT there should no longer be any need to modify TimeSource / 
TestTimeSource since tests no longer use/need advanceMs, correct ?
 * {{SolrCache.MAX_IDLE_TIME}} doesn't seem to have a name consistent w/the 
other variables in that interface ... seems like it should be 
{{SolrCache.MAX_IDLE_TIME_PARAM}} ?
 ** There are also a couple of places in LFUCache and LRUCache (where other 
existing {{*_PARAM}} constants are used) that seem to use the string literal 
{{"maxIdleTime"}} instead of using that new variable.
 * IIUC This isn't a mistake, it's a deliberate "clean up" change because the 
existing code includes this {{put(RAM_BYTES_USED_PARAM, ...)}} twice a few 
lines apart, correct? ...
{code:java}
-map.put(RAM_BYTES_USED_PARAM, ramBytesUsed());
+map.put("cumulative_idleEvictions", cidleEvictions);
{code}

 * Is there any reason not to make these final in both ConcurrentLFUCache & 
ConcurrentLRUCache?
{code:java}
private TimeSource timeSource = TimeSource.NANO_TIME;
private AtomicLong oldestEntry = new AtomicLong(0L);
{code}

 * re: this line in {{TestLFUCache.testMaxIdleTimeEviction}} ...
 ** {{assertEquals("markAndSweep spurious run", 1, sweepFinished.getCount());}}
 ** a more thread safe way to have this type of assertion...
{code:java}
final AtomicLong numSweepsStarted = new AtomicLong(0); // NEW
final CountDownLatch sweepFinished = new CountDownLatch(1);
ConcurrentLRUCache cache = new ConcurrentLRUCache<>(6, 5, 5, 6, false, false, null, IDLE_TIME_SEC) {
  @Override
  public void markAndSweep() {
    numSweepsStarted.incrementAndGet();  // NEW
    super.markAndSweep();
    sweepFinished.countDown();
  }
};
...
assertEquals("markAndSweep spurious runs", 0L, numSweepsStarted.get()); // CHANGED
{code}
 ** I think that pattern exists in another test as well?
 * we need to make sure the javadocs & ref-guide are updated to cover this new 
option, and be clear to users on how it interacts with other things (ie: that 
the idle sweep happens before the other sweeps and trumps things like the 
"entry size" checks)

> Caches should have an optional way to clean if idle for 'x' mins
> 
>
> Key: SOLR-9658
> URL: https://issues.apache.org/jira/browse/SOLR-9658
> Project: Solr
>  Issue Type: New Feature
>Reporter: Noble Paul
>Assignee: Andrzej Bialecki 
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-9658.patch, SOLR-9658.patch, SOLR-9658.patch, 
> SOLR-9658.patch, SOLR-9658.patch, SOLR-9658.patch
>
>
> If a cache is idle for long, it consumes precious memory. It should be 
> configurable to clear the cache if it was not accessed for 'x' secs. The 
> cache configuration can have an extra config {{maxIdleTime}}. If we wish it 
> to be cleaned after 10 mins of inactivity, set it to {{maxIdleTime=600}}. 
> [~dragonsinth] would it be a solution for the memory leak you mentioned?






[jira] [Commented] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)

2019-09-09 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925947#comment-16925947
 ] 

Hoss Man commented on SOLR-13746:
-

bq. ... No idea, there is an issue / mail thread already at ASF about 
AdoptOpenJDK. ...
bq. ... I think we should get Infra involved, at a minimum to ask if we should 
be managing JDKs on a self-serve basis. ...

So i'm not really clear on where we stand now...

IIUC in order to upgrade past 11.0.1 (which is broken) we need to use
(Adopt)OpenJDK because oracle hasn't made 11.0.(2|3|4) builds available? -- and
it sounds like there is an INFRA issue or mail archive thread somewhere about
being able to use OpenJDK ... can someone post a link to that?  is that a
discussion that's being had in public or in private? (even if it's private, can
someone post a link to it so folks w/the requisite karma can access it)

Is the infra conversation something happening "in the abstract" of "if/when
OpenJDK builds can/should be used", or is it concretely about the need for
specific projects to switch? ... ie:  Has the fact that 11.0.1 is broken and
effectively unusable for a lot of Solr testing been mentioned in the context of
that discussion?   Can/should we be filing an INFRA jira explicitly requesting
upgraded JDKs so it's clear there is a demonstrable need? (does such an issue
already exist? can someone please link it here?)

Finally: is docker available on the jenkins build slaves? Because worst-case
scenario, we could tweak our apache jenkins jobs to run inside docker containers
that always use the latest AdoptOpenJDK base images, a la...

https://github.com/hossman/solr-jenkins-docker-tester



bq. Should we also add this note to the JVM bugs page: 
https://cwiki.apache.org/confluence/display/lucene/JavaBugs#JavaBugs-OracleJava/SunJava/OpenJDKBugs

I thought someone already did this in response to an email thread about this
general topic a few months ago -- but maybe not?

the list of known JVM SSL bugs is well documented in SOLR-12988 -- anyone who
wants to take a stab at summarizing that info in the wiki or release notes of
Solr is welcome to do so (my focus has been on the tests themselves and
trying to figure out if there are any other SSL bugs we've overlooked ...
something i'm now freaking out about more as i realized none of the apache
jenkins jobs have actually been testing SSL)

> Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
> --
>
> Key: SOLR-13746
> URL: https://issues.apache.org/jira/browse/SOLR-13746
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
>
> I just realized that back in June, there was a miscommunication between 
> myself & Uwe (and a lack of double checking on my part!) regarding upgrading 
> the JVM versions on our jenkins machines...
>  * 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e]
>  * 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E]
> ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM 
> used on the _*apache*_  jenkins nodes is still (as of 2019-09-06)  
> "11.0.1+13-LTS" ...
> [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText]
> {noformat}
> ...
> [java-info] java version "11.0.1"
> [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle 
> Corporation)
> [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle 
> Corporation)
> ...
> {noformat}
> This means that even after the changes made in SOLR-12988 to re-enable SSL 
> testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the 
> yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use 
> SSL (either explicitly, or due to randomization) because of the logic in 
> SSLTestConfig that detects bad JVM versions and prevents confusion/spurious 
> failures.
> We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 
> ASAP.






[jira] [Commented] (SOLR-13745) Test should close resources: AtomicUpdateProcessorFactoryTest

2019-09-09 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925923#comment-16925923
 ] 

Hoss Man commented on SOLR-13745:
-

bq.  ... It'd be nice if failing to close a SolrQueryRequest might be enforced 
in tests ...

I haven't dug into how/where exactly the ObjectTracker logic helps enforce
that we're closing things like SolrIndexSearcher, but in theory there isn't any
reason it couldn't also enforce that we're closing (Local)SolrQueryRequest
objects? ... i think?
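
For reference, the track/release pattern that SolrCore & SolrIndexSearcher already use looks roughly like this -- purely an illustration of how it might be bolted onto a request class, not a patch (the class name below is made up):

{code:java}
import java.io.Closeable;
import org.apache.solr.common.util.ObjectReleaseTracker;

// Illustration only, not a patch: the same track/release pattern that SolrCore and
// SolrIndexSearcher already use, applied (hypothetically) to a request-like class.
class TrackedRequestSketch implements Closeable {
  TrackedRequestSketch() {
    assert ObjectReleaseTracker.track(this);    // recorded at construction time
  }

  @Override
  public void close() {
    // ... release the searcher holder, free other resources, etc. ...
    assert ObjectReleaseTracker.release(this);  // if this never runs, the test framework's
                                                // teardown fails the suite with
                                                // "ObjectTracker found N object(s) that were not released!!!"
  }
}
{code}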


> Test should close resources: AtomicUpdateProcessorFactoryTest 
> --
>
> Key: SOLR-13745
> URL: https://issues.apache.org/jira/browse/SOLR-13745
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 8.3
>
>
> This test hangs after the test runs because there are directory or request 
> resources (not sure yet) that are not closed.






[jira] [Commented] (SOLR-13745) Test should close resources: AtomicUpdateProcessorFactoryTest

2019-09-06 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924690#comment-16924690
 ] 

Hoss Man commented on SOLR-13745:
-

Interesting...

David: i suspect the reason these test bugs didn't manifest until after your 
commits in SOLR-13728 is because the new code you added in that issue causes 
DistributedUpdateProcessor to now call {{req.getSearcher().count(...)}} – 
resulting in {{SolrQueryRequestBase.searcherHolder}} getting populated in a way 
that it wouldn't have been previously for some of the {{LocalSolrQueryRequest}} 
instances used in this test.
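
ie: roughly this pattern (an illustration only, not the literal test code; the wrapper class/method below are made up) ...

{code:java}
import java.io.IOException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Illustration of the pattern described above -- not the literal test code.
class RequestCloseSketch {
  void addDoc(SolrCore core, SolrParams params,
              UpdateRequestProcessor processor, AddUpdateCommand cmd) throws IOException {
    SolrQueryRequest req = new LocalSolrQueryRequest(core, params);
    try {
      // As of SOLR-13728, DistributedUpdateProcessor may call req.getSearcher().count(...),
      // which lazily populates SolrQueryRequestBase.searcherHolder with a ref-counted searcher...
      processor.processAdd(cmd);
    } finally {
      // ...so the request now MUST be closed to decref that searcher, otherwise the
      // ObjectTracker reports the SolrIndexSearcher (and its Directory) as leaked.
      req.close();
    }
  }
}
{code}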

As for why it didn't fail when you ran tests before committing SOLR-13728 ... 
i'm guessing that maybe this is because of SOLR-13747 / SOLR-12988 ?

(I've already confirmed SOLR-13746 is the reason [yetus's patch review build of 
SOLR-13728|https://builds.apache.org/job/PreCommit-SOLR-Build/543/testReport/] 
didn't catch this either)

> Test should close resources: AtomicUpdateProcessorFactoryTest 
> --
>
> Key: SOLR-13745
> URL: https://issues.apache.org/jira/browse/SOLR-13745
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 8.3
>
>
> This test hangs after the test runs because there are directory or request 
> resources (not sure yet) that are not closed.






[jira] [Updated] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs

2019-09-06 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13747:

Attachment: SOLR-13747.patch
Status: Open  (was: Open)


Some background...

In SOLR-12988, during the discussion of re-enabling SSL testing under java11,
knowing that some java 11 versions were broken, I made the following comments...

{quote}
(on the Junit tests side, having assumes around JVM version is fine – because 
even then it's not a "silent" behavior change, it's an explicitly "test ignored 
because XYZ")
{quote}

{quote}
if devs are running tests with a broken JVM, then the tests can & should fail 
... that's the job of the tests. it's a bad idea to make the tests "hide" the 
failure by "faking" that things work using a degraded cipher, or skipping SSL 
completely (yes, i also think mark's changes to SSLTestConfig in December as 
part of his commit on this issue was a terrible idea as well) ... the *ONLY* 
thing we should _consider_ allowing tests to change about their behavior if 
they see a JVM is "broken" is to SKIP ie: 
assume(SomethingThatIsFalseForTheBrokenJVM)
{quote}

Ultimately, adding an {{SSLTestConfig.assumeSslIsSafeToTest()}} method seemed
better than doing a hard {{fail(..)}} in any test that wanted to use SSL --
particularly once we realized that (at that time) every available version of
Java 13 was affected by SSL bugs.  {{SKIP}} ing tests (instead of failing
outright) meant we could still have jenkins jobs running the latest jdk13-ea
available looking for _other_ bugs, w/o getting noise due to known SSL bugs.

But the fact that SOLR-13746 slipped through the cracks has caused me to
seriously regret that decision -- and led me to wonder:

* Do we have committers who are _still_ running {{ant test}} with "bad" JDKs 
that don't realize thousands of tests are getting skipped?
* What if down the road a jenkins node gets rebuilt/reverted to use an older 
jdk11 version, would anyone notice?



The attached patch adds a new
{{TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs}} to the
{{solr/test-framework}} module that does what its name implies (with an
informative message) when it detects that
{{SSLTestConfig.assumeSslIsSafeToTest()}} throws an assumption on the
current JVM.
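
(A rough sketch of the idea -- the attached patch is what actually counts; the base class and the message wording below are just placeholders:)

{code:java}
import org.apache.lucene.util.LuceneTestCase;
import org.apache.solr.util.SSLTestConfig;
import org.junit.AssumptionViolatedException;

// Rough sketch only -- see the attached patch for the real test; base class and
// message wording here are placeholders.
public class TestSSLTestConfig extends LuceneTestCase {
  public void testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs() {
    try {
      SSLTestConfig.assumeSslIsSafeToTest();
    } catch (AssumptionViolatedException ave) {
      fail("Your JVM has known SSL bugs: SSL tests are being silently SKIPped, so please "
          + "upgrade your JDK (see SOLR-12988 for the list of affected versions) -- "
          + ave.getMessage());
    }
  }
}
{code}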

I considered just replacing {{SSLTestConfig.assumeSslIsSafeToTest()}} with an
{{SSLTestConfig.failTheBuildUnlessSslIsSafeToTest()}} but realized that the
potential deluge of thousands of test failures that might occur for an aspiring
contributor who attempts to run Solr tests w/no idea their JDK is broken could
be overwhelming and scare people off before they even begin.  A single clear-cut
error (in addition to thousands of tests being {{SKIP}} ed) seemed more
inviting.

I should note: It's possible that down the road we will again find ourselves in 
this situation...

bq. ...particularly once we realized that (at that time) every available 
version of Java 13 was affected by SSL bugs...

...with some future "Java XX", whose every available 'ea' build we recognize as
being completely broken for SSL -- but we still want to let jenkins try to
look for _other_ bugs w/o the "noise" of this test failing every build.  If
that day comes, we can update {{SSLTestConfig.assumeSslIsSafeToTest()}} to 
{{SKIP}} SSL on those JVM builds, and "whitelist" them in 
{{TestSSLTestConfig.testFailIfUserRunsTestsWithJVMThatHasKnownSSLBugs}}.




> 'ant test' should fail on JVM's w/known SSL bugs
> 
>
> Key: SOLR-13747
> URL: https://issues.apache.org/jira/browse/SOLR-13747
> Project: Solr
>  Issue Type: Test
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-13747.patch
>
>
> If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has 
> known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this 
> -- so the user knows they should upgrade their JVM (rather than relying on 
> the user to notice that SSL tests were {{SKIP}} ed)






[jira] [Updated] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs

2019-09-06 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13747:

Description: If {{ant test}} (or the future gradle equivalent) is run w/a 
JVM that has known SSL bugs, there should be an obvious {{BUILD FAILED}} 
because of this -- so the user knows they should upgrade their JVM (rather than 
relying on the user to notice that SSL tests were {{SKIP}} ed)  (was: 
If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has known 
SSL bugs, there should be an obvious {{BUILD FAILED}} because of this -- so the 
user knows they should upgrade their JVM (rather than relying on the user to 
notice that SSL tests were {{SKIP}}ed))

> 'ant test' should fail on JVM's w/known SSL bugs
> 
>
> Key: SOLR-13747
> URL: https://issues.apache.org/jira/browse/SOLR-13747
> Project: Solr
>  Issue Type: Test
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
>
> If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has 
> known SSL bugs, there should be an obvious {{BUILD FAILED}} because of this 
> -- so the user knows they should upgrade their JVM (rather than relying on 
> the user to notice that SSL tests were {{SKIP}} ed)






[jira] [Created] (SOLR-13747) 'ant test' should fail on JVM's w/known SSL bugs

2019-09-06 Thread Hoss Man (Jira)
Hoss Man created SOLR-13747:
---

 Summary: 'ant test' should fail on JVM's w/known SSL bugs
 Key: SOLR-13747
 URL: https://issues.apache.org/jira/browse/SOLR-13747
 Project: Solr
  Issue Type: Test
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man



If {{ant test}} (or the future gradle equivalent) is run w/a JVM that has known 
SSL bugs, there should be an obvious {{BUILD FAILED}} because of this -- so the 
user knows they should upgrade their JVM (rather than relying on the user to 
notice that SSL tests were {{SKIP}}ed)






[jira] [Commented] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)

2019-09-06 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924661#comment-16924661
 ] 

Hoss Man commented on SOLR-13746:
-

[~thetaphi] / [~steve_rowe] - is this still something you guys have control 
over, or do we need to get infra involved?

> Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)
> --
>
> Key: SOLR-13746
> URL: https://issues.apache.org/jira/browse/SOLR-13746
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
>
> I just realized that back in June, there was a miscommunication between 
> myself & Uwe (and a lack of double checking on my part!) regarding upgrading 
> the JVM versions on our jenkins machines...
>  * 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e]
>  * 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E]
> ...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM 
> used on the _*apache*_  jenkins nodes is still (as of 2019-09-06)  
> "11.0.1+13-LTS" ...
> [https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText]
> {noformat}
> ...
> [java-info] java version "11.0.1"
> [java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle 
> Corporation)
> [java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle 
> Corporation)
> ...
> {noformat}
> This means that even after the changes made in SOLR-12988 to re-enable SSL 
> testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the 
> yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use 
> SSL (either explicitly, or due to randomization) because of the logic in 
> SSLTestConfig that detects bad JVM versions and prevents confusion/spurious 
> failures.
> We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 
> ASAP.






[jira] [Created] (SOLR-13746) Apache jenkins needs JVM 11 upgraded to at least 11.0.3 (SSL bugs)

2019-09-06 Thread Hoss Man (Jira)
Hoss Man created SOLR-13746:
---

 Summary: Apache jenkins needs JVM 11 upgraded to at least 11.0.3 
(SSL bugs)
 Key: SOLR-13746
 URL: https://issues.apache.org/jira/browse/SOLR-13746
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man


I just realized that back in June, there was a miscommunication between myself 
& Uwe (and a lack of double checking on my part!) regarding upgrading the JVM 
versions on our jenkins machines...
 * 
[http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3calpine.DEB.2.11.1906181434350.23523@tray%3e]
 * 
[http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E]

...Uwe only updated the JVMs on _his_ policeman jenkins machines - the JVM used 
on the _*apache*_  jenkins nodes is still (as of 2019-09-06)  "11.0.1+13-LTS" 
...

[https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3689/consoleText]
{noformat}
...
[java-info] java version "11.0.1"
[java-info] Java(TM) SE Runtime Environment (11.0.1+13-LTS, Oracle Corporation)
[java-info] Java HotSpot(TM) 64-Bit Server VM (11.0.1+13-LTS, Oracle 
Corporation)
...
{noformat}
This means that even after the changes made in SOLR-12988 to re-enable SSL 
testing on java11, all Apache jenkins 'master' builds, (including, AFAICT the 
yetus / 'Patch Review' builds) are still SKIPping thousands of tests that use 
SSL (either explicitly, or due to randomization) because of the logic in
SSLTestConfig that detects bad JVM versions and prevents confusion/spurious
failures.

We really need to get the jenkins nodes updated to openjdk 11.0.3 or 11.0.4 
ASAP.






[jira] [Commented] (SOLR-13728) Fail partial updates if it would inadvertently remove nested docs

2019-09-06 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924463#comment-16924463
 ] 

Hoss Man commented on SOLR-13728:
-


Huh?

No, i'm directly referring to commit c8203e4787b8ad21e1270781ba4e09fd7f3acb00 ...

{noformat}
hossman@slate:~/lucene/dev [j11] [master] $ git co 
c8203e4787b8ad21e1270781ba4e09fd7f3acb00 && ant clean && cd solr/core/ && ant 
test -Dtestcase=AtomicUpdateProcessorFactoryTest
...
   [junit4]   2> NOTE: Linux 5.0.0-27-generic amd64/AdoptOpenJDK 11.0.4 
(64-bit)/cpus=8,threads=2,free=199278080,total=522190848
   [junit4]   2> NOTE: All tests run in this JVM: 
[AtomicUpdateProcessorFactoryTest]
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=AtomicUpdateProcessorFactoryTest -Dtests.seed=9CA837338CB8D055 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=eu-ES 
-Dtests.timezone=Indian/Kerguelen -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.00s | AtomicUpdateProcessorFactoryTest (suite) <<<
   [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 6 
object(s) that were not released!!! [SolrCore, SolrIndexSearcher, 
MockDirectoryWrapper, MockDirectoryWrapper, SolrIndexSearcher, 
MockDirectoryWrapper]
   [junit4]> 
org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException: 
org.apache.solr.core.SolrCore
   [junit4]>at 
org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:42)
   [junit4]>at 
org.apache.solr.core.SolrCore.(SolrCore.java:1093)
...



hossman@slate:~/lucene/dev/solr/core [j11] [c8203e4787b] $ cd ../../ && git co 
c8203e4787b8ad21e1270781ba4e09fd7f3acb00~1
Previous HEAD position was c8203e4787b SOLR-13728: fail partial updates to 
child docs when not supported.
HEAD is now at 2552986e872 LUCENE-8917: Fix Solr's TestCodecSupport to stop 
trying to use the now-removed Direct docValues format


hossman@slate:~/lucene/dev [j11] [2552986e872] $ ant clean && cd solr/core/ && 
ant test -Dtestcase=AtomicUpdateProcessorFactoryTest
...
common.test:

BUILD SUCCESSFUL
Total time: 1 minute 10 seconds



hossman@slate:~/lucene/dev/solr/core [j11] [2552986e872] $ ant test  
-Dtestcase=AtomicUpdateProcessorFactoryTest -Dtests.seed=9CA837338CB8D055 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=eu-ES 
-Dtests.timezone=Indian/Kerguelen -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
...
common.test:

BUILD SUCCESSFUL
Total time: 19 seconds
{noformat}

> Fail partial updates if it would inadvertently remove nested docs
> -
>
> Key: SOLR-13728
> URL: https://issues.apache.org/jira/browse/SOLR-13728
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 8.3
>
> Attachments: SOLR-13728.patch
>
>
> In SOLR-12638 Solr gained the ability to do partial updates (aka atomic 
> updates) to nested documents.  However this feature only works if the schema 
> meets certain circumstances.  We can know we don't support it and fail the 
> request – what I propose here.  This is much friendlier than wiping out 
> existing documents.






[jira] [Reopened] (SOLR-13728) Fail partial updates if it would inadvertently remove nested docs

2019-09-06 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-13728:
-

these commits appear to be the cause of a 100% failure rate in {{ant test
-Dtestcase=AtomicUpdateProcessorFactoryTest}} in recent jenkins builds.

the failures reproduce for me on master, regardless of seed or any other jvm
options (haven't tested branch_8x yet).

the failures relate to tracking of unclosed directories...

{noformat}
   [junit4]   2> 17393 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 
] o.a.s.c.CachingDirectoryFactory Timeout waiting for all directory ref counts 
to be released - gave up waiting on 
CachedDir<>
   [junit4]   2> 17397 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 
] o.a.s.c.CachingDirectoryFactory Error closing 
directory:org.apache.solr.common.SolrException: Timeout waiting for all 
directory ref counts to be released - gave up waiting on 
CachedDir<>
   [junit4]   2>at 
org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:178)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.close(SolrCore.java:1699)
   [junit4]   2>at 
org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:139)
   [junit4]   2>at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   [junit4]   2>at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
   [junit4]   2>at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]   2>at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]   2>at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> 
   [junit4]   2> 17399 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 
] o.a.s.c.SolrCore java.lang.AssertionError: 2
   [junit4]   2>at 
org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:192)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.close(SolrCore.java:1699)
   [junit4]   2>at 
org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:139)
   [junit4]   2>at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   [junit4]   2>at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
   [junit4]   2>at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]   2>at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]   2>at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> 
   [junit4]   2> 17399 ERROR (coreCloseExecutor-15-thread-1) [x:collection1 
] o.a.s.c.SolrCores Error shutting down core:java.lang.AssertionError: 2
   [junit4]   2>at 
org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:192)
   [junit4]   2>at 
org.apache.solr.core.SolrCore.close(SolrCore.java:1699)
   [junit4]   2>at 
org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:139)
   [junit4]   2>at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   [junit4]   2>at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
   [junit4]   2>at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]   2>at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]   2>at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> 
...
   [junit4]   2> 78497 INFO  
(SUITE-AtomicUpdateProcessorFactoryTest-seed#[4E875A6AF0417D9C]-worker) [ ] 
o.a.s.SolrTestCaseJ4 --- 
Done waiting for tracked resources to be released
   [junit4]   2> NOTE: test params are: codec=Lucene80, 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@917add1),
 locale=sr-Cyrl-ME, timezone=Canada/Saskatchewan
   [junit4]   2> NOTE: Linux 5.0.0-27-generic amd64/AdoptOpenJDK 11.0.4 
(64-bit)/cpus=8,threads=2,free=407897088,total=522190848
   [junit4]   2> NOTE: All tests run in this JVM: 
[AtomicUpdateProcessorFactoryTest]
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=AtomicUpdateProcessorFactoryTest -Dtests.seed=4E875A6AF0417D9C 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Cyrl-ME 
-Dtests.timezone=Canada/Saskatchewan -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   0.00s | AtomicUpdateProcessorFactoryTest (suite) <<<
   [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 6 
object(s) that were not released!!! [SolrCore, 

[jira] [Resolved] (LUCENE-8917) Remove the "Direct" doc-value format

2019-09-05 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved LUCENE-8917.
--
Resolution: Fixed

i think we're all good now -- that looks like the only affected test.

> Remove the "Direct" doc-value format
> 
>
> Key: LUCENE-8917
> URL: https://issues.apache.org/jira/browse/LUCENE-8917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0)
>
>
> This is the last user of the Legacy*DocValues APIs. Another option would be 
> to move this format to doc-value iterators, but I don't think it's worth the 
> effort: let's just remove it in Lucene 9?






[jira] [Reopened] (LUCENE-8917) Remove the "Direct" doc-value format

2019-09-05 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened LUCENE-8917:
--

this seems to have caused some reliable solr test failures?

Example...
https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-Tests-master/3683/
{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestCodecSupport 
-Dtests.method=testDynamicFieldsDocValuesFormats -Dtests.seed=FA28EF8B1D76D0FE 
-Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ru 
-Dtests.timezone=Europe/Tiraspol -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.04s J1 | 
TestCodecSupport.testDynamicFieldsDocValuesFormats <<<
   [junit4]> Throwable #1: java.lang.IllegalArgumentException: An SPI class 
of type org.apache.lucene.codecs.DocValuesFormat with name 'Direct' does not 
exist.  You need to add the corresponding JAR file supporting this SPI to your 
classpath.  The current classpath supports the following names: [Asserting, 
Lucene70, Lucene80]
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([FA28EF8B1D76D0FE:1AFBB14D0BE866AA]:0)
   [junit4]>at 
org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:116)
   [junit4]>at 
org.apache.lucene.codecs.DocValuesFormat.forName(DocValuesFormat.java:108)
   [junit4]>at 
org.apache.solr.core.SchemaCodecFactory$1.getDocValuesFormatForField(SchemaCodecFactory.java:112)
   [junit4]>at 
org.apache.lucene.codecs.lucene80.Lucene80Codec$2.getDocValuesFormatForField(Lucene80Codec.java:74)
   [junit4]>at 
org.apache.solr.core.TestCodecSupport.testDynamicFieldsDocValuesFormats(TestCodecSupport.java:87)
...
{noformat}

...probably just some tests that need to be removed/updated so they no longer try
to use Direct as an option?

> Remove the "Direct" doc-value format
> 
>
> Key: LUCENE-8917
> URL: https://issues.apache.org/jira/browse/LUCENE-8917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0)
>
>
> This is the last user of the Legacy*DocValues APIs. Another option would be 
> to move this format to doc-value iterators, but I don't think it's worth the 
> effort: let's just remove it in Lucene 9?






[jira] [Updated] (SOLR-13741) possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest

2019-09-04 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13741:

Attachment: SOLR-13741.patch
Status: Open  (was: Open)

Attaching my patch; note that at the moment it only modifies
{{AuditLoggerIntegrationTest}} and does not yet address the '#1' comment I made
above regarding the 'delay' option on {{CallbackAuditLoggerPlugin}} -- there are
additional nocommit comments regarding the planned changes for that, but I
didn't want to start on those changes until these existing uncertainties were
addressed.

[~janhoy] : I would greatly appreciate your review here to help clear up the 
"correct test, bad behavior" vs "correct behavior, bad test" questions.

> possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest
> --
>
> Key: SOLR-13741
> URL: https://issues.apache.org/jira/browse/SOLR-13741
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13741.patch
>
>
> A while back i saw a weird non-reproducible failure from 
> AuditLoggerIntegrationTest.  When i started reading through that code, 2 
> things jumped out at me:
> # the way the 'delay' option works is brittle, and makes assumptions about 
> CPU scheduling that aren't necessarily going to be true (and also suffers 
> from the problem that Thread.sleep isn't guaranteed to sleep as long as you 
> ask it to)
> # the way the existing {{waitForAuditEventCallbacks(number)}} logic works is by 
> checking the size of a (List) {{buffer}} of received events in a sleep/poll 
> loop, until it contains at least N items -- but the code that adds items to 
> that buffer in the async Callback thread runs _before_ the code that updates 
> other state variables (like the global {{count}} and the patch-specific 
> {{resourceCounts}}), meaning that a test waiting on 3 events could "see" 3 
> events added to the buffer, but calling {{assertEquals(3, 
> receiver.getTotalCount())}} could subsequently fail because that variable 
> hadn't been updated yet.
> #2 was the source of the failures I was seeing, and while a quick fix for 
> that specific problem would be to update all other state _before_ adding the 
> event to the buffer, I set out to try and make more general improvements to 
> the test:
> * eliminate the dependency on sleep loops by {{await}}-ing on concurrent data 
> structures
> * harden the assertions made about the expected events received (updating 
> some test methods that currently just assert the number of events received)
> * add new assertions that _only_ the expected events are received.
> In the process of doing this, I've found several oddities/discrepancies 
> between things the test currently claims/asserts, and what *actually* happens 
> under more rigorous scrutiny/assertions.
> I'll attach a patch shortly that has my (in progress) updates and includes 
> copious nocommits about things that seem suspect.  the summary of these 
> concerns is:
> * SolrException status codes that do not match what the existing test says 
> they should (but doesn't assert)
> * extra AuditEvents occurring that the existing test does not expect
> * AuditEvents for incorrect credentials that do not at all match the expected 
> AuditEvent in the existing test -- which the current test seems to miss in 
> its assertions because it's picking up some extra events triggered by 
> previous requests earlier in the test that just happen to also match the 
> assertions.
> ...it's not clear to me if the test logic is correct and these are "code 
> bugs" or if the test is faulty.






[jira] [Created] (SOLR-13741) possible AuditLogger bugs uncovered while hardening AuditLoggerIntegrationTest

2019-09-04 Thread Hoss Man (Jira)
Hoss Man created SOLR-13741:
---

 Summary: possible AuditLogger bugs uncovered while hardening 
AuditLoggerIntegrationTest
 Key: SOLR-13741
 URL: https://issues.apache.org/jira/browse/SOLR-13741
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Hoss Man


A while back i saw a weird non-reproducible failure from 
AuditLoggerIntegrationTest.  When i started reading through that code, 2 things 
jumped out at me:

# the way the 'delay' option works is brittle, and makes assumptions about CPU
scheduling that aren't necessarily going to be true (and also suffers from the
problem that Thread.sleep isn't guaranteed to sleep as long as you ask it to)
# the way the existing {{waitForAuditEventCallbacks(number)}} logic works is by
checking the size of a (List) {{buffer}} of received events in a sleep/poll
loop, until it contains at least N items -- but the code that adds items to
that buffer in the async Callback thread runs _before_ the code that updates
other state variables (like the global {{count}} and the patch-specific
{{resourceCounts}}), meaning that a test waiting on 3 events could "see" 3
events added to the buffer, but calling {{assertEquals(3,
receiver.getTotalCount())}} could subsequently fail because that variable
hadn't been updated yet.

#2 was the source of the failures I was seeing, and while a quick fix for that 
specific problem would be to update all other state _before_ adding the event 
to the buffer, I set out to try and make more general improvements to the test:

* eliminate the dependency on sleep loops by {{await}}-ing on concurrent data
structures (rough sketch below)
* harden the assertions made about the expected events received (updating some
test methods that currently just assert the number of events received)
* add new assertions that _only_ the expected events are received.
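
For the first bullet, the rough shape I have in mind is something like this generic sketch (*not* the actual patch -- the real callback receiver and AuditEvent types differ):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Generic sketch of the "await on a concurrent data structure" idea -- NOT the actual
// patch; the real callback receiver and AuditEvent types differ.
class CallbackSink<E> {
  private final BlockingQueue<E> events = new LinkedBlockingQueue<>();

  /** called from the async callback thread */
  void onEvent(E event) {
    // BlockingQueue gives a happens-before edge between offer() and a later poll(),
    // so any state the callback updated before offer() is visible to the test thread.
    events.offer(event);
  }

  /** called from the test thread: blocks for exactly N events, fails fast on timeout */
  List<E> waitForEvents(int expected, long timeoutSec) throws InterruptedException {
    final List<E> received = new ArrayList<>(expected);
    while (received.size() < expected) {
      final E event = events.poll(timeoutSec, TimeUnit.SECONDS);
      if (event == null) {
        throw new AssertionError("Timed out: got " + received.size()
            + " of " + expected + " expected events");
      }
      received.add(event);
    }
    return received;
  }
}
{code}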

In the process of doing this, I've found several oddities/discrepancies between
things the test currently claims/asserts, and what *actually* happens under
more rigorous scrutiny/assertions.

I'll attach a patch shortly that has my (in progress) updates and includes
copious nocommits about things that seem suspect.  the summary of these concerns is:

* SolrException status codes that do not match what the existing test says they
should (but doesn't assert)
* extra AuditEvents occurring that the existing test does not expect
* AuditEvents for incorrect credentials that do not at all match the expected
AuditEvent in the existing test -- which the current test seems to miss in its
assertions because it's picking up some extra events triggered by previous
requests earlier in the test that just happen to also match the assertions.


...it's not clear to me if the test logic is correct and these are "code bugs" 
or if the test is faulty.







[jira] [Comment Edited] (SOLR-13709) Race condition on core reload while core is still loading?

2019-09-03 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921632#comment-16921632
 ] 

Hoss Man edited comment on SOLR-13709 at 9/3/19 6:27 PM:
-

-Commit 86e8c44be472556c8a905deb338cafa803ee6ee0 in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter-
 -[ [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86e8c44] ]-

-SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified-

-(cherry picked from commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93)-

EDIT: not actually relevant to this issue, sorry.


was (Author: jira-bot):
Commit 86e8c44be472556c8a905deb338cafa803ee6ee0 in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86e8c44 ]

SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified

(cherry picked from commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93)


> Race condition on core reload while core is still loading?
> --
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Erick Erickson
>Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...






[jira] [Comment Edited] (SOLR-13709) Race condition on core reload while core is still loading?

2019-09-03 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921619#comment-16921619
 ] 

Hoss Man edited comment on SOLR-13709 at 9/3/19 6:27 PM:
-

-Commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter-
 -[ [https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=83cd54f] ]-

-SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified-

EDIT: not actually relevant to this issue, sorry.


was (Author: jira-bot):
Commit 83cd54f80157916b364bb5ebde20a66cbd5d3d93 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=83cd54f ]

SOLR-13709: Fixed distributed grouping when multiple 'fl' params are specified


> Race condition on core reload while core is still loading?
> --
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Erick Erickson
>Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...






[jira] [Commented] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param

2019-09-03 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921644#comment-16921644
 ] 

Hoss Man commented on SOLR-13717:
-

Gah, juggling too many tabs/issues at the same time.

Primary commits related to this issue...
* master: https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=83cd54f
** 83cd54f80157916b364bb5ebde20a66cbd5d3d93
* branch_8x: https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=86e8c44
** 86e8c44be472556c8a905deb338cafa803ee6ee0

> Distributed Grouping breaks multi valued 'fl' param
> ---
>
> Key: SOLR-13717
> URL: https://issues.apache.org/jira/browse/SOLR-13717
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13717.patch, SOLR-13717.patch
>
>
> Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} 
> params are specified.
> {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that 
> assumes there will be 0 or 1 {{fl}} params in the original request that it 
> should inspect to see if it needs to append (via string concat) the uniqueKey 
> field onto in order to collate the returned stored fields into their 
> respective (grouped) documents -- and then ignores any additional {{fl}} 
> params that may exist in the original request when it does so.
> The net result is that only the uniqueKey field and whatever fields _are_ 
> specified in the first {{fl}} param specified are fetched from each shard and 
> ultimately returned.
> The only workaround is to replace multiple {{fl}} params with a single {{fl}} 
> param containing a comma separated list of the requested fields.
> 
> Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing 
> these requests, which should all be equivalent...
> {noformat}
> $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv
> ...
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author,name,id&q=*:*&group=true&group.field=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "name":["The Black Company"],
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "name":["Foundation"],
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=author&fl=name,id&q=*:*&group=true&group.field=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true&indent=true&fl=id&fl=author&fl=name&q=*:*&group=true&group.field=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0553573403"}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354"}]
>   }}]}}}
> {noformat}






[jira] [Updated] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param

2019-09-03 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13717:

Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Christine: thanks for the review, and for catching & fixing my test laziness.  
Much cleaner.

> Distributed Grouping breaks multi valued 'fl' param
> ---
>
> Key: SOLR-13717
> URL: https://issues.apache.org/jira/browse/SOLR-13717
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13717.patch, SOLR-13717.patch
>
>
> Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} 
> params are specified.
> {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that 
> assumes there will be 0 or 1 {{fl}} params in the original request that it 
> should inspect to see if it needs to append (via string concat) the uniqueKey 
> field onto in order to collate the returned stored fields into their 
> respective (grouped) documents -- and then ignores any additional {{fl}} 
> params that may exist in the original request when it does so.
> The net result is that only the uniqueKey field and whatever fields _are_ 
> specified in the first {{fl}} param specified are fetched from each shard and 
> ultimately returned.
> The only workaround is to replace multiple {{fl}} params with a single {{fl}} 
> param containing a comma separated list of the requested fields.
> 
> Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing 
> these requests which should all be equivalent...
> {noformat}
> $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv
> ...
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author,name,id=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "name":["The Black Company"],
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "name":["Foundation"],
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author=name,id=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=id=author=name=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0553573403"}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354"}]
>   }}]}}}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?

2019-09-03 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921597#comment-16921597
 ] 

Hoss Man commented on SOLR-13709:
-

just to be clear, my primary concern when i created this issue was that it was 
evident from the test failure logs that core reloading (and, as Erick points 
out, potentially other core-level ops) could occur in a race condition with the 
core itself loading.

my comments about {{SolrCores.getCoreDescriptor(String)}} and if/when/why/how  
it should block on attempts to access a core by name if/while that core was 
loading were based *solely* on the existing javadocs for that method.

if those javadocs are and have always been wrong, then trying to "fix" that 
method to match the javadocs isn't necessarily the best solution -- especially 
if doing so causes lots of other problems.  we can always just update the 
javadocs, making a note of when/why/how the value may be null, and audit the 
callers to ensure they are accounting for the possibility of null and handling 
that value in whatever way makes the most sense for the situation (throw NPE, 
throw a diff exception, fail a command, etc...)

i should point out, i have no idea if a "user level" Core RELOAD (or SWAP or 
UNLOAD) op (ie: something triggered externally via /admin/cores, or via 
overseer) also has this problem, or already accounts for the possibility that a 
core may not yet be loaded -- it may simply be that this particular ZkWatcher 
that registered by the core to watch the schema is itself broken, and should be 
checking some more explicit state to block and take no action until the core is 
fully loaded.

As far as testing...

[~erickerickson] - it's not really clear to me what/where/how you're currently 
trying to test this? ... as i mentioned, it's kind of a fluke that 
TestSolrCLIRunExample triggered this failure at all, and even when it did it 
didn't really "fail" in a reliable way that was obviously related to this 
specific bug.  

I would suggest that a more robust way to test this would be with a more 
targeted non-cloud test, using a custom plugin (search handler, component, 
whatever...) that spins up a background thread to trigger schema updates in ZK 
(so that the problematic watcher which does a core reload on schema changes 
will then fire) and then the custom component should "stall" for some amount of 
time (ideally {{await}}-ing on something instead of an arbitrary sleep, but i 
haven't thought it through enough to know what exact condition it could await 
on) to force a delay in the completion of the SolrCore loading.  Then your 
test just tries to initialize a SolrCore with a config that uses this custom 
plugin, and asserts that the SolrCore initializes fine *AND* that it 
(eventually) picks up the updated schema (via polling on the schema API?)

make sense?
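
A rough sketch of the kind of "stalling" plugin described above (purely illustrative, not working test code: the {{triggerSchemaUpdateInZk()}} helper, the latch wiring, and the timeout are assumptions):

{code:java}
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.util.plugin.SolrCoreAware;

public class StallingComponent extends SearchComponent implements SolrCoreAware {

  private final CountDownLatch schemaUpdated = new CountDownLatch(1);

  @Override
  public void inform(SolrCore core) {
    // Kick off the schema change in the background so the schema watcher
    // (which triggers a core reload) fires while this core is still loading...
    new Thread(() -> {
      triggerSchemaUpdateInZk();   // hypothetical helper: push a modified managed-schema to ZK
      schemaUpdated.countDown();
    }, "schema-updater").start();

    try {
      // ...then stall the completion of SolrCore loading until the update has landed.
      schemaUpdated.await(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private void triggerSchemaUpdateInZk() {
    // assumption: test-specific code that writes an updated schema to ZooKeeper
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {}

  @Override
  public void process(ResponseBuilder rb) throws IOException {}

  @Override
  public String getDescription() {
    return "test component that stalls core loading until a schema update lands";
  }
}
{code}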

> Race condition on core reload while core is still loading?
> --
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Erick Erickson
>Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param

2019-08-23 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13717:

Status: Patch Available  (was: Open)

> Distributed Grouping breaks multi valued 'fl' param
> ---
>
> Key: SOLR-13717
> URL: https://issues.apache.org/jira/browse/SOLR-13717
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13717.patch
>
>
> Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} 
> params are specified.
> {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that 
> assumes there will be 0 or 1 {{fl}} params in the original request that it 
> should inspect to see if it needs to append (via string concat) the uniqueKey 
> field onto in order to collate the returned stored fields into their 
> respective (grouped) documents -- and then ignores any additional {{fl}} 
> params that may exist in the original request when it does so.
> The net result is that only the uniqueKey field and whatever fields _are_ 
> specified in the first {{fl}} param specified are fetched from each shard and 
> ultimately returned.
> The only workaround is to replace multiple {{fl}} params with a single {{fl}} 
> param containing a comma separated list of the requested fields.
> 
> Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing 
> these requests which should all be equivalent...
> {noformat}
> $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv
> ...
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author,name,id=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "name":["The Black Company"],
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "name":["Foundation"],
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author=name,id=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=id=author=name=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0553573403"}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354"}]
>   }}]}}}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param

2019-08-23 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13717:

Attachment: SOLR-13717.patch
Status: Open  (was: Open)


Attached patch includes a fix and some new test coverage of distributed 
grouping w/various options comparing the results when using a single {{fl}} vs 
equivalent multivalued {{fl}} params.
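
For illustration only (this is not the actual test code), the kind of equivalence the new coverage is meant to check, sketched with SolrJ; the base URL and collection name are placeholders:

{code:java}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiFlEquivalence {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Single fl param with a comma separated list...
      SolrQuery single = new SolrQuery("*:*");
      single.set("group", true);
      single.set("group.field", "genre_s");
      single.set("fl", "author,name,id");

      // ...vs the same fields spread across multiple fl params.
      SolrQuery multi = new SolrQuery("*:*");
      multi.set("group", true);
      multi.set("group.field", "genre_s");
      multi.add("fl", "author");
      multi.add("fl", "name,id");

      QueryResponse a = client.query("gettingstarted", single);
      QueryResponse b = client.query("gettingstarted", multi);
      // With the fix in place both grouped responses should carry the same stored fields.
      System.out.println(a.getGroupResponse().getValues().get(0).getValues().get(0).getResult().get(0));
      System.out.println(b.getGroupResponse().getValues().get(0).getValues().get(0).getResult().get(0));
    }
  }
}
{code}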



> Distributed Grouping breaks multi valued 'fl' param
> ---
>
> Key: SOLR-13717
> URL: https://issues.apache.org/jira/browse/SOLR-13717
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13717.patch
>
>
> Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} 
> params are specified.
> {{StoredFieldsShardRequestFactory}} has very (old and) brittle code that 
> assumes there will be 0 or 1 {{fl}} params in the original request that it 
> should inspect to see if it needs to append (via string concat) the uniqueKey 
> field onto in order to collate the returned stored fields into their 
> respective (grouped) documents -- and then ignores any additional {{fl}} 
> params that may exist in the original request when it does so.
> The net result is that only the uniqueKey field and whatever fields _are_ 
> specified in the first {{fl}} param specified are fetched from each shard and 
> ultimately returned.
> The only workaround is to replace multiple {{fl}} params with a single {{fl}} 
> param containing a comma separated list of the requested fields.
> 
> Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing 
> these requests which should all be equivalent...
> {noformat}
> $ bin/post -c gettingstarted -out yes example/exampledocs/books.csv
> ...
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author,name,id=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "name":["The Black Company"],
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "name":["Foundation"],
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author=name,id=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0812521390",
> "author":["Glen Cook"]}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354",
> "author":["Isaac Asimov"]}]
>   }}]}}}
> $ curl 
> 'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=id=author=name=*:*=true=genre_s'
> {
>   "grouped":{
> "genre_s":{
>   "matches":10,
>   "groups":[{
>   "groupValue":"fantasy",
>   "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
>   {
> "id":"0553573403"}]
>   }},
> {
>   "groupValue":"scifi",
>   "doclist":{"numFound":2,"start":0,"docs":[
>   {
> "id":"0553293354"}]
>   }}]}}}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13717) Distributed Grouping breaks multi valued 'fl' param

2019-08-23 Thread Hoss Man (Jira)
Hoss Man created SOLR-13717:
---

 Summary: Distributed Grouping breaks multi valued 'fl' param
 Key: SOLR-13717
 URL: https://issues.apache.org/jira/browse/SOLR-13717
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Hoss Man



Co-worker discovered a bug with (distributed) grouping when multiple {{fl}} 
params are specified.

{{StoredFieldsShardRequestFactory}} has very (old and) brittle code that 
assumes there will be 0 or 1 {{fl}} params in the original request that it 
should inspect to see if it needs to append (via string concat) the uniqueKey 
field onto in order to collate the returned stored fields into their respective 
(grouped) documents -- and then ignores any additional {{fl}} params that may 
exist in the original request when it does so.

The net result is that only the uniqueKey field and whatever fields _are_ 
specified in the first {{fl}} param specified are fetched from each shard and 
ultimately returned.

The only workaround is to replace multiple {{fl}} params with a single {{fl}} 
param containing a comma separated list of the requested fields.



Bug is trivial to reproduce with {{bin/solr -e cloud -noprompt}} by comparing 
these requests which should all be equivalent...

{noformat}
$ bin/post -c gettingstarted -out yes example/exampledocs/books.csv
...
$ curl 
'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author,name,id=*:*=true=genre_s'
{
  "grouped":{
"genre_s":{
  "matches":10,
  "groups":[{
  "groupValue":"fantasy",
  "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
  {
"id":"0812521390",
"name":["The Black Company"],
"author":["Glen Cook"]}]
  }},
{
  "groupValue":"scifi",
  "doclist":{"numFound":2,"start":0,"docs":[
  {
"id":"0553293354",
"name":["Foundation"],
"author":["Isaac Asimov"]}]
  }}]}}}
$ curl 
'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=author=name,id=*:*=true=genre_s'
{
  "grouped":{
"genre_s":{
  "matches":10,
  "groups":[{
  "groupValue":"fantasy",
  "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
  {
"id":"0812521390",
"author":["Glen Cook"]}]
  }},
{
  "groupValue":"scifi",
  "doclist":{"numFound":2,"start":0,"docs":[
  {
"id":"0553293354",
"author":["Isaac Asimov"]}]
  }}]}}}
$ curl 
'http://localhost:8983/solr/gettingstarted/query?omitHeader=true=true=id=author=name=*:*=true=genre_s'
{
  "grouped":{
"genre_s":{
  "matches":10,
  "groups":[{
  "groupValue":"fantasy",
  "doclist":{"numFound":8,"start":0,"maxScore":1.0,"docs":[
  {
"id":"0553573403"}]
  }},
{
  "groupValue":"scifi",
  "doclist":{"numFound":2,"start":0,"docs":[
  {
"id":"0553293354"}]
  }}]}}}
{noformat}
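
A minimal sketch of the direction a fix can take (illustrative only, not the committed patch): collect *every* {{fl}} value and make sure the uniqueKey field is present, instead of string-concatenating onto just the first {{fl}} param.

{code:java}
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;

public final class FlMergeSketch {

  /** Returns one fl value covering all fl params, with the uniqueKey appended if missing. */
  public static String mergedFl(SolrParams original, String uniqueKeyField) {
    String[] fls = original.getParams(CommonParams.FL);
    if (fls == null || fls.length == 0) {
      return null; // no fl at all: every stored field (incl. the uniqueKey) comes back anyway
    }
    StringBuilder merged = new StringBuilder();
    boolean hasUniqueKey = false;
    for (String fl : fls) {
      if (merged.length() > 0) merged.append(',');
      merged.append(fl);
      // crude containment test, good enough for a sketch
      if (("," + fl + ",").contains("," + uniqueKeyField + ",")) {
        hasUniqueKey = true;
      }
    }
    if (!hasUniqueKey) {
      merged.append(',').append(uniqueKeyField);
    }
    return merged.toString();
  }
}
{code}

The shard request would then set this merged value back as the one and only {{fl}} param (e.g. via {{ModifiableSolrParams.set(CommonParams.FL, ...)}}) before fanning out to the shards.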




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?

2019-08-23 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914665#comment-16914665
 ] 

Hoss Man commented on SOLR-13709:
-

{quote}Is there a possibility that this is happening before 
CoreContainer.load() is finished?
{quote}
it's absolutely possible – that's the point i made when i created this issue:
bq. ...AFAICT the only way this NPE is possible is if the CoreDescriptor for the 
original SolrCore is NULL at the time the watcher fires, and the only 
conceivable way that seems to be possible is if the original SolrCore hadn't 
completely finished loading.

According to the docs of that method it _should_ block until the core is loaded, 
but the ZkWatcher thread in question -- set by the SolrCore during its own 
init in order to reload the core if the schema changes -- is calling 
{{getCoreDescriptor()}} in order to reload the core and getting null.
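
To make the distinction concrete, here is a sketch (not the actual ZkController/SolrCore listener code) of the kind of defensive check being discussed, assuming the javadoc'ed blocking behaviour can't be relied on:

{code:java}
import java.lang.invoke.MethodHandles;

import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class GuardedReload {
  private static final Logger log =
      LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

  /** Skip (or retry later) instead of NPE-ing when the core hasn't finished loading. */
  void maybeReload(CoreContainer cc, String coreName) {
    CoreDescriptor cd = cc.getCoreDescriptor(coreName);
    if (cd == null) {
      log.info("No CoreDescriptor for {} yet (core still loading?), skipping reload", coreName);
      return;
    }
    cc.reload(coreName);
  }
}
{code}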



> Race condition on core reload while core is still loading?
> --
>
> Key: SOLR-13709
> URL: https://issues.apache.org/jira/browse/SOLR-13709
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Erick Erickson
>Priority: Major
> Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
> there may be a race condition when attempting to re-load a SolrCore while the 
> core is currently in the process of (re)loading that can leave the SolrCore 
> in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13709) Race condition on core reload while core is still loading?

2019-08-21 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13709:

Attachment: apache_Lucene-Solr-Tests-8.x_449.log.txt
Status: Open  (was: Open)

I've attached the logs from the jenkins run in question...

Interestingly: Even though the logs indicate several problems in trying to 
reload/unload the SolrCore, the test itself didn't seem to care enough about 
the state of the collection to notice the problems – the only junit failure 
recorded was a suite level failure from the ObjectTracker due to unreleased 
threads/objects.

The first sign of trouble in the logs is this WARN from a ZK watcher registered 
to monitor the schema for changes (in schemaless mode) in order to re-load the 
SolrCore – it fails with a NullPointerException...
{noformat}
   [junit4]   2> 4309877 WARN  (Thread-7591) [n:localhost:38920_solr 
c:testCloudExamplePrompt s:shard2 r:core_node7 
x:testCloudExamplePrompt_shard2_replica_n4 ] o.a.s.c.ZkController liste
ner throws error
   [junit4]   2>   => org.apache.solr.common.SolrException: Unable to 
reload core [testCloudExamplePrompt_shard2_replica_n6]
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557)
   [junit4]   2> org.apache.solr.common.SolrException: Unable to reload core 
[testCloudExamplePrompt_shard2_replica_n6]
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.core.SolrCore.lambda$getConfListener$21(SolrCore.java:3099) 
~[java/:?]
   [junit4]   2>at 
org.apache.solr.cloud.ZkController.lambda$fireEventListeners$14(ZkController.java:2514)
 ~[java/:?]
   [junit4]   2>at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
   [junit4]   2> Caused by: java.lang.NullPointerException
   [junit4]   2>at 
org.apache.solr.core.CoreDescriptor.(CoreDescriptor.java:172) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.core.SolrCore.reload(SolrCore.java:683) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1507) ~[java/:?]
   [junit4]   2>... 3 more
{noformat}
...AFAICT the only way this NPE is possible is if the CoreDescriptor for the 
_original_ SolrCore is NULL at the time the watcher fires, and the only 
conceivable way that seems to be possible is if the original SolrCore hadn't 
completely finished loading.

Apparently as a result of this failure to reload, a 
SolrCoreInitializationException is recorded for the core name, and that 
ultimately causes a fast-failure response when trying to unload the core...
{noformat}
   [junit4]   2> 4310314 ERROR (qtp373709619-50629) [n:localhost:38920_solr
x:testCloudExamplePrompt_shard2_replica_n6 ] o.a.s.h.RequestHandlerBase 
org.apache.solr.common.SolrException
: Error unregistering core [testCloudExamplePrompt_shard2_replica_n6] from 
cloud state
   [junit4]   2>at 
org.apache.solr.core.CoreContainer.unload(CoreContainer.java:1672)
   [junit4]   2>at 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$1(CoreAdminOperation.java:105)
   [junit4]   2>at 
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
   [junit4]   2>at 
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
   [junit4]   2>at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
   [junit4]   2>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:200)
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:820)
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:786)
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:546)
   [junit4]   2>at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
   [junit4]   2>at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
   [junit4]   2>at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
   [junit4]   2>at 
org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:165)
   [junit4]   2>at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
   [junit4]   2>at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
   [junit4]   2>at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
   [junit4]   2>at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
   [junit4]   2>at 

[jira] [Created] (SOLR-13709) Race condition on core reload while core is still loading?

2019-08-21 Thread Hoss Man (Jira)
Hoss Man created SOLR-13709:
---

 Summary: Race condition on core reload while core is still loading?
 Key: SOLR-13709
 URL: https://issues.apache.org/jira/browse/SOLR-13709
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man



A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that 
there may be a race condition when attempting to re-load a SolrCore while the 
core is currently in the process of (re)loading that can leave the SolrCore in 
an unusable state.

Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponsesendError) before updating metrics - breaks tests

2019-08-21 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-13701.
-
Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed

> JWTAuthPlugin calls authenticationFailure (which calls 
> HttpServletResponsesendError) before updating metrics - breaks tests
> ---
>
> Key: SOLR-13701
> URL: https://issues.apache.org/jira/browse/SOLR-13701
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13701.patch
>
>
> The way JWTAuthPlugin is currently implemented, any failures are sent to the 
> remote client (via {{authenticationFailure(...)}} which calls 
> {{HttpServletResponse.sendError(...)}}) *before* 
> {{JWTAuthPlugin.doAuthenticate(...)}} has a chance to update its metrics 
> (like {{numErrors}} and {{numWrongCredentials}})
> This causes a race condition in tests where test threads can:
>  * see an error response/Exception before the server thread has updated 
> metrics (like {{numErrors}} and {{numWrongCredentials}})
>  * call white box methods like 
> {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} to assert expected 
> metrics
> ...all before the server thread has ever gotten around to being able to 
> update the metrics in question.
> {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} currently has some 
> {{"First metrics count assert failed, pausing 2s before re-attempt"}} 
> evidently to try and work around this bug, but it's still no guarantee that 
> the server thread will be scheduled before the retry happens.
> We can/should just fix JWTAuthPlugin to ensure the metrics are updated before 
> {{authenticationFailure(...)}} is called, and then remove the "pausing 2s 
> before re-attempt" logic from {{SolrCloudAuthTestCase}} - between this bug 
> fix, and the existing workaround for SOLR-13464, there should be absolutely 
> no reason to "retry" reading the metrics.
> (NOTE: BasicAuthPlugin has a similar {{authenticationFailure(...)}} method 
> that also calls {{HttpServletResponse.sendError(...)}} - but it already 
> (correctly) updates the error/failure metrics *before* calling that method.)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified

2019-08-21 Thread Hoss Man (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-13700.
-
Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed

> Race condition in initializing metrics for new security plugins when 
> security.json is modified
> --
>
> Key: SOLR-13700
> URL: https://issues.apache.org/jira/browse/SOLR-13700
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13700.patch, SOLR-13700.patch
>
>
> When new security plugins are initialized due to remote API requests, there 
> is a delay between "registering" the new plugins for use in methods like 
> {{initializeAuthenticationPlugin()}} (by assigning them to CoreContainer's 
> volatile {{this.authenticationPlugin}} variable) and when the 
> {{initializeMetrics(..)}} method is called on these plugins, so that they 
> continue to use the existing {{Metric}} instances as the plugins they are 
> replacing.
> Because these security plugins maintain local references to these Metrics (and 
> don't "get" them from the MetricRegistry every time they need to {{inc()}} 
> them) this means there is a short race condition window: after a new plugin 
> instance is put into use, but before {{initializeMetrics(..)}} is called on 
> it, the plugin is responsible for accepting/rejecting requests and records its 
> decisions in {{Metric}} instances that are not registered and subsequently get 
> thrown away (and GCed) once the CoreContainer gets around to calling 
> {{initializeMetrics(..)}} (and the plugin starts using the pre-existing metric 
> objects)
> 
> This has some noticeable impacts on auth tests on CPU-constrained jenkins 
> machines (even after putting in place SOLR-13464 work arounds) that make 
> assertions about the metrics recorded.
> In real world situations, the impact of this bug on users is minor: for a few 
> micro/milli-seconds, requests may come in w/o being counted in the auth 
> metrics -- which may also result in discrepancies between the auth metrics 
> totals and the overall request metrics.  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13650) Support for named global classloaders

2019-08-19 Thread Hoss Man (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910870#comment-16910870
 ] 

Hoss Man commented on SOLR-13650:
-

broke precommit...

https://builds.apache.org/job/Lucene-Solr-Tests-master/3590/

{noformat}
[forbidden-apis] Forbidden method invocation: java.lang.String#getBytes() [Uses 
default charset]
[forbidden-apis]   in org.apache.solr.handler.TestContainerReqHandler 
(TestContainerReqHandler.java:586)
[forbidden-apis] Scanned 4366 class file(s) for forbidden API invocations (in 
6.99s), 1 error(s).
{noformat}
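
The usual fix for that particular forbidden-apis error (the exact line in TestContainerReqHandler may differ; this is just the pattern) is to pass an explicit charset:

{code:java}
import java.nio.charset.StandardCharsets;

class CharsetFix {
  static byte[] toBytes(String s) {
    // String#getBytes() is forbidden because it silently uses the platform default
    // charset; spelling out UTF-8 keeps forbidden-apis (and precommit) happy.
    return s.getBytes(StandardCharsets.UTF_8);
  }
}
{code}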


> Support for named global classloaders
> -
>
> Key: SOLR-13650
> URL: https://issues.apache.org/jira/browse/SOLR-13650
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:json}
> curl -X POST -H 'Content-type:application/json' --data-binary '
> {
>   "add-package": {
>"name": "my-package" ,
>   "url" : "http://host:port/url/of/jar;,
>   "sha512":""
>   }
> }' http://localhost:8983/api/cluster
> {code}
> This means that Solr creates a globally accessible classloader with a name 
> {{my-package}} which contains all the jars of that package. 
> A component should be able to use the package by using the {{"package" : 
> "my-package"}}.
> eg:
> {code:json}
> curl -X POST -H 'Content-type:application/json' --data-binary '
> {
>   "create-searchcomponent": {
>   "name": "my-searchcomponent" ,
>   "class" : "my.path.to.ClassName",
>  "package" : "my-package"
>   }
> }' http://localhost:8983/api/c/mycollection/config 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified

2019-08-16 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13700:

Attachment: SOLR-13700.patch
Status: Open  (was: Open)


I've updated the patch to move 
{{pkiAuthenticationPlugin.initializeMetrics(...)}} so that it is called 
exactly once immediately after {{new PKIAuthenticationPlugin(...)}}.

precommit now passes, i'm still beasting.



[~janhoy]: I don't understand this part of your comment...

bq. ... Re point 2, your patch deletes the wrong lines for auditloggerPlugin 
metrics. ...

The only lines modified in my patch(es) that mention {{auditloggerPlugin}} is 
to move the {{auditloggerPlugin.plugin.initializeMetrics(...)}} call into the 
existing {{initializeAuditloggerPlugin(...)}} as per the point of this jira.

Can you please elaborate on what lines you think were wrong to be deleted/moved? 
... ideally with a counter-patch, or a suggested new test case demonstrating the 
problem, so there's no ambiguity as to what you mean?


> Race condition in initializing metrics for new security plugins when 
> security.json is modified
> --
>
> Key: SOLR-13700
> URL: https://issues.apache.org/jira/browse/SOLR-13700
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13700.patch, SOLR-13700.patch
>
>
> When new security plugins are initialized due to remote API requests, there 
> is a delay between "registering" the new plugins for use in methods like 
> {{initializeAuthenticationPlugin()}} (by assigning them to CoreContainer's 
> volatile {{this.authenticationPlugin}} variable) and when the 
> {{initializeMetrics(..)}} method is called on these plugins, so that they 
> continue to use the existing {{Metric}} instances as the plugins they are 
> replacing.
> Because these security plugins maintain local references to these Metrics (and 
> don't "get" them from the MetricRegistry every time they need to {{inc()}} 
> them) this means there is a short race condition window: after a new plugin 
> instance is put into use, but before {{initializeMetrics(..)}} is called on 
> it, the plugin is responsible for accepting/rejecting requests and records its 
> decisions in {{Metric}} instances that are not registered and subsequently get 
> thrown away (and GCed) once the CoreContainer gets around to calling 
> {{initializeMetrics(..)}} (and the plugin starts using the pre-existing metric 
> objects)
> 
> This has some noticeable impacts on auth tests on CPU-constrained jenkins 
> machines (even after putting in place SOLR-13464 work arounds) that make 
> assertions about the metrics recorded.
> In real world situations, the impact of this bug on users is minor: for a few 
> micro/milli-seconds, requests may come in w/o being counted in the auth 
> metrics -- which may also result in discrepancies between the auth metrics 
> totals and the overall request metrics.  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponsesendError) before updating metrics - breaks tests

2019-08-16 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909241#comment-16909241
 ] 

Hoss Man commented on SOLR-13701:
-

the beasting i've done locally so far indicates that between the SOLR-13464 
workarounds and the fix in this patch there is no need for the 2s retry...

but until we actually remove it, it will be hard to know if it's hiding other bugs 
- because we have very little visibility into how often jenkins builds are 
passing *only* because of that retry (test logs aren't kept for tests that PASS, 
so we can't grep for that log message to try and find other situations/bugs we 
don't currently know about.)

> JWTAuthPlugin calls authenticationFailure (which calls 
> HttpServletResponsesendError) before updating metrics - breaks tests
> ---
>
> Key: SOLR-13701
> URL: https://issues.apache.org/jira/browse/SOLR-13701
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13701.patch
>
>
> The way JWTAuthPlugin is currently implemented, any failures are sent to the 
> remote client (via {{authenticationFailure(...)}} which calls 
> {{HttpServletResponse.sendError(...)}}) *before* 
> {{JWTAuthPlugin.doAuthenticate(...)}} has a chance to update its metrics 
> (like {{numErrors}} and {{numWrongCredentials}})
> This causes a race condition in tests where test threads can:
>  * see an error response/Exception before the server thread has updated 
> metrics (like {{numErrors}} and {{numWrongCredentials}})
>  * call white box methods like 
> {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} to assert expected 
> metrics
> ...all before the server thread has ever gotten around to being able to 
> update the metrics in question.
> {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} currently has some 
> {{"First metrics count assert failed, pausing 2s before re-attempt"}} 
> evidently to try and work around this bug, but it's still no guarantee that 
> the server thread will be scheduled before the retry happens.
> We can/should just fix JWTAuthPlugin to ensure the metrics are updated before 
> {{authenticationFailure(...)}} is called, and then remove the "pausing 2s 
> before re-attempt" logic from {{SolrCloudAuthTestCase}} - between this bug 
> fix, and the existing workaround for SOLR-13464, there should be absolutely 
> no reason to "retry" reading the metrics.
> (NOTE: BasicAuthPlugin has a similar {{authenticationFailure(...)}} method 
> that also calls {{HttpServletResponse.sendError(...)}} - but it already 
> (correctly) updates the error/failure metrics *before* calling that method.)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponsesendError) before updating metrics - breaks tests

2019-08-15 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13701:

Attachment: SOLR-13701.patch
Status: Open  (was: Open)


Attaching patch that addresses this.  

I also updated the existing code paths that propagate the request via 
{{filterChain.doFilter(...)}} to ensure that the associated metrics ( 
{{numPassThrough}} and/or {{numAuthenticated}} ) are updated *before* 
{{filterChain.doFilter(...)}} is called, so that they are correct even if a 
subsequent filter (or ultimately, the SolrCore/RequestHandler) encounters an 
error or otherwise rejects the request.

[~janhoy] - would appreciate if you could review.

> JWTAuthPlugin calls authenticationFailure (which calls 
> HttpServletResponsesendError) before updating metrics - breaks tests
> ---
>
> Key: SOLR-13701
> URL: https://issues.apache.org/jira/browse/SOLR-13701
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13701.patch
>
>
> The way JWTAuthPlugin is currently implemented, any failures are sent to the 
> remote client (via {{authenticationFailure(...)}} which calls 
> {{HttpServletResponse.sendError(...)}}) *before* 
> {{JWTAuthPlugin.doAuthenticate(...)}} has a chance to update its metrics 
> (like {{numErrors}} and {{numWrongCredentials}})
> This causes a race condition in tests where test threads can:
>  * see an error response/Exception before the server thread has updated 
> metrics (like {{numErrors}} and {{numWrongCredentials}})
>  * call white box methods like 
> {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} to assert expected 
> metrics
> ...all before the server thread has ever gotten around to being able to 
> update the metrics in question.
> {{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} currently has some 
> {{"First metrics count assert failed, pausing 2s before re-attempt"}} 
> evidently to try and work around this bug, but it's still no guarantee that 
> the server thread will be scheduled before the retry happens.
> We can/should just fix JWTAuthPlugin to ensure the metrics are updated before 
> {{authenticationFailure(...)}} is called, and then remove the "pausing 2s 
> before re-attempt" logic from {{SolrCloudAuthTestCase}} - between this bug 
> fix, and the existing workaround for SOLR-13464, there should be absolutely 
> no reason to "retry" reading the metrics.
> (NOTE: BasicAuthPlugin has a similar {{authenticationFailure(...)}} method 
> that also calls {{HttpServletResponse.sendError(...)}} - but it already 
> (correctly) updates the error/failure metrics *before* calling that method.)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13701) JWTAuthPlugin calls authenticationFailure (which calls HttpServletResponsesendError) before updating metrics - breaks tests

2019-08-15 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13701:
---

 Summary: JWTAuthPlugin calls authenticationFailure (which calls 
HttpServletResponsesendError) before updating metrics - breaks tests
 Key: SOLR-13701
 URL: https://issues.apache.org/jira/browse/SOLR-13701
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Hoss Man


The way JWTAuthPlugin is currently implemented, any failures are sent to the 
remote client (via {{authenticationFailure(...)}} which calls 
{{HttpServletResponse.sendError(...)}}) *before* 
{{JWTAuthPlugin.doAuthenticate(...)}} has a chance to update its metrics (like 
{{numErrors}} and {{numWrongCredentials}})

This causes a race condition in tests where test threads can:
 * see an error response/Exception before the server thread has updated metrics 
(like {{numErrors}} and {{numWrongCredentials}})
 * call white box methods like 
{{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} to assert expected 
metrics

...all before the server thread has ever gotten around to being able to update 
the metrics in question.

{{SolrCloudAuthTestCase.assertAuthMetricsMinimums(...)}} currently has some 
{{"First metrics count assert failed, pausing 2s before re-attempt"}} evidently 
to try and work around this bug, but it's still no guarantee that the server 
thread will be scheduled before the retry happens.

We can/should just fix JWTAuthPlugin to ensure the metrics are updated before 
{{authenticationFailure(...)}} is called, and then remove the "pausing 2s 
before re-attempt" logic from {{SolrCloudAuthTestCase}} - between this bug fix, 
and the existing workaround for SOLR-13464, there should be absolutely no 
reason to "retry" reading the metrics.

(NOTE: BasicAuthPlugin has a similar {{authenticationFailure(...)}} method that 
also calls {{HttpServletResponse.sendError(...)}} - but it already (correctly) 
updates the error/failure metrics *before* calling that method.)
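
A tiny sketch of the ordering being asked for (hypothetical method and metric names, with a Dropwizard {{Meter}} as a stand-in for whatever counter type the plugin actually uses):

{code:java}
import java.io.IOException;

import javax.servlet.http.HttpServletResponse;

import com.codahale.metrics.Meter;

class AuthFailureOrderingSketch {
  private final Meter numWrongCredentials = new Meter();

  void rejectWrongCredentials(HttpServletResponse response, String msg) throws IOException {
    // 1. record the outcome first, so a test that observes the error response
    //    can immediately trust the metrics...
    numWrongCredentials.mark();
    // 2. ...and only then tell the client.
    response.sendError(HttpServletResponse.SC_UNAUTHORIZED, msg);
  }
}
{code}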



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified

2019-08-15 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13700:

  Assignee: Hoss Man
Attachment: SOLR-13700.patch
Status: Open  (was: Open)


Attaching a patch to address this, with one nocommit regarding something that makes 
no sense to me...

[~ab] - can you please review and sanity check:

# that my understanding is correct, and that it's "safe" (and more correct) for 
the "old" and "new" instances of these plugins to be using the same Metric 
instances before the "new" plugin replaces the old one
# the nocommit comments -- unless i'm missing something 
{{reloadSecurityProperties()}} has no business calling 
{{pkiAuthenticationPlugin.initializeMetrics(...)}}, because 
{{pkiAuthenticationPlugin}} can never change as a result of reloading the 
security.json ... so {{pkiAuthenticationPlugin.initializeMetrics(...)}} should 
be called exactly once (and only once) for its entire lifecycle ... ideally in 
{{CoreContainer.load()}} immediately after calling {{pkiAuthenticationPlugin = 
new PKIAuthenticationPlugin(...)}}



> Race condition in initializing metrics for new security plugins when 
> security.json is modified
> --
>
> Key: SOLR-13700
> URL: https://issues.apache.org/jira/browse/SOLR-13700
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13700.patch
>
>
> When new security plugins are initialized due to remote API requests, there 
> is a delay between "registering" the new plugins for use in methods like 
> {{initializeAuthenticationPlugin()}} (by assigning them to CoreContainer's 
> volatile {{this.authenticationPlugin}} variable) and when the 
> {{initializeMetrics(..)}} method is called on these plugins, so that they 
> continue to use the existing {{Metric}} instances as the plugins they are 
> replacing.
> Because these security plugins maintain local references to these Metrics (and 
> don't "get" them from the MetricRegistry every time they need to {{inc()}} 
> them) this means there is a short race condition window: after a new plugin 
> instance is put into use, but before {{initializeMetrics(..)}} is called on 
> it, the plugin is responsible for accepting/rejecting requests and records its 
> decisions in {{Metric}} instances that are not registered and subsequently get 
> thrown away (and GCed) once the CoreContainer gets around to calling 
> {{initializeMetrics(..)}} (and the plugin starts using the pre-existing metric 
> objects)
> 
> This has some noticeable impacts on auth tests on CPU-constrained jenkins 
> machines (even after putting in place SOLR-13464 work arounds) that make 
> assertions about the metrics recorded.
> In real world situations, the impact of this bug on users is minor: for a few 
> micro/milli-seconds, requests may come in w/o being counted in the auth 
> metrics -- which may also result in discrepancies between the auth metrics 
> totals and the overall request metrics.  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13700) Race condition in initializing metrics for new security plugins when security.json is modified

2019-08-15 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13700:
---

 Summary: Race condition in initializing metrics for new security 
plugins when security.json is modified
 Key: SOLR-13700
 URL: https://issues.apache.org/jira/browse/SOLR-13700
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man


When new security plugins are initialized due to remote API requests, there is 
a delay between "registering" the new plugins for use in methods like 
{{initializeAuthenticationPlugin()}} (by assigning them to CoreContainer's 
volatile {{this.authenticationPlugin}} variable) and when the 
{{initializeMetrics(..)}} method is called on these plugins, so that they 
continue to use the existing {{Metric}} instances as the plugins they are 
replacing.

Because these security plugins maintain local references to these Metrics (and 
don't "get" them from the MetricRegistry every time they need to {{inc()}} them) 
this means there is a short race condition window: after a new plugin instance 
is put into use, but before {{initializeMetrics(..)}} is called on it, the 
plugin is responsible for accepting/rejecting requests and records its decisions 
in {{Metric}} instances that are not registered and subsequently get thrown away 
(and GCed) once the CoreContainer gets around to calling {{initializeMetrics(..)}} 
(and the plugin starts using the pre-existing metric objects)



This has some noticeable impacts on auth tests on CPU-constrained jenkins 
machines (even after putting in place SOLR-13464 work arounds) that make 
assertions about the metrics recorded.

In real world situations, the impact of this bug on users is minor: for a few 
micro/milli-seconds, requests may come in w/o being counted in the auth metrics 
-- which may also result in discrepancies between the auth metrics totals and 
the overall request metrics.  
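
The race is easier to see in a stripped-down form; the types below are hypothetical stand-ins, not Solr's actual classes, and only illustrate the ordering that avoids the thrown-away metrics:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

class PluginSwapSketch {

  /** Hypothetical stand-in for an auth plugin that counts what it does. */
  interface AuthPlugin {
    void useSharedCounter(AtomicLong counter);
  }

  private final AtomicLong numRequests = new AtomicLong(); // long-lived, registered metric
  private volatile AuthPlugin authenticationPlugin;        // what request threads read

  void install(AuthPlugin fresh) {
    // Hand the existing (registered) metric to the new instance *before*
    // publishing it; otherwise requests served in between are counted in a
    // private metric that is later thrown away and GCed.
    fresh.useSharedCounter(numRequests);
    this.authenticationPlugin = fresh;
  }
}
{code}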



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13696) DimensionalRoutedAliasUpdateProcessorTest / RoutedAliasUpdateProcessorTest failures due commitWithin/openSearcher delays

2019-08-14 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907644#comment-16907644
 ] 

Hoss Man commented on SOLR-13696:
-

Gus: can you please take a look at this?

based on my assessment, here are the crucial bits of the log..
{noformat}
hossman@tray:~/tmp/jenkins/DimensionalRoutedAliasUpdateProcessorTest$ grep 
testTimeCat__TRA__2019-07-05__CRA__calico 
thetaphi_Lucene-Solr-8.x-MacOSX_272.log.txt | egrep '(Opening 
\[Searcher|add=\[21|fq=cat_s:calico|\{\!terms\+f%3Did}21,20.*hits=2)'
   [junit4]   2> 4476175 INFO  (qtp759508539-75005) [n:127.0.0.1:55915_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node5 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n2 ] 
o.a.s.s.SolrIndexSearcher Opening 
[Searcher@9bc49f7[testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n2] 
main]
   [junit4]   2> 4476176 INFO  (qtp1536738594-75022) [n:127.0.0.1:55916_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node3 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1 ] 
o.a.s.s.SolrIndexSearcher Opening 
[Searcher@5547583d[testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1] 
main]
   [junit4]   2> 4476186 INFO  (qtp1998715126-75500) [n:127.0.0.1:55917_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] 
o.a.s.s.SolrIndexSearcher Opening 
[Searcher@3d40f0e1[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] 
main]
   [junit4]   2> 4476195 INFO  (qtp927691752-75020) [n:127.0.0.1:55918_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node6 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4 ] 
o.a.s.s.SolrIndexSearcher Opening 
[Searcher@18c82bc1[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4] 
main]
   [junit4]   2> 4477375 INFO  (qtp1998715126-75016) [n:127.0.0.1:55917_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] 
o.a.s.u.p.LogUpdateProcessorFactory 
[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7]  webapp=/solr 
path=/update 
params={update.distrib=FROMLEADER=http://127.0.0.1:55918/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4/=javabin=2}{add=[21
 (1641811095092985856)]} 0 2
   [junit4]   2> 4477960 INFO  (qtp927691752-75506) [n:127.0.0.1:55918_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node6 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4 ] 
o.a.s.u.p.LogUpdateProcessorFactory 
[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4]  webapp=/solr 
path=/update 
params={update.distrib=NONE=_text_=TOLEADER=http://127.0.0.1:55918/solr/testTimeCat__TRA__2019-07-02__CRA__calico_shard2_replica_n6/=javabin=2=inc}{add=[21
 (1641811095092985856)]} 0 590
   [junit4]   2> 4477962 INFO  (commitScheduler-24384-thread-1) [ ] 
o.a.s.s.SolrIndexSearcher Opening 
[Searcher@745b6c94[testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4] 
main]
   [junit4]   2> 4478213 INFO  (qtp1998715126-75501) [n:127.0.0.1:55917_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] 
o.a.s.c.S.Request [testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] 
 webapp=/solr path=/select 
params={q={!terms+f%3Did}21,20=0=javabin=2} hits=2 status=0 
QTime=13
   [junit4]   2> 4478408 INFO  (qtp1998715126-75016) [n:127.0.0.1:55917_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard2 r:core_node8 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7 ] 
o.a.s.c.S.Request [testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7] 
 webapp=/solr path=/select 
params={df=_text_=false=id=score=516=0=true=cat_s:calico=http://127.0.0.1:55917/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n7/|http://127.0.0.1:55918/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard2_replica_n4/=0=2=*:*=true=false=1565753074817=true=javabin=timestamp_dt}
 hits=0 status=0 QTime=0
   [junit4]   2> 4478408 INFO  (qtp1536738594-75032) [n:127.0.0.1:55916_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node3 
x:testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1 ] 
o.a.s.c.S.Request [testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1] 
 webapp=/solr path=/select 
params={df=_text_=false=id=score=516=0=true=cat_s:calico=http://127.0.0.1:55916/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n1/|http://127.0.0.1:55915/solr/testTimeCat__TRA__2019-07-05__CRA__calico_shard1_replica_n2/=0=2=*:*=true=false=1565753074817=true=javabin=timestamp_dt}
 hits=0 status=0 QTime=0
   [junit4]   2> 4478408 INFO  (qtp1536738594-75031) [n:127.0.0.1:55916_solr 
c:testTimeCat__TRA__2019-07-05__CRA__calico s:shard1 r:core_node3 

[jira] [Created] (SOLR-13696) DimensionalRoutedAliasUpdateProcessorTest / RoutedAliasUpdateProcessorTest failures due commitWithin/openSearcher delays

2019-08-14 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13696:
---

 Summary: DimensionalRoutedAliasUpdateProcessorTest / 
RoutedAliasUpdateProcessorTest failures due commitWithin/openSearcher delays
 Key: SOLR-13696
 URL: https://issues.apache.org/jira/browse/SOLR-13696
 Project: Solr
  Issue Type: Test
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Gus Heck
 Attachments: thetaphi_Lucene-Solr-8.x-MacOSX_272.log.txt

Recent jenkins failure...
Build: https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-MacOSX/272/
Java: 64bit/jdk1.8.0 -XX:-UseCompressedOops -XX:+UseParallelGC
{noformat}
Stack Trace:
java.lang.AssertionError: expected:<16> but was:<15>
at 
__randomizedtesting.SeedInfo.seed([DB6DC28D5560B1D2:E295833E1541FDB9]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at
org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest.assertCatTimeInvariants(DimensionalRoutedAliasUpdateProcessorTest.java:677
)
at 
org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest.testTimeCat(DimensionalRoutedAliasUpdateProcessorTest.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{noformat}

Digging into the logs, the problem appears to be in the way the test 
verifies/assumes docs have been committed.
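
One way to make that kind of assertion robust (a sketch only, not what the test currently does) is to poll until the expected docs are actually visible, instead of assuming the commitWithin/openSearcher has already happened:

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

class WaitForDocs {
  static void waitForNumFound(SolrClient client, String collection, String q,
                              long expected, long timeoutMs) throws Exception {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (System.nanoTime() < deadline) {
      long found = client.query(collection, new SolrQuery(q)).getResults().getNumFound();
      if (found == expected) {
        return;
      }
      Thread.sleep(250); // the commitWithin may simply not have opened a searcher yet
    }
    throw new AssertionError("Timed out waiting for " + expected + " docs matching " + q);
  }
}
{code}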



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Reopened] (SOLR-13688) Make the bin/solr export command to run one thread per shard

2019-08-13 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-13688:
-

8x doesn't compile...

https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-MacOSX/271/
{noformat}
Build Log:
[...truncated 12279 lines...]
[javac] Compiling 1284 source files to 
/Users/jenkins/workspace/Lucene-Solr-8.x-MacOSX/solr/build/solr-core/classes/java
[javac] 
/Users/jenkins/workspace/Lucene-Solr-8.x-MacOSX/solr/core/src/java/org/apache/solr/util/ExportTool.java:312:
 error: cannot infer type
arguments for BiConsumer
[javac] private BiConsumer bic= new BiConsumer<>() {
[javac]   ^
[javac]   reason: '<>' with anonymous inner classes is not supported in 
-source 8
[javac] (use -source 9 or higher to enable '<>' with anonymous inner 
classes)
[javac]   where T,U are type-variables:
[javac] T extends Object declared in interface BiConsumer
[javac] U extends Object declared in interface BiConsumer
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error


{noformat}
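
(For reference, a minimal sketch of the language rule javac is complaining about: the diamond operator on an anonymous inner class only compiles with -source 9 or higher, so on -source 8 the type arguments have to be spelled out.  The String/Long parameters below are placeholders, not ExportTool's actual types:)

{code:java}
import java.util.function.BiConsumer;

public class DiamondAnonSketch {
  // Rejected by javac with -source 8: '<>' on an anonymous inner class needs Java 9+.
  //   private BiConsumer<String, Long> bic = new BiConsumer<>() { ... };

  // Java 8 compatible: spell out the type arguments (or use a lambda instead of an
  // anonymous class).
  private BiConsumer<String, Long> bic = new BiConsumer<String, Long>() {
    @Override
    public void accept(String key, Long value) {
      // ...
    }
  };
}
{code}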

> Make the bin/solr export command to run one thread per shard
> 
>
> Key: SOLR-13688
> URL: https://issues.apache.org/jira/browse/SOLR-13688
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.3
>
>
> This can be run in parallel with one dedicated thread for each shard and 
> (distrib=false) option
> this will be the only option



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13694) IndexSizeEstimator NullPointerException

2019-08-13 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13694:
---

 Summary: IndexSizeEstimator NullPointerException
 Key: SOLR-13694
 URL: https://issues.apache.org/jira/browse/SOLR-13694
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Andrzej Bialecki 



Jenkins found a reproducible seed for triggering an NPE in IndexSizeEstimatorTest

Based on a little experimental tracing i did, this might be a real bug in 
IndexSizeEstimator? ... it's calling close on StoredFieldsReader instances it 
gets from the CodecReader -- but AFAICT from the docs/code i'm not certain if 
it should be doing this.  It appears the expectation is that this is direct 
access to internal state, which will automatically be closed when the 
CodecReader is closed.

ie: IndexSizeEstimator is closing the StoredFieldsReader prematurely, causing it 
to be unusable on the next iteration.

(I didn't dig in far enough to guess if there are other places in the 
IndexSizeEstimator code that are closing CodecReader internals prematurely as 
well, or just in this situation ... it's also not clear if this only causes 
failures because this seed uses SimpleTextCodec, and other codecs are more 
forgiving -- or if something else about the index(es) generated for this seed 
is what causes the problem to manifest)
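
(A hedged sketch of the access pattern described above, assuming the Lucene 8.x APIs that show up in the stack trace below; the point is that the {{StoredFieldsReader}} is internal state of the {{CodecReader}} and the caller should not close it:)

{code:java}
import java.io.IOException;

import org.apache.lucene.codecs.StoredFieldsReader;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.StoredFieldVisitor;

public class StoredFieldsAccessSketch {
  /** Walk every doc's stored fields via the reader's internal StoredFieldsReader. */
  static void visitAllDocs(CodecReader reader, StoredFieldVisitor visitor) throws IOException {
    StoredFieldsReader fieldsReader = reader.getFieldsReader();
    for (int docID = 0; docID < reader.maxDoc(); docID++) {
      fieldsReader.visitDocument(docID, visitor);
    }
    // Deliberately no fieldsReader.close() here: the instance is owned by the
    // CodecReader and is released when the reader itself is closed.
  }
}
{code}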

http://fucit.org/solr-jenkins-reports/job-data/apache/Lucene-Solr-NightlyTests-master/1928
{noformat}
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ git rev-parse HEAD
0291db44bc8e092f7cb2f577f0ac8ab6fa6a5fd7
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test  
-Dtestcase=IndexSizeEstimatorTest -Dtests.method=testEstimator 
-Dtests.seed=23F60434E13D8FD4 -Dtests.multiplier=2 -Dtests.nightly=true 
-Dtests.slow=true  -Dtests.locale=eo -Dtests.timezone=Atlantic/Madeira 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
...
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=IndexSizeEstimatorTest -Dtests.method=testEstimator 
-Dtests.seed=23F60434E13D8FD4 -Dtests.multiplier=2 -Dtests.nightly=true 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=eo 
-Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.88s | IndexSizeEstimatorTest.testEstimator <<<
   [junit4]> Throwable #1: java.lang.NullPointerException
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([23F60434E13D8FD4:EC2B6B666D451E64]:0)
   [junit4]>at 
org.apache.lucene.codecs.simpletext.SimpleTextStoredFieldsReader.visitDocument(SimpleTextStoredFieldsReader.java:109)
   [junit4]>at 
org.apache.solr.handler.admin.IndexSizeEstimator.estimateStoredFields(IndexSizeEstimator.java:513)
   [junit4]>at 
org.apache.solr.handler.admin.IndexSizeEstimator.estimate(IndexSizeEstimator.java:198)
   [junit4]>at 
org.apache.solr.handler.admin.IndexSizeEstimatorTest.testEstimator(IndexSizeEstimatorTest.java:117)
   [junit4]>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]>at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]>at 
java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]>at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13464) no way for external clients to detect when changes to security config have taken effect

2019-08-12 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13464:

Description: 
The basic functionality of the authorization/authentication REST APIs works by 
persisting changes to a {{security.json}} file in ZooKeeper which is monitored 
by every node via a Watcher.  When the watchers fire, the affected plugin types 
are (re)-initialized with the new settings.
Since this information is "pulled" from ZK by the nodes, there is a (small) 
inherent delay between when the REST API is hit by external clients, and when 
each node learns of the changes.  An additional delay exists as the config is 
"reloaded" to (re)initialize the plugins.

Practically speaking these delays have very little impact on a "real" solr 
cloud cluster, but they can be problematic in test cases -- while the 
SecurityConfHandler on each node could be used to query the "current" 
security.json file, it doesn't indicate if/when the plugins identified in the 
"current" configuration are fully in use.

For now, we have a "white box" work around available for MiniSolrCloudCluster 
based tests by comparing the Plugins of each CoreContainer in use before and 
after making known changes via the API (see commits identified below).
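
(For illustration only -- not the exact helper that was committed -- a sketch of that kind of "white box" check: capture the plugin instance a node is using before hitting the security API, then poll until the node swaps in a new instance.  {{CoreContainer#getAuthenticationPlugin()}} and the poll/timeout shape are assumptions here:)

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.solr.core.CoreContainer;
import org.apache.solr.security.AuthenticationPlugin;

public class SecurityReloadCheckSketch {
  /**
   * Capture the AuthenticationPlugin a node is using *before* POSTing to the
   * security API, then poll until the node has re-initialized a new instance
   * (i.e. its security.json watcher fired).  Identity comparison, not equals().
   */
  static void waitForAuthnPluginSwap(CoreContainer cc, AuthenticationPlugin before,
                                     long timeoutMs) throws InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (cc.getAuthenticationPlugin() == before) {
      if (System.nanoTime() > deadline) {
        throw new AssertionError("node never picked up the new security.json");
      }
      Thread.sleep(100);
    }
  }
}
{code}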

This issue exists as a placeholder for future consideration of UX/API 
improvements making it easier for external clients (w/o "white box" access to 
solr internals) to know definitively if/when modified security settings take 
effect.


{panel:title=original jira description}
I've been investigating some sporadic and hard-to-reproduce test failures 
related to authentication in cloud mode, and i *think* (but have not directly 
verified) that the common cause is that after using one of the 
{{/admin/auth...}} handlers to update some setting, there is an inherent and 
unpredictable delay (due to ZK watches) until every node in the cluster has had 
a chance to (re)load the new configuration and initialize the various security 
plugins with the new settings.

Which means, if a test client does a POST to some node to add/change/remove 
some authn/authz settings, and then immediately hits the exact same node (or 
any other node) to test that the effects of those settings exist, there is no 
guarantee that they will have taken effect yet.
{panel}

  was:
I've been investigating some sporadic and hard-to-reproduce test failures 
related to authentication in cloud mode, and i *think* (but have not directly 
verified) that the common cause is that after using one of the 
{{/admin/auth...}} handlers to update some setting, there is an inherent and 
unpredictable delay (due to ZK watches) until every node in the cluster has had 
a chance to (re)load the new configuration and initialize the various security 
plugins with the new settings.

Which means, if a test client does a POST to some node to add/change/remove 
some authn/authz settings, and then immediately hits the exact same node (or 
any other node) to test that the effects of those settings exist, there is no 
guarantee that they will have taken effect yet.


 Issue Type: Improvement  (was: Bug)
Summary: no way for external clients to detect when changes to security 
config have taken effect  (was: Sporadic Auth + Cloud test failures, probably 
due to lag in nodes reloading security config)

since i was able to come up with a test workaround, i've shifted the type, 
summary, and description of this Jira to focus on future UX/API improvements 
for external clients

> no way for external clients to detect when changes to security config have 
> taken effect
> ---
>
> Key: SOLR-13464
> URL: https://issues.apache.org/jira/browse/SOLR-13464
> Project: Solr
>  Issue Type: Improvement
>Reporter: Hoss Man
>Priority: Major
>
> The basic functionality of the authorization/authentication REST APIs works 
> by persisting changes to a {{security.json}} file in ZooKeeper which is 
> monitored by every node via a Watcher.  When the watchers fire, the affected 
> plugin types are (re)-initialized with the new settings.
> Since this information is "pulled" from ZK by the nodes, there is a (small) 
> inherent delay between when the REST API is hit by external clients, and when 
> each node learns of the changes.  An additional delay exists as the config is 
> "reloaded" to (re)initialize the plugins.
> Practically speaking these delays have very little impact on a "real" solr 
> cloud cluster, but they can be problematic in test cases -- while the 
> SecurityConfHandler on each node could be used to query the "current" 
> security.json file, it doesn't indicate if/when the plugins identified in the 
> "current" configuration are fully in use.
> For now, we have a "white box" work around available for 

[jira] [Commented] (SOLR-9658) Caches should have an optional way to clean if idle for 'x' mins

2019-08-09 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904200#comment-16904200
 ] 

Hoss Man commented on SOLR-9658:


* i should have noticed/mentioned this in the last patch but: any method 
(including your new {{markAndSweepByIdleTime()}}) that expects to be called 
only when markAndSweepLock is already held should really start with {{assert 
markAndSweepLock.isHeldByCurrentThread();}}

 * this patch still seems to modify TestJavaBinCodec unnecessarily? (now that 
you re-added the backcompat constructor)

 * i don't really think it's a good idea to add these {{CacheListener}} / 
{{EvictionListener}} APIs at this point w/o a lot more consideration of their 
lifecycle / usage
 ** I know you introduced them in response to my suggestion to add hooks for 
monitoring in tests, but they don't _currently_ seem more useful in the tests 
than some of the specific suggestions i made before (more comments on this 
below) and the APIs don't seem to be thought through enough to be generally 
useful later w/o a lot of re-working...
 *** Examples: if the point of creating {{CacheListener}} now is to be able to 
add more methods/hooks to it later, then why is only {{EvictionListener}} 
passed down to the {{ConcurrentXXXCache}} impls instead of the entire 
{{CacheListener}} ?
 *** And why are there 2 distinct {{EvictionListener}} interfaces, instead of 
just a common one?
 ** ... so it would probably be safer/cleaner to avoid adding these APIs now 
since there are simpler alternatives available for the tests?

 * Re: "...plus adding support for artificially "advancing" the time" ... this 
seems overly complex?
 ** None of the suggestions i made for improving the reliability/coverage of 
the test require faking the "now" clock: just being able to insert 
synthetic entries into the cache with artificially old timestamps – which could 
be done by refactoring out the middle of {{put(...)}} into a new 
{{putCacheEntry(CacheEntry ... )}} method that would let the (test) caller set 
an arbitrary {{lastAccessed}} value...
{code:java}
/**
 * Usable by tests to create synthetic cache entries, also called by {@link #put}
 * @lucene.internal
 */
public CacheEntry<K, V> putCacheEntry(CacheEntry<K, V> e) {
  // assumes the CacheEntry carries its own key, so the entry can be inserted as-is
  CacheEntry<K, V> oldCacheEntry = map.put(e.key, e);
  int currentSize; // mirrors put(); the caller decides whether to trigger a cleanup
  if (oldCacheEntry == null) {
    currentSize = stats.size.incrementAndGet();
    ramBytes.addAndGet(e.ramBytesUsed() + HASHTABLE_RAM_BYTES_PER_ENTRY); // added key + value + entry
  } else {
    currentSize = stats.size.get();
    ramBytes.addAndGet(-oldCacheEntry.ramBytesUsed());
    ramBytes.addAndGet(e.ramBytesUsed());
  }
  if (islive) {
    stats.putCounter.increment();
  } else {
    stats.nonLivePutCounter.increment();
  }
  return oldCacheEntry;
}
{code}

 ** ...that way tests could "set up" a cache containing arbitrary entries (of 
arbitrary size, with arbitrary create/access times that could be from weeks in 
the past) and then very precisely inspect the results of the cache after 
calling {{markAndSweep()}}
 *** or some other new {{triggerCleanupIfNeeded()}} method that can encapsulate 
all of the existing {{// Check if we need to clear out old entries from the 
cache ...}} logic currently at the end of {{put()}}

 * In general, i really think testing of functionality like this should focus 
on testing "what exactly happens when markAndSweep() is called on a cache 
containing a very specific set of values?" independent from "does markAndSweep() 
get called eventually & automatically if i configure maxIdleTime?"
 ** the former can be tested w/o the need of any cleanup threads or faking the 
TimeSource
 ** the latter can be tested w/o the need of a {{CacheListener}} or 
{{EvictionListener}} API (or a fake TimeSource) – just create an anonymous 
subclass of {{ConcurrentXXXCache}} whose markAndSweep() method decrements a 
CountDownLatch that the test thread is waiting on (see the sketch at the end of 
this comment)
 ** isolating the testing of these different concepts not only makes it easier 
to test more complex aspects of how {{markAndSweep()}} is expected to work (ie: 
"assert exactly which entries are removed if the sum of the sizes == X == 
(ramUpperWatermark + Y) but the two smallest entries (whose total size = Y + 1) 
are the only ones with an accessTime older than the idleTime") but also makes it 
easier to understand & debug failures down the road -if- _when_ they happen.
 *** as things stand in your patch, -if- _when_ the "did not evict entries in 
time" assert (eventually) trips in a future jenkins build, we won't immediately 
be able to tell (w/o added logging) if that's because of a bug in the 
{{CleanupThread}} that prevented it from calling {{markAndSweep()}}; or a bug 
in {{SimTimeSource.advanceMs()}}; or a bug somewhere in the cache that 
prevented {{markAndSweep()}} from recognizing those entries were old; or just a 
heavily loaded VM CPU 
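
(A minimal sketch of that latch-based hook -- {{IdleCache}} below is just a stand-in for the real ConcurrentXXXCache, and its constructor/visibility details are assumptions; the point is that the test awaits a latch instead of sleeping:)

{code:java}
import static org.junit.Assert.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class MarkAndSweepLatchSketch {

  /** Stand-in for the real ConcurrentXXXCache; only the overridable hook matters here. */
  static class IdleCache {
    public void markAndSweep() { /* evict idle / over-limit entries */ }
  }

  /** Anonymous-subclass style hook: the sweep trips a latch the test can await. */
  static class ObservableCache extends IdleCache {
    final CountDownLatch swept = new CountDownLatch(1);
    @Override
    public void markAndSweep() {
      super.markAndSweep();
      swept.countDown(); // signal the waiting test thread
    }
  }

  void assertCleanupThreadEventuallySweeps(ObservableCache cache) throws InterruptedException {
    // ... configure maxIdleTime and start the cleanup thread against 'cache' here ...
    assertTrue("cleanup thread never called markAndSweep()",
        cache.swept.await(30, TimeUnit.SECONDS));
  }
}
{code}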

[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-08-08 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903409#comment-16903409
 ] 

Hoss Man commented on SOLR-13399:
-

i would assume it's related to the (numSubShards) changes in SplitShardCmd ?

At first glance, that code path looks like it's specific to SPLIT_BY_PREFIX, 
but apparently your previous commit has it defaulting to "true" ? (see 
SplitShardCmd.java L212)
{noformat}
$ git show 19ddcfd282f3b9eccc50da83653674e510229960 -- 
core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java | cat
commit 19ddcfd282f3b9eccc50da83653674e510229960
Author: yonik 
Date:   Tue Aug 6 14:09:54 2019 -0400

SOLR-13399: ability to use id field for compositeId histogram

diff --git 
a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java 
b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
index 4d623be..6c5921e 100644
--- 
a/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
+++ 
b/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java
@@ -212,16 +212,14 @@ public class SplitShardCmd implements 
OverseerCollectionMessageHandler.Cmd {
   if (message.getBool(CommonAdminParams.SPLIT_BY_PREFIX, true)) {
 t = timings.sub("getRanges");
 
-log.info("Requesting split ranges from replica " + 
parentShardLeader.getName() + " as part of slice " + slice + " of collection "
-+ collectionName + " on " + parentShardLeader);
-
 ModifiableSolrParams params = new ModifiableSolrParams();
 params.set(CoreAdminParams.ACTION, 
CoreAdminParams.CoreAdminAction.SPLIT.toString());
 params.set(CoreAdminParams.GET_RANGES, "true");
 params.set(CommonAdminParams.SPLIT_METHOD, splitMethod.toLower());
 params.set(CoreAdminParams.CORE, parentShardLeader.getStr("core"));
-int numSubShards = message.getInt(NUM_SUB_SHARDS, 
DEFAULT_NUM_SUB_SHARDS);
-params.set(NUM_SUB_SHARDS, Integer.toString(numSubShards));
+// Only 2 is currently supported
+// int numSubShards = message.getInt(NUM_SUB_SHARDS, 
DEFAULT_NUM_SUB_SHARDS);
+// params.set(NUM_SUB_SHARDS, Integer.toString(numSubShards));
 
 {
   final ShardRequestTracker shardRequestTracker = 
ocmh.asyncRequestTracker(asyncId);
@@ -236,7 +234,7 @@ public class SplitShardCmd implements 
OverseerCollectionMessageHandler.Cmd {
 NamedList shardRsp = (NamedList)successes.getVal(0);
 String splits = (String)shardRsp.get(CoreAdminParams.RANGES);
 if (splits != null) {
-  log.info("Resulting split range to be used is " + splits);
+  log.info("Resulting split ranges to be used: " + splits + " 
slice=" + slice + " leader=" + parentShardLeader);
   // change the message to use the recommended split ranges
   message = message.plus(CoreAdminParams.RANGES, splits);
 }

{noformat}
 

 (I could be totally off base though -- i don't really understand 90% of what 
this test is doing, and the place where it fails doesn't seem to be trying to 
split into more than 2 subshards, so even if the SplitShardCmd changes i 
pointed out are buggy, i'm not sure why it would cause this particular failure)

 

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch, SOLR-13399_useId.patch, 
> ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout.) . Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e lucene keeps track of 
> the doc 

[jira] [Updated] (SOLR-13399) compositeId support for shard splitting

2019-08-08 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13399:

Attachment: ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
Status: Reopened  (was: Reopened)


git bisect has identified 19ddcfd282f3b9eccc50da83653674e510229960 as the cause 
of recent (reproducible) jenkins test failures in ShardSplitTest...

https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-NightlyTests-8.x/174/
https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-repro/3507/

(Jenkins found the failures on branch_8x, but i was able to reproduce the same 
exact seed on master, and used that branch for bisecting.  Attaching logs from 
my local master run.)

{noformat}
ant test -Dtestcase=ShardSplitTest -Dtests.method=test 
-Dtests.seed=AE04B5C9BA6E9A4 -Dtests.multiplier=2 -Dtests.nightly=true 
-Dtests.slow=true -Dtests.badapples=true  -Dtests.locale=sr-Latn 
-Dtests.timezone=Etc/GMT-11 -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
{noformat}

{noformat}
   [junit4] FAILURE  273s J2 | ShardSplitTest.test <<<
   [junit4]> Throwable #1: java.lang.AssertionError: Wrong doc count on 
shard1_0. See SOLR-5309 expected:<257> but was:<316>
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([AE04B5C9BA6E9A4:82B47486355A845C]:0)
   [junit4]>at 
org.apache.solr.cloud.api.collections.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:1002)
   [junit4]>at 
org.apache.solr.cloud.api.collections.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:794)
   [junit4]>at 
org.apache.solr.cloud.api.collections.ShardSplitTest.test(ShardSplitTest.java:111)
   [junit4]>at 
org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:1082)
   [junit4]>at 
org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:1054)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
{noformat}


> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch, SOLR-13399_useId.patch, 
> ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout.) . Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9658) Caches should have an optional way to clean if idle for 'x' mins

2019-08-07 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902561#comment-16902561
 ] 

Hoss Man commented on SOLR-9658:


* i don't see anything that updates {{oldestEntryNs}} except 
{{markAndSweepByIdleTime}} ?
 ** this means that {{markAndSweep()}} may unnecessarily call 
{{markAndSweepByIdleTime()}} (looping over every entry) even if everything 
older than the maxIdleTime has already been purged by earlier method calls like 
{{markAndSweepByCacheSize()}} or {{markAndSweepByRamSize()}}
 ** off the top of my head, i can't think of an efficient way to "update" 
{{oldestEntryNs}} in some place like {{postRemoveEntry()}} w/o scanning every 
cache entry again, but...
 ** why not move {{markAndSweepByIdleTime()}} _before_ 
{{markAndSweepByCacheSize()}} and {{markAndSweepByRamSize()}} ?
 *** since the {{postRemoveEntry()}} calls made as a result of any eviction due 
to idle time *can* (and already do) efficiently update the results of 
{{size()}} and {{ramBytesUsed()}}, which could potentially save the need for 
those additional scans of the cache in many situations.

 * rather than complicating the patch by changing the constructor of the 
{{CleanupThread}} class(es) to take in the maxIdle values directly, why not 
read that info from a (new) method on the ConcurrentXXXCache objects already 
passed to the constructors?
 ** with some small tweaks to the while loop, the {{wait()}} call could actually 
read this value dynamically from the cache element, eliminating the need to 
call {{setRunCleanupThread()}} from inside {{setMaxIdleTime()}} in the event 
that the value is changed dynamically.
 *** which is currently broken anyway since {{setRunCleanupThread()}} is 
currently a No-Op if {{this.runCleanupThread}} is true and {{cleanupThread}} is 
already non-null.
 ** assuming {{CleanupThread}} is changed to dynamically read the maxIdleTime 
directly from the cache, {{setMaxIdleTime()}} could just call {{wakeThread()}} 
if the new maxIdleTime is less than the old maxIdleTime
 *** or leave the call to {{setRunCleanupThread()}} as is, but change the {{if 
(cleanupThread == null)}} condition of {{setRunCleanupThread()}} to have an 
"else" code path that calls {{wakeThread()}} so it will call {{markAndSweep()}} 
(with the updated settings) and then re-wait (with the new maxIdleTime)

 * although not likely to be problematic in practice, you've broken backcompat 
on the public "ConcurrentXXXCache" class(es) by adding an arg to the 
constructor.
 ** i would suggest adding a new constructor instead, and making the old one 
call the new one with "-1" – if for no other reason than to simplify the touch 
points / discussion in the patch...
 ** ie: in order to make this change, you had to modify both 
{{TestJavaBinCodec}} and {{TemplateUpdateProcessorFactory}} – but you wound up 
not using a backcompat equivalent value in {{TemplateUpdateProcessorFactory}}, 
so your changes actually modify the behavior of that (end user facing) class in 
an undocumented way (that users can't override, and may actually have some 
noticeable performance impacts on "put" since that existing usage doesn't 
involve the cleanup thread) which should be discussed before committing (but 
is largely unrelated to the goals in this jira)

 * under no circumstances should we be committing new test code that makes 
arbitrary {{Thread.sleep(5000)}} calls
 ** i am willing to say categorically that this approach: DOES. NOT. WORK. – 
and has represented an overwhelming percentage of the root causes of our tests 
being unreliable

 *** there is no guarantee the JVM will sleep as long as you ask it to 
(particularly on virtual hardware)
 *** there is no guarantee that "background threads/logic" will be 
scheduled/finished during the "sleep"
 ** it is far better to add whatever {{@lucene.internal}} methods we need to 
"hook into" the core code from test code and have white-box / grey-box tests 
that ensure methods get called when we expect, ex:
 *** if we want to test that the user level configuration results in the 
appropriate values being set on the underlying objects, we should add public 
getter methods for those values to those classes, and have the test reach into 
the SolrCore to get those objects and assert the expected results on those 
methods (NOT just "wait" to see the code run and have the expected side effects)
 *** if we want to test that {{ConcurrentXXXCache.markAndSweep()}} gets called 
by the {{CleanupThread}} _eventually_ when maxIdle time is configured even if 
nothing calls {{wakeThread()}} then we should use a mock/subclass of the 
ConcurrentXXXCache that overrides {{markAndSweep()}} to set a latch that we can 
{{await(...)}} on from the test code.
 *** if we want to test that calls to {{ConcurrentXXXCache.markAndSweep()}} 
result in items being removed if their {{createTime}} is "too old" then we 
should add a special internal only version of 

[jira] [Reopened] (SOLR-13622) Add FileStream Streaming Expression

2019-08-07 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-13622:
-

StreamExpressionTest.testFileStreamDirectoryCrawl seems to make 
filesystem-specific assumptions that fail hard on Windows...

{noformat}
FAILED:  
org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testFileStreamDirectoryCrawl

Error Message:
expected: but was:

Stack Trace:
org.junit.ComparisonFailure: expected: but 
was:
at 
__randomizedtesting.SeedInfo.seed([92C40A8131F8CF7D:362DC46DFDF7A898]:0)
at org.junit.Assert.assertEquals(Assert.java:115)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.solr.client.solrj.io.stream.StreamExpressionTest.testFileStreamDirectoryCrawl(StreamExpressionTest.java:3128)

{noformat}
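
(The expected/actual strings got stripped from the report above, but if this is the usual '/' vs '\' separator mismatch, the typical fix is to build the expected values from {{java.nio.file.Path}} components instead of hard-coding separators -- an illustrative sketch, not the actual test code:)

{code:java}
import java.nio.file.Path;

public class PortablePathSketch {
  /**
   * Build an expected entry from Path components so the separator matches the
   * platform ('/' on Unix, '\' on Windows) instead of hard-coding "dir/file.txt".
   */
  static String expectedEntry(Path baseDir, String... segments) {
    Path p = baseDir;
    for (String segment : segments) {
      p = p.resolve(segment);
    }
    return baseDir.relativize(p).toString(); // e.g. "dir1\file.txt" on Windows
  }
}
{code}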

> Add FileStream Streaming Expression
> ---
>
> Key: SOLR-13622
> URL: https://issues.apache.org/jira/browse/SOLR-13622
> Project: Solr
>  Issue Type: New Feature
>  Components: streaming expressions
>Reporter: Joel Bernstein
>Assignee: Jason Gerlowski
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13622.patch, SOLR-13622.patch
>
>
> The FileStream will read files from a local filesystem and Stream back each 
> line of the file as a tuple.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13678) ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher

2019-08-02 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899236#comment-16899236
 ] 

Hoss Man commented on SOLR-13678:
-

AFAICT CollectionPropsWatcher isn't used internally by solr anywhere, so this 
issue will only impact solr clients that explicitly register their own watchers.
/cc [~tomasflobbe] & [~prusko] and linking to SOLR-11960 where this was 
introduced.

> ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent 
> zkCallback thread on props watcher
> --
>
> Key: SOLR-13678
> URL: https://issues.apache.org/jira/browse/SOLR-13678
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: collectionpropswatcher-deadlock-jstack.txt
>
>
> while investigating an (unrelated) test bug in CollectionPropsTest I 
> discovered a deadlock situation that can occur when calling 
> {{ZkStateReader.removeCollectionPropsWatcher()}} if a zkCallback thread tries 
> to concurrently fire the watchers set on the collection props.
> {{ZkStateReader.removeCollectionPropsWatcher()}} is itself called when a 
> {{CollectionPropsWatcher.onStateChanged()}} impl returns "true" -- meaning 
> that IIUC any usage of {{CollectionPropsWatcher}} could potentially result in 
> this type of deadlock situation. 
> {noformat}
> "TEST-CollectionPropsTest.testReadWriteCached-seed#[D3C6921874D1CFEB]" #15 
> prio=5 os_prio=0 cpu=567.78ms elapsed=682.12s tid=0x7
> fa5e8343800 nid=0x3f61 waiting for monitor entry  [0x7fa62d222000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.solr.common.cloud.ZkStateReader.lambda$removeCollectionPropsWatcher$20(ZkStateReader.java:2001)
> - waiting to lock <0xe6207500> (a 
> java.util.concurrent.ConcurrentHashMap)
> at 
> org.apache.solr.common.cloud.ZkStateReader$$Lambda$617/0x0001006c1840.apply(Unknown
>  Source)
> at 
> java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1932)
> - locked <0xeb9156b8> (a 
> java.util.concurrent.ConcurrentHashMap$Node)
> at 
> org.apache.solr.common.cloud.ZkStateReader.removeCollectionPropsWatcher(ZkStateReader.java:1994)
> at 
> org.apache.solr.cloud.CollectionPropsTest.testReadWriteCached(CollectionPropsTest.java:125)
> ...
> "zkCallback-88-thread-2" #213 prio=5 os_prio=0 cpu=14.06ms elapsed=672.65s 
> tid=0x7fa6041bf000 nid=0x402f waiting for monitor ent
> ry  [0x7fa5b8f39000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1923)
> - waiting to lock <0xeb9156b8> (a 
> java.util.concurrent.ConcurrentHashMap$Node)
> at 
> org.apache.solr.common.cloud.ZkStateReader$PropsNotification.(ZkStateReader.java:2262)
> at 
> org.apache.solr.common.cloud.ZkStateReader.notifyPropsWatchers(ZkStateReader.java:2243)
> at 
> org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.refreshAndWatch(ZkStateReader.java:1458)
> - locked <0xe6207500> (a 
> java.util.concurrent.ConcurrentHashMap)
> at 
> org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.process(ZkStateReader.java:1440)
> at 
> org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor.lambda$process$1(SolrZkClient.java:838)
> at 
> org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor$$Lambda$253/0x0001004a4440.run(Unknown
>  Source)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515)
> at 
> java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$140/0x000100308c40.run(Unknown
>  Source)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13678) ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher

2019-08-02 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13678:

Attachment: collectionpropswatcher-deadlock-jstack.txt
Status: Open  (was: Open)

attaching the full jstack output that i captured from observing this during a 
run of {{CollectionPropsTest.testReadWriteCached}} (ie: the source of the 
snippet included in the summary)

Please note that i captured this threaddump while in the process of testing 
some unrelated changes to other methods in {{CollectionPropsTest}} -- i believe 
all of my local changes to that test class at the time this thread dump was 
captured were to code that appeared farther down in the test file than any line 
numbers that might be mentioned in this threaddump, so all line numbers should 
be accurate on master circa ~ 52b5ec8068, but i'm not 100% certain.  the key 
thing to focus on is the line numbers and callstack for the non-test code ... 
i am 100% certain i had no local changes to 
{{CollectionPropsTest.testReadWriteCached}}, or any non-test code.

> ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent 
> zkCallback thread on props watcher
> --
>
> Key: SOLR-13678
> URL: https://issues.apache.org/jira/browse/SOLR-13678
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: collectionpropswatcher-deadlock-jstack.txt
>
>
> while investigating an (unrelated) test bug in CollectionPropsTest I 
> discovered a deadlock situation that can occur when calling 
> {{ZkStateReader.removeCollectionPropsWatcher()}} if a zkCallback thread tries 
> to concurrently fire the watchers set on the collection props.
> {{ZkStateReader.removeCollectionPropsWatcher()}} is itself called when a 
> {{CollectionPropsWatcher.onStateChanged()}} impl returns "true" -- meaning 
> that IIUC any usage of {{CollectionPropsWatcher}} could potentially result in 
> this type of deadlock situation. 
> {noformat}
> "TEST-CollectionPropsTest.testReadWriteCached-seed#[D3C6921874D1CFEB]" #15 
> prio=5 os_prio=0 cpu=567.78ms elapsed=682.12s tid=0x7
> fa5e8343800 nid=0x3f61 waiting for monitor entry  [0x7fa62d222000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.solr.common.cloud.ZkStateReader.lambda$removeCollectionPropsWatcher$20(ZkStateReader.java:2001)
> - waiting to lock <0xe6207500> (a 
> java.util.concurrent.ConcurrentHashMap)
> at 
> org.apache.solr.common.cloud.ZkStateReader$$Lambda$617/0x0001006c1840.apply(Unknown
>  Source)
> at 
> java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1932)
> - locked <0xeb9156b8> (a 
> java.util.concurrent.ConcurrentHashMap$Node)
> at 
> org.apache.solr.common.cloud.ZkStateReader.removeCollectionPropsWatcher(ZkStateReader.java:1994)
> at 
> org.apache.solr.cloud.CollectionPropsTest.testReadWriteCached(CollectionPropsTest.java:125)
> ...
> "zkCallback-88-thread-2" #213 prio=5 os_prio=0 cpu=14.06ms elapsed=672.65s 
> tid=0x7fa6041bf000 nid=0x402f waiting for monitor ent
> ry  [0x7fa5b8f39000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1923)
> - waiting to lock <0xeb9156b8> (a 
> java.util.concurrent.ConcurrentHashMap$Node)
> at 
> org.apache.solr.common.cloud.ZkStateReader$PropsNotification.(ZkStateReader.java:2262)
> at 
> org.apache.solr.common.cloud.ZkStateReader.notifyPropsWatchers(ZkStateReader.java:2243)
> at 
> org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.refreshAndWatch(ZkStateReader.java:1458)
> - locked <0xe6207500> (a 
> java.util.concurrent.ConcurrentHashMap)
> at 
> org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.process(ZkStateReader.java:1440)
> at 
> org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor.lambda$process$1(SolrZkClient.java:838)
> at 
> org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor$$Lambda$253/0x0001004a4440.run(Unknown
>  Source)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515)
> at 
> java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$140/0x000100308c40.run(Unknown
>  Source)
> at 
> 

[jira] [Created] (SOLR-13678) ZkStateReader.removeCollectionPropsWatcher can deadlock with concurrent zkCallback thread on props watcher

2019-08-02 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13678:
---

 Summary: ZkStateReader.removeCollectionPropsWatcher can deadlock 
with concurrent zkCallback thread on props watcher
 Key: SOLR-13678
 URL: https://issues.apache.org/jira/browse/SOLR-13678
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man


while investigating an (unrelated) test bug in CollectionPropsTest I discovered 
a deadlock situation that can occur when calling 
{{ZkStateReader.removeCollectionPropsWatcher()}} if a zkCallback thread tries 
to concurrently fire the watchers set on the collection props.

{{ZkStateReader.removeCollectionPropsWatcher()}} is itself called when a 
{{CollectionPropsWatcher.onStateChanged()}} impl returns "true" -- meaning that 
IIUC any usage of {{CollectionPropsWatcher}} could potentially result in this 
type of deadlock situation. 

{noformat}
"TEST-CollectionPropsTest.testReadWriteCached-seed#[D3C6921874D1CFEB]" #15 
prio=5 os_prio=0 cpu=567.78ms elapsed=682.12s tid=0x7
fa5e8343800 nid=0x3f61 waiting for monitor entry  [0x7fa62d222000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.solr.common.cloud.ZkStateReader.lambda$removeCollectionPropsWatcher$20(ZkStateReader.java:2001)
- waiting to lock <0xe6207500> (a 
java.util.concurrent.ConcurrentHashMap)
at 
org.apache.solr.common.cloud.ZkStateReader$$Lambda$617/0x0001006c1840.apply(Unknown
 Source)
at 
java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1932)
- locked <0xeb9156b8> (a 
java.util.concurrent.ConcurrentHashMap$Node)
at 
org.apache.solr.common.cloud.ZkStateReader.removeCollectionPropsWatcher(ZkStateReader.java:1994)
at 
org.apache.solr.cloud.CollectionPropsTest.testReadWriteCached(CollectionPropsTest.java:125)

...

"zkCallback-88-thread-2" #213 prio=5 os_prio=0 cpu=14.06ms elapsed=672.65s 
tid=0x7fa6041bf000 nid=0x402f waiting for monitor ent
ry  [0x7fa5b8f39000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
java.util.concurrent.ConcurrentHashMap.compute(java.base@11.0.3/ConcurrentHashMap.java:1923)
- waiting to lock <0xeb9156b8> (a 
java.util.concurrent.ConcurrentHashMap$Node)
at 
org.apache.solr.common.cloud.ZkStateReader$PropsNotification.(ZkStateReader.java:2262)
at 
org.apache.solr.common.cloud.ZkStateReader.notifyPropsWatchers(ZkStateReader.java:2243)
at 
org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.refreshAndWatch(ZkStateReader.java:1458)
- locked <0xe6207500> (a java.util.concurrent.ConcurrentHashMap)
at 
org.apache.solr.common.cloud.ZkStateReader$PropsWatcher.process(ZkStateReader.java:1440)
at 
org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor.lambda$process$1(SolrZkClient.java:838)
at 
org.apache.solr.common.cloud.SolrZkClient$ProcessWatchWithExecutor$$Lambda$253/0x0001004a4440.run(Unknown
 Source)
at 
java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515)
at 
java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$140/0x000100308c40.run(Unknown
 Source)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)

{noformat}
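
(Schematically, the two stacks above are a classic lock-order inversion: the test thread holds a map node's lock (from compute()) and is waiting on the other map's monitor, while the zkCallback thread holds that monitor and is waiting on the node.  A plain-Java sketch of that shape, not Solr's actual code:)

{code:java}
public class DeadlockShapeSketch {
  // lockA stands in for the watchers map's monitor, lockB for the ConcurrentHashMap
  // node being compute()d on; the two threads acquire them in opposite orders.
  private final Object lockA = new Object();
  private final Object lockB = new Object();

  void removeWatcher() {     // "test thread": takes B (node) then wants A (monitor)
    synchronized (lockB) {
      synchronized (lockA) {
        // remove the watcher entry
      }
    }
  }

  void notifyWatchers() {    // "zkCallback thread": takes A (monitor) then wants B (node)
    synchronized (lockA) {
      synchronized (lockB) {
        // build the props notification
      }
    }
  }
}
{code}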



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-08-01 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

   Resolution: Fixed
Fix Version/s: 8.3
   master (9.0)
   Status: Resolved  (was: Patch Available)

>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, 
> SOLR-13664.patch
>
>
> Prior to Solr 8.3, the javadocs for {{SolrTestCaseJ4.deleteCore()}} said that 
> that method would delete the dataDir used by {{initCore()}} in spite of that 
> method not actually doing anything to clean up the dataDir for a very long 
> time (exactly when the bug was introduced is not known)
> For that reason, in most solr versions up to and including 8.2, tests that 
> called combinations of {{initCore()}} / {{deleteCore()}} within a single test 
> class would see the data from a previous core polluting the data of a newly 
> introduced core.
> As part of this jira, this bug was fixed, by updating {{deleteCore()}} to 
> "reset" the value of the {{initCoreDataDir}} variable to null, so that it 
> can/will be re-initialized on the next call to either {{initCore()}} or the 
> lower level {{createCore()}}. 
> Existing tests that refer to the {{initCoreDataDir}} directly (either before, 
> or during the lifecycle of an active core managed via {{initCore()}} / 
> {{deleteCore()}} ) may encounter {{NullPointerExceptions}} on upgrading to 
> Solr 8.3 as a result of this bug fix.  These tests are encouraged to use the 
> new helper method {{initAndGetDataDir()}} in place of referring directly to 
> the (now deprecated) {{initCoreDataDir}} variable.
> Any existing tests that refer to the {{initCoreDataDir}} directly *after* 
> calling {{deleteCore()}} with the intention of inspecting the index contents 
> after shutdown, will need to be modified to preserve the results of calling 
> {{initAndGetDataDir()}} into a new variable for such introspection – the 
> actual contents of the directory will not be removed until the full lifecycle 
> of the test class is complete (see {{LuceneTestCase.createTempDir()}})
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-08-01 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

Description: 
Prior to Solr 8.3, the javadocs for {{SolrTestCaseJ4.deleteCore()}} said that 
that method would delete the dataDir used by {{initCore()}} in spite of that 
method not actually doing anything to clean up the dataDir for a very long time 
(exactly when the bug was introduced is not known)

For that reason, in most solr versions up to and including 8.2, tests that 
called combinations of {{initCore()}} / {{deleteCore()}} within a single test 
class would see the data from a previous core polluting the data of a newly 
introduced core.

As part of this jira, this bug was fixed, by updating {{deleteCore()}} to 
"reset" the value of the {{initCoreDataDir}} variable to null, so that it 
can/will be re-initialized on the next call to either {{initCore()}} or the 
lower level {{createCore()}}. 

Existing tests that refer to the {{initCoreDataDir}} directly (either before, 
or during the lifecycle of an active core managed via {{initCore()}} / 
{{deleteCore()}} ) may encounter {{NullPointerExceptions}} on upgrading to Solr 
8.3 as a result of this bug fix.  These tests are encouraged to use the new 
helper method {{initAndGetDataDir()}} in place of referring directly to the (now 
deprecated) {{initCoreDataDir}} variable.

Any existing tests that refer to the {{initCoreDataDir}} directly *after* 
calling {{deleteCore()}} with the intention of inspecting the index contents 
after shutdown, will need to be modified to preserve the results of calling 
{{initAndGetDataDir()}} into a new variable for such introspection – the actual 
contents of the directory will not be removed until the full lifecycle of the 
test class is complete (see {{LuceneTestCase.createTempDir()}})
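
(A hedged example of that migration -- the config/schema names below are placeholders -- capture the directory via {{initAndGetDataDir()}} while the core is still alive if you need to inspect it after {{deleteCore()}}:)

{code:java}
import java.io.File;

import org.apache.solr.SolrTestCaseJ4;
import org.junit.Test;

public class DataDirAfterDeleteCoreSketch extends SolrTestCaseJ4 {
  @Test
  public void testInspectIndexAfterShutdown() throws Exception {
    initCore("solrconfig.xml", "schema.xml");   // placeholder config/schema names
    File dataDir = initAndGetDataDir();         // capture while the core is alive
    // ... index some docs, do the real work of the test ...
    deleteCore();                               // nulls the deprecated initCoreDataDir
    // the captured directory sticks around until class-level temp dir cleanup
    assertTrue(new File(dataDir, "index").exists());
  }
}
{code}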

 

  was:

In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
nothing to delete the dataDir used by the TestHarness

The git history is a bit murky, so i'm not entirely certain when this stopped 
working, but I suspect it happened as part of the overall cleanup regarding 
test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
TestRuleTemporaryFilesCleanup}}

While this is not problematic in many test classes, where a single 
{{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses that 
SolrCore for all test methods and relies on {{@AfterClass 
SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's problematic 
in test classes where {{deleteCore()}} is explicitly called in an {{@After}} 
method to ensure a unique core (w/unique dataDir) is used for each test method.

(there are currently about 61 tests that call {{deleteCore()}} directly)



updated jira summary to be more helpful to users who may find this jira via 
CHANGES.txt pointer and need more information on how it affects them if they 
have their own custom tests using SolrTestCaseJ4.

>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, 
> SOLR-13664.patch
>
>
> Prior to Solr 8.3, the javadocs for {{SolrTestCaseJ4.deleteCore()}} said that 
> that method would delete the dataDir used by {{initCore()}} in spite of that 
> method not actually doing anything to clean up the dataDir for a very long 
> time (exactly when the bug was introduced is not known)
> For that reason, in most solr versions up to and including 8.2, tests that 
> called combinations of {{initCore()}} / {{deleteCore()}} within a single test 
> class would see the data from a previous core polluting the data of a newly 
> introduced core.
> As part of this jira, this bug was fixed, by updating {{deleteCore()}} to 
> "reset" the value of the {{initCoreDataDir}} variable to null, so that it 
> can/will be re-initialized on the next call to either {{initCore()}} or the 
> lower level {{createCore()}}. 
> Existing tests that refer to the {{initCoreDataDir}} directly (either before, 
> or during the lifecycle of an active core managed via {{initCore()}} / 
> {{deleteCore()}} ) may encounter {{NullPointerExceptions}} on upgrading to 
> Solr 8.3 as a result of this bug fix.  These tests are encouraged to use the 
> new helper method {{initAndGetDataDir()}} in place of referring directly to 
> the (now deprecated) {{initCoreDataDir}} variable.
> Any existing tests that refer to the {{initCoreDataDir}} directly *after* 
> calling {{deleteCore()}} with the intention of inspecting the 

[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-31 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

Status: Patch Available  (was: Open)

>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, 
> SOLR-13664.patch
>
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
> nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so i'm not entirely certain when this stopped 
> working, but I suspect it happened as part of the overall cleanup regarding 
> test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single 
> {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses 
> that SolrCore for all test methods and relies on {{@AfterClass 
> SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's 
> problematic in test classes where {{deleteCore()}} is explicitly called in an 
> {{@After}} method to ensure a unique core (w/unique dataDir) is used for each 
> test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-31 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

Attachment: SOLR-13664.patch
Status: Open  (was: Open)

Updated patch to:
 * update all remaining tests that still referred to (the now deprecated 
{{initCoreDataDir}} ) to either use {{initAndGetDataDir()}} or just use 
{{createTempDir()}} when their usage never had any reason to re-use the 
{{initCore()}} dataDir anyway
 * fix a few precommit issues (unused imports).

I'm still testing, but i think this is ready...

>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch, 
> SOLR-13664.patch
>
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
> nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so i'm not entirely certain when this stopped 
> working, but I suspect it happened as part of the overall cleanup regarding 
> test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single 
> {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses 
> that SolrCore for all test methods and relies on {{@AfterClass 
> SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's 
> problematic in test classes where {{deleteCore()}} is explicitly called in an 
> {{@After}} method to ensure a unique core (w/unique dataDir) is used for each 
> test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-31 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897525#comment-16897525
 ] 

Hoss Man edited comment on SOLR-13664 at 7/31/19 9:03 PM:
--

Testing of the last patch uncovered 3 classes of problems in our existing tests:

# tests trying to call {{FileUtils.deleteDirectory(initCoreDataDir);}} after 
calling {{deleteCore()}} (specifically to work around this bug!) that now get 
NPE
#* Example: TestRecovery
# tests that need to be able to write files to {{initCoreDataDir}} *before* 
calling {{initCore()}} _and_ call {{initCore()}} + {{deleteCore()}} in the 
individual test method life cycle
#* the patch already ensures {{initCoreDataDir}} is created before the subclass 
is initialized, so tests that just used a single {{initCore()}} call for all 
test methods would be fine -- it's only tests that also {{deleteCore()}} in 
{{@After}} methods that are problems
#* Example: QueryElevationComponentTest, SolrCoreCheckLockOnStartupTest
# one test that doesn't even use {{initCore()}} -- it builds it's own 
TestHarness/CoreContainer using {{initCoreDataDir}} directly -- but like #2, 
calls {{deleteCore()}} in {{@After}} methods (to leverate the common cleanup of 
the TestHarness)
#* Example: SolrMetricsIntegrationTest


Based on these classes of problems, I think the best way forward is to update 
the existing patch to:

* make the {{initAndGetDataDir()}} private helper method I introduced in the 
last patch public, change it to return the {{File}} (not String), and beef up 
its javadocs
* deprecate {{initCoreDataDir}} and change all existing direct uses of it in 
our tests to use the helper method

This makes fixing the existing test problems trivial: just replace all uses of 
{{initCoreDataDir}} with {{initAndGetDataDir()}} ... any logic attempting to 
seed/inspect the dataDir prior to {{initCore()}} will initialize the directory 
that will be used by the next {{initCore()}} call.
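
For illustration, usage after this change looks roughly like the following (a 
minimal sketch: the test class name, the config/schema file names, and the 
seeded file are all made up for the example, not anything in the patch)...

{code:java}
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.solr.SolrTestCaseJ4;
import org.junit.After;
import org.junit.Test;

public class SeedDataDirExampleTest extends SolrTestCaseJ4 {

  @After
  public void cleanup() throws Exception {
    deleteCore(); // each test method gets its own core + dataDir
  }

  @Test
  public void testWithSeededDataDir() throws Exception {
    // instead of touching the deprecated initCoreDataDir field directly, ask
    // for the directory that the *next* initCore() call will use...
    File dataDir = initAndGetDataDir();
    Files.write(new File(dataDir, "external_foo.txt").toPath(),
                "1=42".getBytes(StandardCharsets.UTF_8));

    // ...then init the core; it picks up the seeded dataDir
    initCore("solrconfig.xml", "schema.xml");
    // ... test logic that relies on the seeded file ...
  }
}
{code}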



I've attached an updated patch with these changes, but! ... for completeness, I 
think it's important to also consider how this will impact any existing 
third-party tests downstream users may have written that subclass 
{{SolrTestCaseJ4}}:

* if they don't refer to {{initCoreDataDir}} directly, then like the existing 
patch the only change in behavior they should notice is that if their tests 
call {{deleteCore()}} any subsequent {{initCore()}} calls won't be polluted 
with the old data.
* if they do refer to {{initCoreDataDir}} directly in tests, then that usage 
_may_ continue to work as is if the usage is only "between" calls to 
{{initCore()}} and {{deleteCore()}} (ie: to inspect the data dir)
* if they attempt to use {{initCoreDataDir}} _after_ calling {{deleteCore()}} 
(either directly, or indirectly by referencing it before a call to 
{{initCore()}} in a test lifecycle that involves multiple {{initCore()}} + 
{{deleteCore()}} pairs) then they will start getting NPEs and will need to 
change their test to use {{initAndGetDataDir()}} directly.

I think the tradeoff of fixing this bug vs the impact on end users is worth 
making this change: right now the bug can silently affect users w/weird 
results, but any tests that are impacted adversely by this change will trigger 
loud NPEs and have an easy fix we can mention in the upgrade notes.




was (Author: hossman):

Testing of the last patch uncovered 3 classes of problems in our existing tests:

# tests trying to call {{FileUtils.deleteDirectory(initCoreDataDir);}} after 
calling {{deleteCore()}} (specifically to work around this bug!) that now get 
an NPE
#* Example: TestRecovery
# tests that need to be able to write files to {{initCoreDataDir}} *before* 
calling {{initCore()}} _and_ call {{initCore()}} + {{deleteCore()}} in the 
individual test method life cycle
#* the patch already ensures {{initCoreDataDir}} is created before the subclass 
is initialized, so tests that just used a single {{initCore()}} call for all 
test methods would be fine -- it's only tests that also {{deleteCore()}} in 
{{@After}} methods that are problems
#* Example: QueryElevationComponentTest, SolrCoreCheckLockOnStartupTest
* one test that doesn't even use {{initCore()}} -- it builds its own 
TestHarness/CoreContainer using {{initCoreDataDir}} directly -- but like #2, 
calls {{deleteCore()}} in {{@After}} methods (to leverage the common cleanup of 
the TestHarness)
** Example: SolrMetricsIntegrationTest


Based on these classes of problems, I think the best way forward is to update 
the existing patch to:

* make the {{initAndGetDataDir()}} private helper method I introduced in the 
last patch public, change it to return the {{File}} (not String), and beef up 
its javadocs
* deprecate {{initCoreDataDir}} and change all existing direct uses of it in 
our tests to use the helper method

This makes fixing the existing test problems trivial: just replace all uses of 
{{initCoreDataDir}} with {{initAndGetDataDir()}} ...

[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-31 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

Attachment: SOLR-13664.patch
Status: Open  (was: Open)


Testing of the last patch uncovered 3 classes of problems in our existing tests:

# tests trying to call {{FileUtils.deleteDirectory(initCoreDataDir);}} after 
calling {{deleteCore()}} (specifically to work around this bug!) that now get 
an NPE
#* Example: TestRecovery
# tests that need to be able to write files to {{initCoreDataDir}} *before* 
calling {{initCore()}} _and_ call {{initCore()}} + {{deleteCore()}} in the 
individual test method life cycle
#* the patch already ensures {{initCoreDataDir}} is created before the subclass 
is initialized, so tests that just used a single {{initCore()}} call for all 
test methods would be fine -- it's only tests that also {{deleteCore()}} in 
{{@After}} methods that are problems
#* Example: QueryElevationComponentTest, SolrCoreCheckLockOnStartupTest
* one test that doesn't even use {{initCore()}} -- it builds its own 
TestHarness/CoreContainer using {{initCoreDataDir}} directly -- but like #2, 
calls {{deleteCore()}} in {{@After}} methods (to leverage the common cleanup of 
the TestHarness)
** Example: SolrMetricsIntegrationTest


Based on these classes of problems, I think the best way forward is to update 
the existing patch to:

* make the {{initAndGetDataDir()}} private helper method I introduced in the 
last patch public, change it to return the {{File}} (not String), and beef up 
its javadocs
* deprecate {{initCoreDataDir}} and change all existing direct uses of it in 
our tests to use the helper method

This makes fixing the existing test problems trivial: just replace all uses of 
{{initCoreDataDir}} with {{initAndGetDataDir()}} ... any logic attempting to 
seed/inspect the dataDir prior to {{initCore()}} will initialize the directory 
that will be used by the next {{initCore()}} call.



I've attached an updated patch with these changes, but! ... for completeness, I 
think it's important to also consider how this will impact any existing 
third-party tests downstream users may have written that subclass 
{{SolrTestCaseJ4}}:

* if they don't refer to {{initCoreDataDir}} directly, then like the existing 
patch the only change in behavior they should notice is that if their tests 
call {{deleteCore()}} any subsequent {{initCore()}} calls won't be polluted 
with the old data.
* if they do refer to {{initCoreDataDir}} directly in tests, then that usage 
_may_ continue to work as is if the usage is only "between" calls to 
{{initCore()}} and {{deleteCore()}} (ie: to inspect the data dir)
* if they attempt to use {{initCoreDataDir}} _after_ calling {{deleteCore()}} 
(either directly, or indirectly by referencing it before a call to 
{{initCore()}} in a test lifecycle that involves multiple {{initCore()}} + 
{{deleteCore()}} pairs) then they will start getting NPEs and will need to 
change their test to use {{initAndGetDataDir()}} directly.

I think the tradeoff of fixing this bug vs the impact on end users is worth 
making this change: right now the bug can silently affect users w/weird 
results, but any tests that are impacted adversely by this change will trigger 
loud NPEs and have an easy fix we can mention in the upgrade notes.



>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13664.patch, SOLR-13664.patch, SOLR-13664.patch
>
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
> nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so i'm not entirely certain when this stopped 
> working, but I suspect it happened as part of the overall cleanup regarding 
> test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single 
> {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses 
> that SolrCore for all test methods and relies on {{@AfterClass 
> SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's 
> problematic in test classes where {{deleteCore()}} is explicitly called in an 
> {{@After}} method to ensure a unique core (w/unique dataDir) is used for each 
> test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (SOLR-13579) Create resource management API

2019-07-31 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897347#comment-16897347
 ] 

Hoss Man commented on SOLR-13579:
-


bq. We could perhaps call a type-safe and name-safe component API from a 
generic management API by following a similar convention as the one used in 
SolrPluginUtils.invokeSetters? Or use marker interfaces that also provide 
validation / conversion. I'll look into this.

Unless there's something i'm missing (and that's incredibly likely) I don't 
even think you'd need a SolrPluginUtils.invokeSetters type hack for any of this 
-- except maybe mapping REST commands in the ResourceManagerHandler to methods 
in the ResourceManagerPlugins?

what i was imagining was a more straightforward subclass/subinterface 
relationship and using generics to tightly couple the ManagedComponent impls to 
the corresponding ResourceManagerPlugins -- so the plugins could have 
completely statically typed APIs for calling methods on the Components.  ala...

{code}
public interface ManagedComponent {
  ManagedComponentId getManagedComponentId();
  ...
}

public abstract class ResourceManagerPlugin<T extends ManagedComponent> {
  /** if needed by ResourceManagerHandler or metrics */
  public abstract void setResourceLimits(ManagedComponentId component, Map<String, Object> limits);
  /** if needed by ResourceManagerHandler or metrics */
  public abstract Map<String, Object> getResourceLimits(ManagedComponentId component);
  ...
  // other general API methods needed for linking/registering type "T" components
  // (or Pool) and for "managing" all of them...
  ...
}

public interface ManagedCacheComponent extends ManagedComponent {
  // actual caches implement this, and only have to worry about type specific methods
  // for managing their resource related settings -- nothing about the REST API...
  public void setMaxSize(long size);
  public void setMaxRamMB(int maxRamMB);
  public long getMaxSize();
  public int getMaxRamMB();
}

public class CacheManagerPlugin extends ResourceManagerPlugin<ManagedCacheComponent> {
  // concrete impls like this can use the statically typed get/set methods of the concrete
  // ManagedComponent impls in their getResourceLimits/setResourceLimits & manage methods
  ...
}
{code}
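
...and just to fill in that last stub a little (again: purely hypothetical -- 
the {{getComponent(...)}} lookup helper and the literal "maxSize"/"maxRamMB" 
limit keys are assumptions for the sketch, not anything in the patch):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class CacheManagerPlugin extends ResourceManagerPlugin<ManagedCacheComponent> {

  @Override
  public void setResourceLimits(ManagedComponentId componentId, Map<String, Object> limits) {
    // getComponent(...) assumed to be one of the generic linking/registering
    // methods on ResourceManagerPlugin<T>, returning T
    ManagedCacheComponent cache = getComponent(componentId);
    // statically typed calls -- no invokeSetters style reflection needed
    if (limits.containsKey("maxSize")) {
      cache.setMaxSize(((Number) limits.get("maxSize")).longValue());
    }
    if (limits.containsKey("maxRamMB")) {
      cache.setMaxRamMB(((Number) limits.get("maxRamMB")).intValue());
    }
  }

  @Override
  public Map<String, Object> getResourceLimits(ManagedComponentId componentId) {
    ManagedCacheComponent cache = getComponent(componentId);
    Map<String, Object> limits = new HashMap<>();
    limits.put("maxSize", cache.getMaxSize());
    limits.put("maxRamMB", cache.getMaxRamMB());
    return limits;
  }
}
{code}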



> Create resource management API
> --
>
> Key: SOLR-13579
> URL: https://issues.apache.org/jira/browse/SOLR-13579
> Project: Solr
>  Issue Type: New Feature
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Major
> Attachments: SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, 
> SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch
>
>
> Resource management framework API supporting the goals outlined in SOLR-13578.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-30 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

Attachment: SOLR-13664.patch
Status: Open  (was: Open)


Here's an updated patch with a fix that i _think_ is good, but i'm still in the 
process of testing and I want to spend some more time thinking through possible 
ramifications on third party subclasses.

The basic idea is that {{deleteCore()}} now nulls out the {{initCoreDataDir}} 
variable -- w/o doing any actual IO deletion.  We still trust/rely on 
{{TestRuleTemporaryFilesCleanup}} to do its job of deleting these temp dirs if 
the test succeeds.  Any place in {{SolrTestCaseJ4}} that currently depends on 
{{initCoreDataDir}} being set now uses a private helper method to ensure it's 
initialized.
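
In code terms the idea is roughly the following (a simplified sketch, not the 
literal patch -- the helper name, its exact signature, and the stand-in class 
name are just for illustration):

{code:java}
import java.io.File;
import org.apache.lucene.util.LuceneTestCase;

// sketch of the relevant bits inside SolrTestCaseJ4 (stand-in class name)
public abstract class SolrTestCaseJ4Sketch extends LuceneTestCase {

  protected static volatile File initCoreDataDir;

  // every internal use of the dataDir goes through this lazy-init helper
  private static File initAndGetDataDir() {
    File dataDir = initCoreDataDir;
    if (null == dataDir) {
      // TestRuleTemporaryFilesCleanup still owns deleting this dir when the test succeeds
      dataDir = initCoreDataDir = createTempDir("init-core-data").toFile();
    }
    return dataDir;
  }

  public static void deleteCore() {
    // ... existing TestHarness / SolrCore cleanup ...
    // no IO deletion here: just forget the old dir so the next initCore() starts fresh
    initCoreDataDir = null;
  }
}
{code}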





>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13664.patch, SOLR-13664.patch
>
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
> nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so i'm not entirely certain when this stopped 
> working, but I suspect it happened as part of the overall cleanup regarding 
> test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single 
> {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses 
> that SolrCore for all test methods and relies on {{@AfterClass 
> SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's 
> problematic in test classes where {{deleteCore()}} is explicitly called in an 
> {{@After}} method to ensure a unique core (w/unique dataDir) is used for each 
> test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-30 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13664:

Attachment: SOLR-13664.patch
Status: Open  (was: Open)

The attached patch doesn't fix the problem – still thinking about the best 
solution to move forward – but it does trivially demonstrate this problem in a 
new test. It also updates {{TestUseDocValuesAsStored}} to include a sanity 
check against this problem.
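
For context, the shape of the per-method lifecycle that trips over this bug 
looks something like the following (a hypothetical sketch, not the actual test 
added by the patch; class and file names are made up):

{code:java}
import org.apache.solr.SolrTestCaseJ4;
import org.junit.After;
import org.junit.Test;

public class DataDirBleedExampleTest extends SolrTestCaseJ4 {

  @After
  public void cleanup() throws Exception {
    deleteCore(); // supposed to give each test method a pristine core + dataDir
  }

  private void addOneDocAndCheck(String id) throws Exception {
    initCore("solrconfig.xml", "schema.xml");
    assertU(adoc("id", id));
    assertU(commit());
    // if deleteCore() silently leaves the old dataDir in place, docs from the
    // previously run test method bleed in here and numFound climbs past 1
    assertQ(req("q", "*:*"), "//result[@numFound='1']");
  }

  @Test
  public void testFirstMethod() throws Exception { addOneDocAndCheck("1"); }

  @Test
  public void testSecondMethod() throws Exception { addOneDocAndCheck("2"); }
}
{code}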

A weird {{TestUseDocValuesAsStored}} jenkins failure is how i discovered this 
in the first place...

apache_Lucene-Solr-Tests-8.2_34.log.txt
{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestUseDocValuesAsStored -Dtests.method=testDuplicateMultiValued 
-Dtests.seed=69AC8730651B9CCD -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=ja -Dtests.timezone=America/Argentina/ComodRivadavia 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   1.13s J0 | 
TestUseDocValuesAsStored.testDuplicateMultiValued <<<
   [junit4]> Throwable #1: java.lang.RuntimeException: Exception during 
query
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([69AC8730651B9CCD:87719310ABAA6A71]:0)
   [junit4]>at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:947)
   [junit4]>at 
org.apache.solr.schema.TestUseDocValuesAsStored.doTest(TestUseDocValuesAsStored.java:367)
   [junit4]>at 
org.apache.solr.schema.TestUseDocValuesAsStored.testDuplicateMultiValued(TestUseDocValuesAsStored.java:172)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
   [junit4]> Caused by: java.lang.RuntimeException: REQUEST FAILED: 
xpath=//arr[@name='enums_dvo']/str[.='Not Available']
   [junit4]>xml response was: 
   [junit4]> 
   [junit4]> 00xyzmyid1XY2XXY3XY4-66642425-66.6664.24.26-6664204207-6.E-50.00420.004281999-12-31T23:59:59Z2016-07-04T03:02:01Z2016-07-04T03:02:01Z
   [junit4]> 
   [junit4]>request was:q=*:*=*=xml
   [junit4]>at 
org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:940)

{noformat}
...what's happening here is that docs from previous test methods in this class 
(that should have been using their own distinct cores + dataDirs) are bleeding 
into this test, causing the doc the test is checking for to be pushed out 
past the {{rows=10}} results. (note the {{numFound="11"}})

>  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
> --
>
> Key: SOLR-13664
> URL: https://issues.apache.org/jira/browse/SOLR-13664
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13664.patch
>
>
> In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
> nothing to delete the dataDir used by the TestHarness
> The git history is a bit murky, so i'm not entirely certain when this stopped 
> working, but I suspect it happened as part of the overall cleanup regarding 
> test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
> TestRuleTemporaryFilesCleanup}}
> While this is not problematic in many test classes, where a single 
> {{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses 
> that SolrCore for all test methods and relies on {{@AfterClass 
> SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's 
> problematic in test classes where {{deleteCore()}} is explicitly called in an 
> {{@After}} method to ensure a unique core (w/unique dataDir) is used for each 
> test method.
> (there are currently about 61 tests that call {{deleteCore()}} directly)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13664) SolrTestCaseJ4.deleteCore() does not delete/clean dataDir

2019-07-30 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13664:
---

 Summary:  SolrTestCaseJ4.deleteCore() does not delete/clean dataDir
 Key: SOLR-13664
 URL: https://issues.apache.org/jira/browse/SOLR-13664
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Hoss Man



In spite of what its javadocs say, {{SolrTestCaseJ4.deleteCore()}} does 
nothing to delete the dataDir used by the TestHarness

The git history is a bit murky, so i'm not entirely certain when this stopped 
working, but I suspect it happened as part of the overall cleanup regarding 
test temp dirs and the use of {{LuceneTestCase.createTempDir(...) -> 
TestRuleTemporaryFilesCleanup}}

While this is not problematic in many test classes, where a single 
{{initCore(...)}} is called in a {{@BeforeClass}} and the test then re-uses that 
SolrCore for all test methods and relies on {{@AfterClass 
SolrTestCaseJ4.teardownTestCases()}} to call {{deleteCore()}}, it's problematic 
in test classes where {{deleteCore()}} is explicitly called in an {{@After}} 
method to ensure a unique core (w/unique dataDir) is used for each test method.

(there are currently about 61 tests that call {{deleteCore()}} directly)




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken

2019-07-30 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13660:

   Resolution: Fixed
Fix Version/s: 8.3
   master (9.0)
   Status: Resolved  (was: Patch Available)

> AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
> -
>
> Key: SOLR-13660
> URL: https://issues.apache.org/jira/browse/SOLR-13660
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13660.patch
>
>
> {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, 
> and does not actually check that the replicas are active.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken

2019-07-29 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13660:

Attachment: SOLR-13660.patch
Status: Open  (was: Open)



Although this method is not used directly in many Solr tests that subclass 
{{AbstractFullDistribZkTestBase}}, it is used by other methods in 
{{AbstractFullDistribZkTestBase}} -- including when creating the 
{{DEFAULT_COLLECTION}}.

Because of the esoteric way {{AbstractFullDistribZkTestBase}} initializes its 
collections (and jetty instances), almost every replica created starts in 
recovery -- so as a result of this bug, subclasses may frequently see their 
test methods being invoked before the expected number of shards/replicas exist.

In at least one case (TestCloudSchemaless) this has led to test failures 
(ultimately due to requests timing out when trying to add documents) as a 
result of test client operations competing with multiple concurrent replica 
recoveries on CPU constrained jenkins machines.



The attached patch:

* fixes {{waitForActiveReplicaCount(...)}} to check that the replicas are active 
(see the rough sketch below)
* deprecates and updates the javadocs of {{getTotalReplicas(...)}} to make it 
clear that this method doesn't care about the status of the replica.
** this method was formerly used by {{waitForActiveReplicaCount(...)}}
* also makes some related fixes to {{createJettys(...)}}:
** adds some comments clarifying how this method initializes the shards vs 
adding the replicas
** improves the initial slice count check to use existing helper methods which 
also verify the slices are active
*** this doesn't really affect the correctness of the method given how the 
collection is used at this point, but helps simplify the code.
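
Rough sketch of what the "active" check amounts to (simplified: the real patch 
works against the existing ZkStateReader/timeout plumbing in 
{{AbstractFullDistribZkTestBase}}, and the helper class/method names here are 
made up):

{code:java}
import java.util.Set;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class ActiveReplicaCounter {
  /** count replicas that are both marked ACTIVE and hosted on a live node */
  public static int countActiveReplicas(ClusterState clusterState, String collection) {
    final Set<String> liveNodes = clusterState.getLiveNodes();
    final DocCollection docCollection = clusterState.getCollectionOrNull(collection);
    if (null == docCollection) {
      return 0;
    }
    int active = 0;
    for (Replica replica : docCollection.getReplicas()) {
      // a replica that isn't ACTIVE, or whose node isn't live, doesn't count
      if (replica.getState() == Replica.State.ACTIVE
          && liveNodes.contains(replica.getNodeName())) {
        active++;
      }
    }
    return active;
  }
}
{code}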




> AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
> -
>
> Key: SOLR-13660
> URL: https://issues.apache.org/jira/browse/SOLR-13660
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13660.patch
>
>
> {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, 
> and does not actually check that the replicas are active.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken

2019-07-29 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13660:

Status: Patch Available  (was: Open)

> AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken
> -
>
> Key: SOLR-13660
> URL: https://issues.apache.org/jira/browse/SOLR-13660
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-13660.patch
>
>
> {{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, 
> and does not actually check that the replicas are active.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13660) AbstractFullDistribZkTestBase.waitForActiveReplicaCount is broken

2019-07-29 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13660:
---

 Summary: AbstractFullDistribZkTestBase.waitForActiveReplicaCount 
is broken
 Key: SOLR-13660
 URL: https://issues.apache.org/jira/browse/SOLR-13660
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
Assignee: Hoss Man



{{AbstractFullDistribZkTestBase.waitForActiveReplicaCount(...)}} is broken, and 
does not actually check that the replicas are active.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

2019-07-26 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-13599.
-
Resolution: Cannot Reproduce

not a single jenkins failure in this test since backporting the logging 
additions to branch_8x on July 8.

doesn't seem like there is much more we can do here

> ReplicationFactorTest high failure rate on Windows jenkins VMs after 
> 2019-06-22 OS/java upgrades
> 
>
> Key: SOLR-13599
> URL: https://issues.apache.org/jira/browse/SOLR-13599
> Project: Solr
>  Issue Type: Bug
>Reporter: Hoss Man
>Priority: Major
> Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
>
>
> We've started seeing some weirdly consistent (but not reliably reproducible) 
> failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
> machines.
> The failures all seem to have started on June 22 -- when Uwe upgraded his 
> Windows VMs to upgrade the Java version, but happen across all versions of 
> java tested, and on both the master and branch_8x.
> While this test failed a total of 5 times, in different ways, on various 
> jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on 
> all but 1 or 2 of Uwe's "Windows" jenkins builds since that 2019-06-22, and 
> when it fails the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins 
> builds frequently fails anywhere from 1-4 additional times.
> All of these failures occur in the exact same place, with the exact same 
> assertion: that the expected replicationFactor of 2 was not achieved, and an 
> rf=1 (ie: only the master) was returned, when sending a _batch_ of documents 
> to a collection with 1 shard, 3 replicas; while 1 of the replicas was 
> partitioned off due to a closed proxy.
> In the handful of logs I've examined closely, the 2nd "live" replica does in 
> fact log that it received & processed the update, but with a QTime of over 30 
> seconds, and then it immediately logs an 
> {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
> meanwhile, the leader has one {{updateExecutor}} thread logging copious 
> amounts of {{java.net.ConnectException: Connection refused: no further 
> information}} regarding the replica that was partitioned off, before a second 
> {{updateExecutor}} thread ultimately logs 
> {{java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
> replica.
> 
> What makes this perplexing is that this is not the first time in the test 
> that documents were added to this collection while one replica was 
> partitioned off, but it is the first time that all 3 of the following are 
> true _at the same time_:
> # the collection has recovered after some replicas were partitioned and 
> re-connected
> # a batch of multiple documents is being added
> # one replica has been "re" partitioned.
> ...prior to the point when this failure happens, only individual document 
> adds were tested while replicas were partitioned.  Batches of adds were only 
> tested when all 3 replicas were "live" after the proxies were re-opened and 
> the collection had fully recovered.  The failure also comes from the first 
> update to happen after a replica's proxy port has been "closed" for the 
> _second_ time.
> While this confluence of events might conceivably trigger some weird bug, 
> what makes these failures _particularly_ perplexing is that:
> * the failures only happen on Windows
> * the failures only started after the Windows VM update on June-22.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13579) Create resource management API

2019-07-26 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894184#comment-16894184
 ] 

Hoss Man commented on SOLR-13579:
-

Honestly, i'm still very lost.

Part of my struggle is i'm trying to wade into the patch, and review the APIs 
and functionality it contains, while knowing – as you mentioned – that not 
all the details are here, and it's not fully fleshed out w/everything you 
intend as far as configuration and customization and having more concrete 
implementations beyond just the {{CacheManagerPlugin}}.

I know that in your mind there is more that can/should be done, and that some 
of this code is just "placeholder" for later, but i don't have enough 
familiarity with the "long term" plan to really understand what in the current 
patch is placeholder or stub APIs, vs what is "real" and exists because of long 
term visions for how all of these pieces can be used together in a more 
generalized system – ie: what classes might have surface APIs that look more 
complex than needed given what's currently implemented in the patch, because of 
how you envision those classes being used in the future?

Just to pick one example: take my question about the "ResourceManagerPool" vs 
"ResourceManagerPlugin" – in your reply you said...
{quote}The code in ResourceManagerPool is independent of the type of 
resource(s) that a pool can manage. ...
{quote}
...but the code in {{ResourceManagerPlugin}} is _also_ independent of any 
specific type of resource(s) that a pool can manage – those specifics only 
exist in the concrete subclasses. Hence the crux of my question is why these 
two very generalized pieces of abstract functionality/data collection couldn't 
just be a single abstract base class for all (concrete) ResourceManagerPlugin 
subclasses to extend?
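
(Just to illustrate what i mean by "a single abstract base class" -- a purely 
hypothetical sketch, none of these names are in the patch:)

{code:java}
// one abstract base class holding both the generic pool bookkeeping and the
// type-specific management hook, instead of a separate "Pool" object + "Plugin" object
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public abstract class ManagedPool<T> {
  protected final Map<String, T> components = new ConcurrentHashMap<>();
  protected final Map<String, Object> poolLimits = new ConcurrentHashMap<>();

  public void registerComponent(String name, T component) {
    components.put(name, component);
  }

  public void setPoolLimits(Map<String, Object> limits) {
    poolLimits.putAll(limits);
  }

  // concrete subclasses (ie: a cache pool) implement the type-specific logic
  public abstract void manage();
}
{code}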

Your followup gives a clue...
{quote}...perhaps at some point we could allow a single pool to manage several 
aspects of a component, in which case a pool could have several plugins.
{quote}
but w/o some "concrete hypothetical" examples of what that might look like, 
it's hard to evaluate if the current APIs are the "best" approach, or if maybe 
there is something better/simpler.
{quote}Also, there can be different pools of the same type, each used for a 
different group of components that support the same management aspect. For 
example, for searcher caches we may want to eventually create separate pools 
for filterCache, queryResultCache and fieldValueCache. All of these pools would 
use the same plugin implementation CacheManagerPlugin but configured with 
different params and limits.
{quote}
But even in this situation, there could be multiple *instances* of a 
{{CacheManagerPlugin}}, one for each pool, each with different params and 
limits, w/o needing a distinction between the {{ResourceManagerPlugin}} 
concept/instances and the {{ResourceManagerPool}} concept/instances.

(To be clear, i'm not trying to harp on the specific design/separation/linkage 
of {{ResourceManagerPlugin}} vs {{ResourceManagerPool}} – these are just some 
of the first classes i looked at and had questions about. I'm just using them 
as examples of where/how it's hard to ask questions or form opinions about the 
current API/code w/o having a better grasp of some "concrete specifics" (or even 
"hypothetical specifics") of when/how/where/why each of these APIs are expected 
to be used and interact w/each other.)

Another example of where i got lost as to the specific motivation behind some 
of these APIs in the long term view is in the "loose coupling" that currently 
exists in the patch between the {{ManagedComponent}} API and 
{{ResourceManagerPlugin}}:
 As i understand it:
 * An object in Solr supports being managed by a particular subclass of 
{{ResourceManagerPlugin}} if and only if it extends {{ManagedComponent}} and 
implements {{ManagedComponent.getManagedResourceTypes()}} such that the 
resulting {{Collection}} contains a String matching the return value of 
{{ResourceManagerPlugin.getType()}} for that particular 
{{ResourceManagerPlugin}}
 ** ie: {{SolrCache}} extends the {{ManagedComponent}} interface, and all 
classes implementing {{SolrCache}} should/must implement 
{{getManagedResourceTypes()}} by returning a java {{Collection}} containing 
{{CacheManagerPlugin.TYPE}}
 * once some {{ManagedComponent}} instances are "registered in a pool" and 
managed by a specific {{ResourceManagerPlugin}} instance, then that plugin 
expects to be able to call {{ManagedComponent.setResourceLimits(Map<String, 
Object> limits)}} and {{ManagedComponent.getResourceLimits()}} on all of those 
{{ManagedComponent}} instances, and that both Maps should contain/support a set 
of {{String}} keys specific to that {{ResourceManagerPlugin}} subclass according 
to {{ResourceManagerPlugin.getControlledParams()}}
 ** ie: {{CacheManagerPlugin.getControlledParams()}} returns a java 
{{Collection}} containing 

[jira] [Created] (SOLR-13654) Scary jenkins failure related to collection creation: "non legacy mode coreNodeName missing"

2019-07-25 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13654:
---

 Summary: Scary jenkins failure related to collection creation: 
"non legacy mode coreNodeName missing"
 Key: SOLR-13654
 URL: https://issues.apache.org/jira/browse/SOLR-13654
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
 Attachments: thetaphi_Lucene-Solr-8.2-Linux_452.log.txt

A recent SplitShardTest jenkins failure has a perplexing error that i've been 
unable to reproduce...

{noformat}
   [junit4]> Throwable #1: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://127.0.0.1:36447/_bx/t: Underlying core creation failed 
while creating collection: shardSplitWithRule_link
{noformat}

...this exception is thrown when attempting to create a brand new 1x2 
collection (prior to any splitting) using the following rule/request...

{noformat}
CollectionAdminRequest.Create createRequest = 
CollectionAdminRequest.createCollection(collectionName, "conf1", 1, 2)
.setRule("shard:*,replica:<2,node:*");
{noformat}

...the logs indicate that the specific problem is that the CREATE SolrCore 
commands aren't including a 'coreNodeName', which is mandatory because this is a 
"non legacy" cluster...

{noformat}
   [junit4]   2> 1090551 ERROR (OverseerThreadFactory-6577-thread-5) [ ] 
o.a.s.c.a.c.OverseerCollectionMessageHandler Error from shard: 
http://127.0.0.1:36447/_bx/t
   [junit4]   2>   => 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://127.0.0.1:36447/_bx/t: Error CREATEing SolrCore 
'shardSplitWithRule_link_shard1_replica_n1': non legacy mode coreNodeName 
missing {collection.configName=conf1, numShards=1, shard=shard1, 
collection=shardSplitWithRule_link, replicaType=NRT}
   [junit4]   2>at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
   [junit4]   2> 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://127.0.0.1:36447/_bx/t: Error CREATEing SolrCore 
'shardSplitWithRule_link_shard1_replica_n1': non legacy mode coreNodeName 
missing {collection.configName=conf1, numShards=1, shard=shard1, 
collection=shardSplitWithRule_link, replicaType=NRT}
   [junit4]   2>at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1274) ~[java/:?]
   [junit4]   2>at 
org.apache.solr.handler.component.HttpShardHandlerFactory$1.request(HttpShardHandlerFactory.java:176)
 ~[java/:?]
   [junit4]   2>at 
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:199)
 ~[java/:?]
   [junit4]   2>at 
java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
{noformat}

...so how/why is the Overseer generating CREATE core commands w/o coreNodeName 
params?

Is this a race condition between the test setting legacyCloud=false and the 
Overseer processing the CREATE collection Op?




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13653) java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches

2019-07-25 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-13653.
-
Resolution: Information Provided
  Assignee: Hoss Man

> java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
> ---
>
> Key: SOLR-13653
> URL: https://issues.apache.org/jira/browse/SOLR-13653
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
>  Labels: Java10, java11
>
> we've seen some java11 jenkins builds that have failed due to an 
> AssertionError being thrown by HashMap.put as used in LRUCache -- in at least 
> one case these failures are semi-reproducible. (the occasional "success" is 
> likely due to some unpredictability in thread contention)
> Some cursory investigation suggests that this is JDK-8205399, first identified 
> in java10 and fixed in java12-b26...
> https://bugs.openjdk.java.net/browse/JDK-8205399
> There does not appear to be anything we can do to mitigate this problem in 
> Solr.  
> It's also not clear to me based on the comments in JDK-8205399 if the 
> underlying problem can cause problems for end users running w/assertions 
> disabled, or if it just results in sub-optimal performance



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13653) java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches

2019-07-25 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892983#comment-16892983
 ] 

Hoss Man commented on SOLR-13653:
-

sample failure from jenkins...

http://fucit.org/solr-jenkins-reports/job-data/apache/Lucene-Solr-Tests-master/3453
{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom 
-Dtests.seed=B1CFC66C4378F63 -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=rn -Dtests.timezone=Africa/Kampala -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   18.5s J1 | TestCloudJSONFacetSKG.testRandom <<<
   [junit4]> Throwable #1: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at 
http://127.0.0.1:42227/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection:
 Expected mime type application/octet-stream but got text/html. 
   [junit4]> 
   [junit4]> 
   [junit4]> Error 500 Server Error
   [junit4]> 
   [junit4]> HTTP ERROR 500
   [junit4]> Problem accessing 
/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection/select. 
Reason:
   [junit4]> Server ErrorCaused 
by:java.lang.AssertionError
   [junit4]>at 
java.base/java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1896)
   [junit4]>at 
java.base/java.util.HashMap$TreeNode.putTreeVal(HashMap.java:2061)
   [junit4]>at java.base/java.util.HashMap.putVal(HashMap.java:633)
   [junit4]>at java.base/java.util.HashMap.put(HashMap.java:607)
   [junit4]>at 
org.apache.solr.search.LRUCache.put(LRUCache.java:201)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1449)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:568)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1484)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:398)
   [junit4]>at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:305)
   [junit4]>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
   [junit4]>at 
org.apache.solr.core.SolrCore.execute(SolrCore.java:2581)
{noformat}

As of dc8e9afff92f3ffc4081a2ecad5970eb09924a73 this seed reproduces fairly 
reliably for me using...

{noformat}
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment 18.9 (build 11.0.3+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.3+7, mixed mode)
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10 
-Dtests.failfast=no -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom 
-Dtests.seed=B1CFC66C4378F63 -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=rn -Dtests.timezone=Africa/Kampala -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
...
   [junit4] Tests with failures [seed: B1CFC66C4378F63]:
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4]   - org.apache.solr.search.facet.TestCloudJSONFacetSKG.testRandom
   [junit4] 
   [junit4] 
   [junit4] JVM J0: 1.38 ..   126.31 =   124.93s
   [junit4] JVM J1: 1.39 ..   128.17 =   126.77s
   [junit4] JVM J2: 1.35 ..   123.50 =   122.15s
   [junit4] Execution time total: 2 minutes 8 seconds
   [junit4] Tests summary: 10 suites, 10 tests, 8 errors

BUILD FAILED
{noformat}


> java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches
> ---
>
> Key: SOLR-13653
> URL: https://issues.apache.org/jira/browse/SOLR-13653
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
>  Labels: Java10, java11
>
> we've seen some java11 jenkins builds that have failed due to an 
> AssertionError being thrown by HashMap.put as used in LRUCache -- in at least 
> one case these failures are semi-reproducible. (the occasional "success" is 
> likely due to some unpredictability in thread contention)
> Some cursory investigation 

[jira] [Created] (SOLR-13653) java (10 & 11) HashMap bug can trigger AssertionError when using SolrCaches

2019-07-25 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13653:
---

 Summary: java (10 & 11) HashMap bug can trigger AssertionError 
when using SolrCaches
 Key: SOLR-13653
 URL: https://issues.apache.org/jira/browse/SOLR-13653
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man


we've seen some java11 jenkins builds that have failed due to an AssertionError 
being thrown by HashMap.put as used in LRUCache -- in at least one case these 
failures are semi-reproducible. (the occasional "success" is likely due to some 
unpredictability in thread contention)

Some cursory investigation suggests that this is JDK-8205399, first identified 
in java10 and fixed in java12-b26...

https://bugs.openjdk.java.net/browse/JDK-8205399

There does not appear to be anything we can do to mitigate this problem in 
Solr.  
It's also not clear to me based on the comments in JDK-8205399 if the underlying 
problem can cause problems for end users running w/assertions disabled, or if 
it just results in sub-optimal performance



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-07-25 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892894#comment-16892894
 ] 

Hoss Man commented on SOLR-13399:
-

bq. (unless you mean we've generally moved to doing doc it as part of the 
initial commit? If so, I missed that.)

yes, that's the entire value add of keeping the ref-guide in the same repo as 
the source, and having it as part of the main build w/precommit.

we've been trying to move to having the "code release process" and the 
"ref-guide release process" be a single process, with a single vote -- and 
we're getting close -- but the main holdup is people adding features w/o docs, 
which then forces a scramble during the release process to backfill docs on new 
features.

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout.) . Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-07-24 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892304#comment-16892304
 ] 

Hoss Man commented on SOLR-13399:
-

Also: it's really not cool to be adding new end-user features/params w/o at 
least adding a one-line summary of the new param to the relevant ref-guide page.

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout.) . Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Reopened] (SOLR-13399) compositeId support for shard splitting

2019-07-24 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-13399:
-


Since the new SplitByPrefixTest was committed as part of this jira, it has failed 
a little over 5% of the times it's been run by jenkins -- on both master and 
branch_8x.

All of these failures occur at the same {{assertTrue(slice1 != slice2)}} call 
(SplitByPrefixTest.java:222) and all of the seeds i've tested appear to 
reproduce reliably...


on master...

{noformat}
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10 
-Dtests.failfast=no  -Dtestcase=SplitByPrefixTest -Dtests.method=doTest 
-Dtests.seed=4A09C6784BF1B28F -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=ar-YE -Dtests.timezone=MET -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
...
   [junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
hossman@tray:~/lucene/dev/solr/core [j11] [master] $ ant test -Dtests.dups=10 
-Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest 
-Dtests.seed=75D9C45CAC5D0D22 -Dtests.slow=true -Dtests.locale=yo-BJ 
-Dtests.timezone=Africa/Porto-Novo -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
...
   [junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
{noformat}

On branch_8x...

{noformat}
hossman@tray:~/lucene/dev/solr/core [j8] [branch_8x] $ ant test -Dtests.dups=10 
-Dtests.failfast=no -Dtestcase=SplitByPrefixTest -Dtests.method=doTest 
-Dtests.seed=B980178A30F46BB3 -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=ko-KR -Dtests.timezone=Africa/Abidjan -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
...
   [junit4] Tests summary: 10 suites, 10 tests, 10 failures
...
{noformat}


> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout.) . Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13637) Enable loading of plugins from the corecontainer memclassloader

2019-07-24 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892277#comment-16892277
 ] 

Hoss Man commented on SOLR-13637:
-

git bisect has identified 631edee1cba00d7fa41ac6e8c597a467db56346d as the cause 
of a recent spike in reproducible BasicAuthIntegrationTest jenkins failures on 
master. (Similar failures have been observed on branch_8x as well but i have 
not bisected those)

FWIW: Mikhail did some initial investigation into these failures in SOLR-13545 
due to initial speculation that that issue caused the failures, and noted that 
they seemed to correspond to the randomization of using V2 API calls.

nature of failures...
{noformat}
   [junit4]   2> 51548 ERROR (qtp206015367-815) [n:127.0.0.1:43556_solr ] 
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No 
contentStream
   [junit4]   2>at 
org.apache.solr.handler.admin.SecurityConfHandler.doEdit(SecurityConfHandler.java:103)
   [junit4]   2>at 
org.apache.solr.handler.admin.SecurityConfHandler.handleRequestBody(SecurityConfHandler.java:85)
   [junit4]   2>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
   [junit4]   2>at 
org.apache.solr.api.ApiBag$ReqHandlerToApi.call(ApiBag.java:247)
   [junit4]   2>at 
org.apache.solr.api.V2HttpCall.handleAdmin(V2HttpCall.java:341)
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:786)
   [junit4]   2>at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:546)
...
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=BasicAuthIntegrationTest -Dtests.method=testBasicAuth 
-Dtests.seed=B292FDDCA6F4D6F2 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=lb-LU -Dtests.timezone=Pacific/Easter -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 5.21s J2 | BasicAuthIntegrationTest.testBasicAuth <<<
   [junit4]> Throwable #1: java.lang.AssertionError: expected:<401> but 
was:<400>
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([B292FDDCA6F4D6F2:EFC8BCE02A75588]:0)
   [junit4]>at 
org.apache.solr.security.BasicAuthIntegrationTest.testBasicAuth(BasicAuthIntegrationTest.java:151)
{noformat}
...note the seed mentioned in the reproduce line. that's an example of a seed 
that fails 100% reliably (on my machine) on master as of both HEAD and 
631edee1cba00d7fa41ac6e8c597a467db56346d, but does not fail on the previous 
commit ( 7d716f11075f0868535c108b21256a3b91b4a154 ).  There are dozens of other 
seeds from recent jenkins failures that reliably reproduce in the same way 
(NOTE: i did not bisect test them all, but i did manually test a few of them 
against 631edee1cba00d7fa41ac6e8c597a467db56346d and 
7d716f11075f0868535c108b21256a3b91b4a154)

Considering how many "addressing test failures" commits you've had to make as 
part of this issue just to address the TestContainerReqHandler failures it 
introduced, not to mention these BasicAuthIntegrationTest failures we've now 
identified, I would strongly urge you to please:
 # *COMPLETELY* revert all commits made as part of this issue to date
 # refrain from re-committing any related changes until you have a chance to 
beast all tests with multiple seeds, since this clearly impacts the entire code 
base in ways you evidently didn't anticipate
 # once you have a unified set of changes w/working tests, re-commit only to 
master, and give it a few days to ensure no related failures, before 
backporting to branch_8x

> Enable loading of plugins from the corecontainer memclassloader
> ---
>
> Key: SOLR-13637
> URL: https://issues.apache.org/jira/browse/SOLR-13637
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When we update jars or add/modify plugins no core reloading should be 
> required .Core reloading is a very expensive operation. Optionally, we can 
> just make the plugin depend on the corecontainer level classloader. 
> {code:xml}
>   runtimeLib="global">
>  
>   
> {code}
> or alternately using the config API 
> {code:json}
> curl -X POST -H 'Content-type:application/json' --data-binary '
> {
>   "create-queryparser": {
>   "name": "mycustomQParser" ,
>   "class" : "my.path.to.ClassName",
>  "runtimeLib" : "global"
>   }
> }' http://localhost:8983/api/c/mycollection/config
> {code}
> The global classloader is the corecontainer level classloader. So whenever 
> this is reloaded, the component gets reloaded. The only caveat is that this 
> component cannot use core specific jars.
> We will deprecate the {{runtimeLib = true/false}} option and the 

[jira] [Commented] (SOLR-13579) Create resource management API

2019-07-24 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892241#comment-16892241
 ] 

Hoss Man commented on SOLR-13579:
-

I spent some time briefly skimming the patch, and TBH got lost very quickly.

I think it would be helpful (probably to more folks than just myself) if we 
could discuss, in "story" form, some (existing or hypothetical) examples of 
scenarios that could come up; how this new system would be helpful & behave in 
those scenarios, and what classes/objects (either in this patch, or yet to be 
written) would be responsible for each bit of action/reaction in those stories.

ie: I'm a solr cluster admin and I have some existing collections using the 
(existing) default cache configurations. When/why might i want to set up some 
pools? what types of steps would i take to do so? how would my configuration(s) 
change? After i have some pools in place, what's an example of something that 
might happen during runtime that would cause the ResourceManager to "do 
something" with my pools/caches? what would that "do something" look like in 
terms of method call stacks?  what would the effective end result be from my 
perspective as an external observer?

Some specific bits that confuse me as i try to wrap my head around the current 
patch...
 * If each named "pool" has exactly one ResourceManagerPlugin that contains the 
(type specific) actual logic for managing "the pool" (and the resources 
using that pool), then why is the "ResourceManagerPool" class different from the 
"ResourceManagerPlugin" class?
 ** as opposed to combining that logic into a single common base class?
 ** is there a one-to-many/many-to-one relationship between them that i'm not 
understanding?

 * can you elaborate on this comment with some concrete examples:
{quote}Each managed resource can be managed by multiple types of plugins and it 
may appear in multiple pools (of different types). This reflects the fact that 
a single component may have multiple aspects of resource management - eg. cache 
mgmt, cpu, threads, etc.
{quote}
 ** ie: if "CacheManagerPlugin.TYPE" is one "type" of pool that a SolrCache 
(implements ManagedResource) might be managed by, what would another 
hypothetical "type" of plugin/pool be that SolrCache might also be a part of?
 *** or if you can't think of a good example of two diff types that a SolrCache 
would be managed by, any example of a concept/object in solr that might become 
a "ManagedResource" that could be managed by two different types of plugins as 
part of 2 diff pools would be helpful
 ** What happens if a single ManagedResource is part of two different "pools" 
with two different ResourceManagerPlugins that give conflicting/overlapping 
instructions?

 * regarding this comment...
{quote}Each pool also has plugin-specific parameters, most notably the limits - 
eg. max total cache size, which the CacheManagerPlugin knows how to use in 
order to adjust cache sizes.
{quote}
 ** does that imply that once SolrCache(s) are part of a "pool" they no longer 
have their own max size(s) ? or is the configured max size of an individual 
cache(s) still a hard upper bound on the "managed size" that might be set at 
runtime as the plugins fire?
 ** how/where would someone specify a "preference" for ensuring that if a 
"pool" is "full" that certain resources should be managed more agressively then 
others – ex: imagine a cluster admin wants all collections to have SolrCaches 
that are "as big as possible" given the resources of the machines, but wants to 
give priority to a certain subset of the "important" collections if resources 
get constrained; what/where would that be done?


Also, FYI: with this patch, we now have 2 "ManagedResource" classes in 
solr/core that have absolutely nothing to do with each other...
{noformat}
$ find -name ManagedResource.java
./solr/core/src/java/org/apache/solr/rest/ManagedResource.java
./solr/core/src/java/org/apache/solr/managed/ManagedResource.java
{noformat}
...that's a little weird.

> Create resource management API
> --
>
> Key: SOLR-13579
> URL: https://issues.apache.org/jira/browse/SOLR-13579
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Major
> Attachments: SOLR-13579.patch, SOLR-13579.patch, SOLR-13579.patch, 
> SOLR-13579.patch, SOLR-13579.patch
>
>
> Resource management framework API supporting the goals outlined in SOLR-13578.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13375) 2 Dimensional Routed Aliases

2019-07-18 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888371#comment-16888371
 ] 

Hoss Man commented on SOLR-13375:
-


The new DimensionalRoutedAliasUpdateProcessorTest appears to have some reliably 
reproducible bugs.

In the last 24 hours...
{quote}
Class: 
org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest
Method: testCatTime
Failures: 31.43% (11 / 35)
* thetaphi/Lucene-Solr-master-Windows/8059 (x6)
* apache/Lucene-Solr-repro/3446 (x5)
{quote}

The seeds from both of those jenkins jobs reproduce for me locally (first 
try)...
{noformat}
$ ant test  -Dtestcase=DimensionalRoutedAliasUpdateProcessorTest 
-Dtests.method=testCatTime -Dtests.seed=DD5BB3E097BBD0B4 -Dtests.slow=true 
-Dtests.badapples=true -Dtests.locale=br -Dtests.timezone=Asia/Dushanbe 
-Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

$ ant test  -Dtestcase=DimensionalRoutedAliasUpdateProcessorTest 
-Dtests.method=testCatTime -Dtests.seed=21AAE082AEE603F0 -Dtests.multiplier=2 
-Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true 
-Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-8.x/test-data/enwiki.random.lines.txt
 -Dtests.locale=it -Dtests.timezone=Pacific/Port_Moresby -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
{noformat}

more failures from the past 7 days...
{quote}

Class: 
org.apache.solr.update.processor.DimensionalRoutedAliasUpdateProcessorTest
Method: testCatTime
Failures: 18.95% (29 / 153)

* apache/Lucene-Solr-repro/3446 (x5)
* sarowe/Lucene-Solr-reproduce-failed-tests/8776
* thetaphi/Lucene-Solr-8.x-Windows/371 (x6)
* thetaphi/Lucene-Solr-master-Windows/8059 (x6)
* sarowe/Lucene-Solr-tests-master/21358
* sarowe/Lucene-Solr-tests-master/21339
* sarowe/Lucene-Solr-tests-master/21355
* apache/Lucene-Solr-NightlyTests-8.x/153
* thetaphi/Lucene-Solr-8.x-Linux/879
* thetaphi/Lucene-Solr-8.x-Solaris/240 (x6)
{quote}


> 2 Dimensional Routed Aliases
> 
>
> Key: SOLR-13375
> URL: https://issues.apache.org/jira/browse/SOLR-13375
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Affects Versions: master (9.0)
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
> Attachments: SOLR-13375.patch, SOLR-13375.patch, SOLR-13375.patch, 
> SOLR-13375.patch, SOLR-13375.patch
>
>
> Current available routed aliases are restricted to a single field. This 
> feature will allow Solr to provide data driven collection access, creation 
> and management based on multiple fields in a document. The collections will 
> be queried and updated in a unified manner via an alias. Current routing is 
> restricted to the values of a single field. The particularly useful 
> combination at this time will be Category X Time routing but Category X 
> Category may also be useful. More importantly, if additional routing schemes 
> are created in the future (either as contributions or as custom code by 
> users) combination among these should be supported. 
> It is expected that not all combinations will be useful, and I expect to 
> leave the determination of usefulness up to the user. Some routing 
> schemes may need to be limited to be the leaf/last routing scheme for 
> technical reasons, though I'm not entirely convinced of that yet. If so, a 
> flag will be added to the RoutedAlias interface.
> Initial desire is to support two levels, though if arbitrary levels can be 
> supported easily that will be done.
> This could also have been called CompositeRoutedAlias, but that creates a TLA 
> clash with CategoryRoutedAlias.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8908) Specified default value not returned for query() when doc doesn't match

2019-07-17 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887518#comment-16887518
 ] 

Hoss Man commented on LUCENE-8908:
--

[~munendrasn] at first glance this looks good ... but i'm wondering if this 
actually fixes all of the examples i mentioned when this was opened -- in 
particular things like {{exists(query($qx,0))}} vs {{exists(query($qx))}} ... 
those should return different things depending on whether the doc matches $qx 
or not. 

IIRC that would require modifying QueryDocValues.exists() to return "true" 
anytime there is a defVal, but i don't think that's really possible ATM because 
it's a {{float}} (not a nullable Float) ... and i'm not sure off the top of my 
head that it would even be the ideal behavior for the QueryDocValues code ? ... 
 maybe the solr ValueSourceParser logic should be changed to put an explicit 
wrapper around the QueryValueSource when a default is (isn't?) used ... not 
sure, i haven't looked / thought about this code in a long time.

> Specified default value not returned for query() when doc doesn't match
> ---
>
> Key: LUCENE-8908
> URL: https://issues.apache.org/jira/browse/LUCENE-8908
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Bill Bell
>Priority: Major
> Attachments: LUCENE-8908.patch, SOLR-7845.patch, SOLR-7845.patch
>
>
> The 2 arg version of the "query()" was designed so that the second argument 
> would specify the value used for any document that does not match the query 
> specified by the first argument -- but the "exists" property of the resulting 
> ValueSource only takes into consideration whether or not the document matches 
> the query -- and ignores the use of the second argument.
> 
> The workaround is to ignore the 2 arg form of the query() function, and 
> instead wrap the query function in def().
> for example:  {{def(query($something), $defaultval)}} instead of 
> {{query($something, $defaultval)}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886513#comment-16886513
 ] 

Hoss Man commented on LUCENE-8920:
--

[~sokolov] - your revert on branch_8_2 seems to have broken most of the 
lucene/analysis/kuromoji tests with a common root cause...

{noformat}
  [junit4] ERROR   0.44s J0 | TestFactories.test <<<
   [junit4]> Throwable #1: java.lang.ExceptionInInitializerError
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([B1B94D34D92CDA93:39ED72EE77D0B76B]:0)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
   [junit4]>at 
org.apache.lucene.analysis.ja.JapaneseTokenizer.(JapaneseTokenizer.java:215)
   [junit4]>at 
org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:150)
   [junit4]>at 
org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:82)
   [junit4]>at 
org.apache.lucene.analysis.ja.TestFactories$FactoryAnalyzer.createComponents(TestFactories.java:174)
   [junit4]>at 
org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:427)
   [junit4]>at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
   [junit4]>at 
org.apache.lucene.analysis.ja.TestFactories.doTestTokenizer(TestFactories.java:81)
   [junit4]>at 
org.apache.lucene.analysis.ja.TestFactories.test(TestFactories.java:60)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
   [junit4]> Caused by: java.lang.RuntimeException: Cannot load 
TokenInfoDictionary.
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.(TokenInfoDictionary.java:71)
   [junit4]>... 46 more
   [junit4]> Caused by: org.apache.lucene.index.IndexFormatTooNewException: 
Format version is not supported (resource 
org.apache.lucene.store.InputStreamDataInput@5f0dbb2f): 7 (needs to be between 
6 and 6)
   [junit4]>at 
org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216)
   [junit4]>at 
org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198)
   [junit4]>at org.apache.lucene.util.fst.FST.(FST.java:275)
   [junit4]>at org.apache.lucene.util.fst.FST.(FST.java:263)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:47)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:54)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.(TokenInfoDictionary.java:32)
   [junit4]>at 
org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.(TokenInfoDictionary.java:69)
   [junit4]>... 46 more

{noformat}

...perhaps due to "conflicting reverts" w/ LUCENE-8907 / LUCENE-8778 ?
/cc [~tomoko]

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Reopened] (SOLR-13534) Dynamic loading of jars from a url

2019-07-15 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reopened SOLR-13534:
-

Noble: the new TestDynamicLoadingUrl can fail easily on heavily loaded machines, 
because loading the "new" SolrCore instances (that result from reloading all cores 
after a config change) may not complete prior to the subsequent REST calls that 
depend on them.

In many recent jenkins failures this even happens as a result of the first 
SolrCore reload needed when executing the {{'create-requesthandler'}} config 
command to register the {{/jarhandler}} used by the test to "fake" a remote 
server.
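
One way a test can avoid that race (purely a sketch, not the actual test code) is to 
poll the Config API overlay until the reloaded cores actually report the new handler 
before issuing the calls that depend on it. The helper below is hypothetical and uses 
only stock SolrJ:

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class ConfigWait {
  /** Polls /config/overlay until it mentions the given handler name, or times out. */
  public static void waitForHandler(SolrClient client, String collection,
      String handlerName, int maxSeconds) throws Exception {
    final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxSeconds);
    while (System.nanoTime() < deadline) {
      NamedList<Object> rsp = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/config/overlay",
              new ModifiableSolrParams()),
          collection);
      if (rsp.toString().contains(handlerName)) {
        return; // the reloaded core is live and knows about the new handler
      }
      Thread.sleep(250);
    }
    throw new AssertionError("timed out waiting for handler " + handlerName);
  }
}
{code}

i.e. something like {{waitForHandler(cluster.getSolrClient(), "collection1", "/jarhandler", 30)}} 
right after the {{'create-requesthandler'}} command, before any request that needs the handler.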

Here's an example of the order of the "going to send config command" logging 
from the test compared to the "CLOSING SolrCore" logging (indicating that the 
_new_ reloaded version of the core is already live and ready for requests) in a 
recent jenkins failure...
{noformat}
$ sed -n -e '/o.a.s.c.TestSolrConfigHandler/,+1p' -e '/config update listener 
called/p' -e '/CLOSING SolrCore/p' 
thetaphi_Lucene-Solr-master-MacOSX_5259.log.txt
   [junit4]   2> 549681 INFO  
(TEST-TestDynamicLoadingUrl.testDynamicLoadingUrl-seed#[B56AB6A699945C60]) [
 ] o.a.s.c.TestSolrConfigHandler going to send config command. path /config , 
payload: {
   [junit4]   2> 'create-requesthandler' : { 'name' : '/jarhandler', 'class': 
org.apache.solr.core.TestDynamicLoadingUrl$JarHandler, registerPath: 
'/solr,/v2' }
   [junit4]   2> 549688 INFO  (Thread-2015) [ ] o.a.s.c.SolrCore config 
update listener called for core collection1_shard2_replica_n1
   [junit4]   2> 549688 INFO  (Thread-2014) [ ] o.a.s.c.SolrCore config 
update listener called for core collection1_shard1_replica_n2
   [junit4]   2> 549689 INFO  (Thread-2013) [ ] o.a.s.c.SolrCore config 
update listener called for core control_collection_shard1_replica_n1
   [junit4]   2> 549689 INFO  (Thread-2017) [ ] o.a.s.c.SolrCore config 
update listener called for core collection1_shard2_replica_n5
   [junit4]   2> 549689 INFO  (Thread-2016) [ ] o.a.s.c.SolrCore config 
update listener called for core collection1_shard1_replica_n6
   [junit4]   2> 550916 INFO  (Thread-2013) [n:127.0.0.1:56493_m 
c:control_collection s:shard1 r:core_node2 
x:control_collection_shard1_replica_n1 ] o.a.s.c.SolrCore 
[control_collection_shard1_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@3bceb4b6
   [junit4]   2> 550921 INFO  (Thread-2015) [n:127.0.0.1:56521_m c:collection1 
s:shard2 r:core_node3 x:collection1_shard2_replica_n1 ] o.a.s.c.SolrCore 
[collection1_shard2_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@724097d0
   [junit4]   2> 550986 INFO  (qtp1160063522-12178) [n:127.0.0.1:56525_m 
c:collection1 s:shard1 r:core_node4 x:collection1_shard1_replica_n2 ] 
o.a.s.c.SolrCore [collection1_shard1_replica_n2]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@6972fc14
   [junit4]   2> 551100 INFO  
(TEST-TestDynamicLoadingUrl.testDynamicLoadingUrl-seed#[B56AB6A699945C60]) [
 ] o.a.s.c.TestSolrConfigHandler going to send config command. path /config , 
payload: {
   [junit4]   2> 'add-runtimelib' : { 'name' : 'urljar', url : 
'http://127.0.0.1:56525/m/collection1/jarhandler?wt=filestream'  
'sha512':'e01b51de67ae1680a84a813983b1de3b592fc32f1a22b662fc9057da5953abd1b72476388ba342cad21671cd0b805503c78ab9075ff2f3951fdf75fa16981420'}}
   [junit4]   2> 55 INFO  
(TEST-TestDynamicLoadingUrl.testDynamicLoadingUrl-seed#[B56AB6A699945C60]) [
 ] o.a.s.c.TestSolrConfigHandler going to send config command. path /config , 
payload: {
   [junit4]   2> 'add-runtimelib' : { 'name' : 'urljar', url : 
'http://127.0.0.1:56531/m/collection1/jarhandler?wt=filestream'  
'sha512':'d01b51de67ae1680a84a813983b1de3b592fc32f1a22b662fc9057da5953abd1b72476388ba342cad21671cd0b805503c78ab9075ff2f3951fdf75fa16981420'}}
   [junit4]   2> 551126 INFO  (Thread-2016) [n:127.0.0.1:56536_m c:collection1 
s:shard1 r:core_node8 x:collection1_shard1_replica_n6 ] o.a.s.c.SolrCore 
[collection1_shard1_replica_n6]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@3ff8c0a0
   [junit4]   2> 551145 INFO  (Thread-2017) [n:127.0.0.1:56531_m c:collection1 
s:shard2 r:core_node7 x:collection1_shard2_replica_n5 ] o.a.s.c.SolrCore 
[collection1_shard2_replica_n5]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@11bde630
   [junit4]   2> 551232 INFO  (coreCloseExecutor-4027-thread-1) 
[n:127.0.0.1:56493_m c:control_collection s:shard1 r:core_node2 
x:control_collection_shard1_replica_n1 ] o.a.s.c.SolrCore 
[control_collection_shard1_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@3c0a3773
   [junit4]   2> 551234 INFO  (coreCloseExecutor-4026-thread-1) 
[n:127.0.0.1:56521_m c:collection1 s:shard2 r:core_node3 
x:collection1_shard2_replica_n1 ] o.a.s.c.SolrCore 
[collection1_shard2_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@735e18b7
   [junit4]   2> 551235 INFO  

[jira] [Updated] (SOLR-13627) newly created collection can see all replicas go into recovery immediately on first document addition

2019-07-13 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13627:

Attachment: apache_Lucene-Solr-NightlyTests-8.x_143.log.txt
Status: Open  (was: Open)


I'm attaching {{apache_Lucene-Solr-NightlyTests-8.x_143.log.txt}} with the full 
test logs, but the gist of the situation is that these lines of 
code...

{code}
// NOTE: legacyCloud == false

CollectionAdminRequest.setClusterProperty(ZkStateReader.LEGACY_CLOUD, 
legacyCloud).process(cluster.getSolrClient());
final String collectionName = "deleteFromClusterState_"+legacyCloud;
CollectionAdminRequest.createCollection(collectionName, "conf", 1, 3)
.process(cluster.getSolrClient());

cluster.waitForActiveCollection(collectionName, 1, 3);

cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "1"));
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "2"));
cluster.getSolrClient().commit(collectionName);
{code}

Results in the following key bits of logging...

{noformat}

# set the cluster prop...

   [junit4]   2> 4266069 INFO  (qtp1645118155-93196) [n:127.0.0.1:38694_solr
 ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :clusterprop with 
params val=false=legacyCloud=CLUSTERPROP=javabin=2 and 
sendToOCPQueue=true

#
# Collection creation, w/leader election...
#

   [junit4]   2> 4266070 INFO  (qtp1645118155-93196) [n:127.0.0.1:38694_solr
 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections 
params={val=false=legacyCloud=CLUSTERPROP=javabin=2} 
status=0 QTime=1
   [junit4]   2> 4266071 INFO  (qtp1645118155-93197) [n:127.0.0.1:38694_solr
 ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params 
collection.configName=conf=deleteFromClusterState_false=3=CREATE=1=javabin=2
 and sendToOCPQueue=true

...

   [junit4]   2> 4266757 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr
x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.h.a.CoreAdminOperation 
core create command 
qt=/admin/cores=core_node5=conf=true=deleteFromClusterState_false_shard1_replica_n2=CREATE=1=deleteFromClusterState_false=shard1=javabin=2=NRT
   [junit4]   2> 4266814 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr
x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.h.a.CoreAdminOperation 
core create command 
qt=/admin/cores=core_node6=conf=true=deleteFromClusterState_false_shard1_replica_n3=CREATE=1=deleteFromClusterState_false=shard1=javabin=2=NRT
   [junit4]   2> 4266832 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr
x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.h.a.CoreAdminOperation 
core create command 
qt=/admin/cores=core_node4=conf=true=deleteFromClusterState_false_shard1_replica_n1=CREATE=1=deleteFromClusterState_false=shard1=javabin=2=NRT

...
   [junit4]   2> 4268946 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr 
c:deleteFromClusterState_false s:shard1 r:core_node5 
x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkShardTerms 
Successful update of terms at 
/collections/deleteFromClusterState_false/terms/shard1 to 
Terms{values={core_node5=0}, version=0}
...
   [junit4]   2> 4269040 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr 
c:deleteFromClusterState_false s:shard1 r:core_node6 
x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Failed 
to save terms, version is not a match, retrying
   [junit4]   2> 4269040 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr 
c:deleteFromClusterState_false s:shard1 r:core_node4 
x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ZkShardTerms 
Successful update of terms at 
/collections/deleteFromClusterState_false/terms/shard1 to 
Terms{values={core_node4=0, core_node5=0}, version=1}
   [junit4]   2> 4269040 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr 
c:deleteFromClusterState_false s:shard1 r:core_node4 
x:deleteFromClusterState_false_shard1_replica_n1 ] 
o.a.s.c.ShardLeaderElectionContextBase make sure parent is created 
/collections/deleteFromClusterState_false/leaders/shard1
...
   [junit4]   2> 4269106 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr 
c:deleteFromClusterState_false s:shard1 r:core_node6 
x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms 
Successful update of terms at 
/collections/deleteFromClusterState_false/terms/shard1 to 
Terms{values={core_node6=0, core_node4=0, core_node5=0}, version=2}
...

   [junit4]   2> 4269487 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr 
c:deleteFromClusterState_false s:shard1 r:core_node5 
x:deleteFromClusterState_false_shard1_replica_n2 ] 
o.a.s.c.ShardLeaderElectionContext Enough replicas found to continue.
   [junit4]   2> 4269487 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr 
c:deleteFromClusterState_false s:shard1 r:core_node5 
x:deleteFromClusterState_false_shard1_replica_n2 ] 

[jira] [Created] (SOLR-13627) newly created collection can see all replicas go into recovery immediately on first document addition

2019-07-13 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13627:
---

 Summary: newly created collection can see all replicas go into 
recovery immediately on first document addition
 Key: SOLR-13627
 URL: https://issues.apache.org/jira/browse/SOLR-13627
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man



There's something very weird going on that popped up in a recent jenkins run of 
{{DeleteReplicaTest.deleteReplicaFromClusterState}}.  While the test has some 
issues of its own, and ultimately failed due to a combination of SOLR-13616 + 
some sloppy assertions (which I will attempt to address independently of this 
Jira), a more alarming situation is what the logs show at the _beginning_ of the 
tests, before any problems occurred.

In a nutshell: *Just the act of creating a 1x3 collection and adding some docs 
to it caused both of the non-leader replicas to immediately decide they needed 
to go into recovery.*

Details to follow in comments...




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13532) Unable to start core recovery due to timeout in ping request

2019-07-11 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13532:

   Resolution: Fixed
 Assignee: Hoss Man
Fix Version/s: 8.3
   8.2
   master (9.0)
   Status: Resolved  (was: Patch Available)

Thanks Suril!

> Unable to start core recovery due to timeout in ping request
> 
>
> Key: SOLR-13532
> URL: https://issues.apache.org/jira/browse/SOLR-13532
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 7.6
>Reporter: Suril Shah
>Assignee: Hoss Man
>Priority: Major
> Fix For: master (9.0), 8.2, 8.3
>
> Attachments: SOLR-13532.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Discovered following issue with the core recovery:
>  * Core recovery is not being initialized and throwing following exception 
> message :
> {code:java}
> 2019-06-07 00:53:12.436 INFO  
> (recoveryExecutor-4-thread-1-processing-n::8983_solr 
> x:_shard41_replica_n2777 c: s:shard41 
> r:core_node2778) x:_shard41_replica_n2777 
> o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr 
> on recovery, try again{code}
>  * Above error occurs when ping request takes time more than a timeout period 
> which is hard-coded to one second in the Solr source code. However, in a general 
> production setting it is common to have ping times of more than one second; 
> hence, the core recovery never starts and the exception is thrown.
>  * Also the other major concern is that this exception is logged as an info 
> message, hence it is very difficult to identify the error if info logging is 
> not enabled.
>  * Please refer to following code snippet from the [source 
> code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
>  to understand the above issue.
> {code:java}
>   try (HttpSolrClient httpSolrClient = new 
> HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>   .withSocketTimeout(1000)
>   .withConnectionTimeout(1000)
>   
> .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>   .build()) {
> SolrPingResponse resp = httpSolrClient.ping();
> return leaderReplica;
>   } catch (IOException e) {
> log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
> Thread.sleep(500);
>   } catch (Exception e) {
> if (e.getCause() instanceof IOException) {
>   log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
>   Thread.sleep(500);
> } else {
>   return leaderReplica;
> }
>   }
> {code}
> The above issue will have high impact in production level clusters, since 
> cores not being able to recover may lead to data loss.
> Following improvements would be really helpful:
>  1. The [timeout for ping 
> request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
>  in *RecoveryStrategy.java* should be configurable and the defaults set to 
> high values like 15 seconds.
>  2. The exception message in [line 
> 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
>  and [line 
> 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
>  in *RecoveryStrategy.java* should be logged as *error* messages instead of 
> *info* messages



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

2019-07-11 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883157#comment-16883157
 ] 

Hoss Man commented on SOLR-13616:
-

not sure why/how gitbox missed the 8x cherry-pick: 
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=81b2e06ffe6bddcd8d25b24c79683281da85baee

> Possible racecondition/deadlock between collection DELETE and PrepRecovery ? 
> (TestPolicyCloud failures)
> ---
>
> Key: SOLR-13616
> URL: https://issues.apache.org/jira/browse/SOLR-13616
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-13616.test-incomplete.patch, 
> thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is 
> a possible deadlock condition when attempting to delete a collection while 
> recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but 
> it does not appear to be a test specific problem, and seems like it could 
> potentially affect anyone unlucky enough to issue a poorly timed DELETE.
> Details to follow in comments...



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

2019-07-10 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882245#comment-16882245
 ] 

Hoss Man commented on SOLR-13616:
-

{quote}I'm not sure we should change the waitForState logic to rethrow 
Exceptions or revert back PrepRecoveryOp to its previous version ...
{quote}
{quote}Hoss and Dat – thank you for investigating this! All usages of 
CollectionStateWatcher or LiveNodesWatcher will suffer from this problem i.e. 
the thread that runs the watcher swallows the exception ...
{quote}
Well, generally speaking there isn't any way (i can think of) for the thread 
executing a Watcher to do anything _but_ swallow any exceptions from the 
watcher – it can't propagate them back to the "caller" of registerWatcher or 
anything like that ... if the caller wanted to be informed, then the Watcher it 
registered should be catching the exceptions itself.

But to Dat's point: in the specific case of {{waitForState}} – there 
ZkStateReader *is* creating its own Watcher to wrap the input Predicate, and 
we could in fact make waitForState do something inside that Watcher that 
catches any Exception thrown by the Predicate and short-circuits out of the 
{{waitForState}} call, wrapping/re-throwing the exception in the meantime.
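
To make that concrete, here is a minimal sketch (assuming the existing SolrJ 
{{ZkStateReader.waitForState(String, long, TimeUnit, CollectionStatePredicate)}} 
signature; the helper class and method name are made up, this is not a proposed patch):

{code:java}
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.solr.common.cloud.CollectionStatePredicate;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.ZkStateReader;

public class FailsafeWait {
  public static void failsafeWaitForState(ZkStateReader reader, String collection,
      long wait, TimeUnit unit, CollectionStatePredicate delegate)
      throws InterruptedException, TimeoutException {
    final AtomicReference<RuntimeException> failure = new AtomicReference<>();
    CollectionStatePredicate wrapped = (Set<String> liveNodes, DocCollection state) -> {
      try {
        return delegate.matches(liveNodes, state);
      } catch (RuntimeException e) {
        failure.set(e);   // remember why the predicate blew up ...
        return true;      // ... and short-circuit the wait
      }
    };
    reader.waitForState(collection, wait, unit, wrapped);
    if (failure.get() != null) {
      throw failure.get(); // surface the predicate's exception to the caller
    }
  }
}
{code}

The caller still has to decide what to do with the surfaced exception, which is really 
the open question for PrepRecovery below.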

But those seem like "broader" problems with regards to where/how the different 
callers are using the Watcher/waitForState APIs that we should probably create 
a new issue to track (for auditing all of them and clarifying the behavior in 
the javadocs) ... frankly i think in this specific jira we should be asking a 
lot more questions about the _specific_ predicate used in PrepRecoveryOp's 
waitForState call ... notably what exactly is the expectation here when the 
SolrCore (that prepRecovery wants to recover from) can't be found _in the local 
CoreContainer_ ... deleting the collection is just one example, are there other 
situations where the core may not be found at this point in the code? (node 
shutdown perhaps? autoscaling removing a replica?)

what about a few lines later...
{code:java}
  if (onlyIfLeader != null && onlyIfLeader) {
if (!core.getCoreDescriptor().getCloudDescriptor().isLeader()) {
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "We 
are not the leader");
}
  }
{code}
...even if the SolrCore is found, if we expect it to be the shard leader, and 
it's not (what if there has been a leader election in the meantime?) then that's 
another type of problem that will also cause the predicate to throw an 
exception that will (apparently) cause PrepRecovery to stall. what should 
PrepRecovery do here?

i suspect that in general the use of waitForState here in PrepRecoveryOp is "ok 
in concept" ... we just need to make the predicate smarter about exiting 
immediately in these situations instead of throwing an exception that gets 
swallowed ... i'm just not sure what the right behavior for PrepRecovery *is* 
in these situations.

I don't suppose either of you were able to spot what's "wrong" with my test 
that it doesn't force a failure in this situation?

> Possible racecondition/deadlock between collection DELETE and PrepRecovery ? 
> (TestPolicyCloud failures)
> ---
>
> Key: SOLR-13616
> URL: https://issues.apache.org/jira/browse/SOLR-13616
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-13616.test-incomplete.patch, 
> thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is 
> a possible deadlock condition when attempting to delete a collection while 
> recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but 
> it does not appear to be a test specific problem, and seems like it could 
> potentially affect anyone unlucky enough to issue a poorly timed DELETE.
> Details to follow in comments...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12368) in-place DV updates should no longer have to jump through hoops if field does not yet exist

2019-07-10 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882226#comment-16882226
 ] 

Hoss Man commented on SOLR-12368:
-

Hey [~munendrasn] - patch functionality looks good to me, but i'm confused by 
your last comment...

bq. As part of this issue, I will commit only solr changes and raise lucene 
issue for deprecating and removing IndexWriter#getFieldNames

...but in your latest patch, IndexWriter.getFieldNames (and the underlying 
FieldInfos method) are still being removed ... shouldn't those be moved to a 
new (linked) issue/patch so that the commit for _this_ issue can be trivially 
backported?


> in-place DV updates should no longer have to jump through hoops if field does 
> not yet exist
> ---
>
> Key: SOLR-12368
> URL: https://issues.apache.org/jira/browse/SOLR-12368
> Project: Solr
>  Issue Type: Improvement
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-12368.patch, SOLR-12368.patch, SOLR-12368.patch
>
>
> When SOLR-5944 first added "in-place" DocValue updates to Solr, one of the 
> edge cases that had to be dealt with was the limitation imposed by 
> IndexWriter that docValues could only be updated if they already existed - if 
> a shard did not yet have a document w/a value in the field where the update 
> was attempted, we would get an error.
> LUCENE-8316 seems to have removed this error, which i believe means we can 
> simplify & speed up some of the checks in Solr, and support this situation as 
> well, rather then falling back on full "read stored fields & reindex" atomic 
> update



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

2019-07-09 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13616:

Attachment: SOLR-13616.test-incomplete.patch
thetaphi_Lucene-Solr-master-Linux_24358.log.txt
Status: Open  (was: Open)

Unlike most tests that explicitly waitFor/assert *active* replicas, 
TestPolicyCloud (currently) has several tests that only assert the quantity and 
location of a replica – it doesn't wait for them to become active, so when 
testing an ADDREPLICA or a SPLITSHARD, those new replicas are still in recovery 
(or PrepRecovery) when the test tries to do cleanup and delete the collection – 
which frequently fails with timeout problems.

While we can certainly "improve" TestPolicyCloud to wait for recoveries to 
finish, and all replicas to be active before attempting to delete the 
collection, a better question is why this is needed?
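
(For example, the cleanup-side guard could look roughly like the sketch below, assuming a 
SolrCloudTestCase-style test where {{cluster}} is the MiniSolrCloudCluster; the collection 
name and replica counts are illustrative and this is not a proposed patch.)

{code:java}
// Hypothetical @After-style guard: wait until every replica the test created is ACTIVE
// before deleting collections, so PrepRecovery is never left waiting on a core that the
// DELETE has already closed.  Shard/replica counts are illustrative.
@After
public void waitThenCleanup() throws Exception {
  cluster.waitForActiveCollection("testCreateCollectionAddReplica", 1, 2); // 1 shard, 2 replicas
  cluster.deleteAllCollections();
}
{code}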

I'm attaching {{thetaphi_Lucene-Solr-master-Linux_24358.log.txt}} which 
demonstrates the problem in {{TestPolicyCloud.testCreateCollectionAddReplica}} 
here are some highlights...
{noformat}
# thetaphi_Lucene-Solr-master-Linux_24358.log.txt
#
# testCreateCollectionAddReplica

# bulk of test logic is finished, test has added a replica and confirmed it's 
on the expected node
# 
# but meanwhile, recovery is still ongoing...

  [junit4]   2> 959699 INFO  
(recoveryExecutor-5888-thread-1-processing-n:127.0.0.1:42097_solr 
x:testCreateCollectionAddReplica_shard1_replica_n3 
c:testCreateCollectionAddReplica s:shard1 r:core_node4) [n:127.0.0.1:42097_solr 
c:testCreateCollectionAddReplica s:shard1 r:core_node4 
x:testCreateCollectionAddReplica_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy 
Sending prep recovery command to [https://127.0.0.1:42097/solr]; [WaitForState: 
action=PREPRECOVERY=testCreateCollectionAddReplica_shard1_replica_n1=127.0.0.1:42097_solr=core_node4=recovering=true=true=true]
   [junit4]   2> 959701 INFO  (qtp531873617-17025) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp 
Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, 
onlyIfLeader: true, onlyIfLeaderActive: true
   [junit4]   2> 959701 INFO  (qtp531873617-17025) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp 
In WaitForState(recovering): collection=testCreateCollectionAddReplica, 
shard=shard1, thisCore=testCreateCollectionAddReplica_shard1_replica_n1, 
leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, 
currentState=down, localState=active, nodeName=127.0.0.1:42097_solr, 
coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{
   [junit4]   2>   "core":"testCreateCollectionAddReplica_shard1_replica_n3",
   [junit4]   2>   "base_url":"https://127.0.0.1:42097/solr;,
   [junit4]   2>   "state":"down",
   [junit4]   2>   "node_name":"127.0.0.1:42097_solr",
   [junit4]   2>   "type":"NRT"}
   ...

# the test thread moves on to @After method which calls 
MiniSolrCloudCluster.deleteAllCollections() ...
   ...
   [junit4]   2> 959703 INFO  (qtp531873617-17021) [n:127.0.0.1:42097_solr 
] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params 
name=testCreateCollectionAddReplica=DELETE=javabin=2 and 
sendToOCPQueue=true
   ...
   [junit4]   2> 959709 INFO  
(OverseerThreadFactory-5345-thread-5-processing-n:127.0.0.1:44991_solr) 
[n:127.0.0.1:44991_solr ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Executing Collection 
Cmd=action=UNLOAD=true=true=true,
 asyncId=null
   ...
   [junit4]   2> 959750 INFO  (qtp531873617-17325) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.c.SolrCore 
[testCreateCollectionAddReplica_shard1_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@39444e66
   ...
   [junit4]   2> 959753 INFO  (qtp531873617-17325) [n:127.0.0.1:42097_solr
x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/cores 
params={deleteInstanceDir=true=true=testCreateCollectionAddReplica_shard1_replica_n1=/admin/cores=true=UNLOAD=javabin=2}
 status=0 QTime=17


# but meanwhile, PrepRecoveryOp is currently blocked on a call to 
ZkStateReader.waitForState
# looking for specific conditions for the leader and (new) replica (that needs 
to recover)
# ... BUT!... the leader core has already been closed, so the watcher never 
succeeds,
# ...so PrepRecovery keeps waitForState ...

   [junit4]   2> 959855 WARN  (watches-5915-thread-1) [ ] 
o.a.s.c.c.ZkStateReader Error on calling watcher
   [junit4]   2>   => org.apache.solr.common.SolrException: core not 
found:testCreateCollectionAddReplica_shard1_replica_n1
   [junit4]   2>at 
org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83)
   [junit4]   2> 

[jira] [Created] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

2019-07-09 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13616:
---

 Summary: Possible racecondition/deadlock between collection DELETE 
and PrepRecovery ? (TestPolicyCloud failures)
 Key: SOLR-13616
 URL: https://issues.apache.org/jira/browse/SOLR-13616
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man



Based on some recent jenkins failures in TestPolicyCloud, I suspect there is a 
possible deadlock condition when attempting to delete a collection while 
recovery is in progress.

I haven't been able to identify exactly where/why/how the problem occurs, but 
it does not appear to be a test specific problem, and seems like it could 
potentially affect anyone unlucky enough to issue a poorly timed DELETE.

Details to follow in comments...




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

2019-07-08 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880481#comment-16880481
 ] 

Hoss Man commented on SOLR-13599:
-

this is the epitome of a heisenbug ... 

5 days ago i committed a change to master that adds a bit of extra logging to the 
test, and since then there hasn't been a single master failure -- but in the same 
amount of time, 7/10 of the 8x builds have failed, and all but one of those 
reproduced 3x (or more) times.

not sure what to do here except backport the logging changes to 8x, and hope we 
get another failure eventually so we'll have something to diagnose.


> ReplicationFactorTest high failure rate on Windows jenkins VMs after 
> 2019-06-22 OS/java upgrades
> 
>
> Key: SOLR-13599
> URL: https://issues.apache.org/jira/browse/SOLR-13599
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
>
>
> We've started seeing some weirdly consistent (but not reliably reproducible) 
> failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
> machines.
> The failures all seem to have started on June 22 -- when Uwe upgraded his 
> Windows VMs to upgrade the Java version, but happen across all versions of 
> java tested, and on both the master and branch_8x.
> While this test failed a total of 5 times, in different ways, on various 
> jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on 
> all but 1 or 2 of Uwe's "Windows" jenkins builds since that 2019-06-22, and 
> when it fails the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins 
> builds frequently fails anywhere from 1-4 additional times.
> All of these failures occur in the exact same place, with the exact same 
> assertion: that the expected replicationFactor of 2 was not achieved, and an 
> rf=1 (ie: only the master) was returned, when sending a _batch_ of documents 
> to a collection with 1 shard, 3 replicas; while 1 of the replicas was 
> partitioned off due to a closed proxy.
> In the handful of logs I've examined closely, the 2nd "live" replica does in 
> fact log that it received & processed the update, but with a QTime of over 30 
> seconds, and it then it immediately logs an 
> {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
> meanwhile, the leader has one {{updateExecutor}} thread logging copious 
> amounts of {{java.net.ConnectException: Connection refused: no further 
> information}} regarding the replica that was partitioned off, before a second 
> {{updateExecutor}} thread ultimately logs 
> {{java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
> replica.
> 
> What makes this perplexing is that this is not the first time in the test 
> that documents were added to this collection while one replica was 
> partitioned off, but it is the first time that all 3 of the following are 
> true _at the same time_:
> # the collection has recovered after some replicas were partitioned and 
> re-connected
> # a batch of multiple documents is being added
> # one replica has been "re" partitioned.
> ...prior to the point when this failure happens, only individual document 
> adds were tested while replicas where partitioned.  Batches of adds were only 
> tested when all 3 replicas were "live" after the proxies were re-opened and 
> the collection had fully recovered.  The failure also comes from the first 
> update to happen after a replica's proxy port has been "closed" for the 
> _second_ time.
> While this conflagration of events might conceivably trigger some weird bug, 
> what makes these failures _particularly_ perplexing is that:
> * the failures only happen on Windows
> * the failures only started after the Windows VM update on June-22.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13532) Unable to start core recovery due to timeout in ping request

2019-07-03 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13532:

Status: Patch Available  (was: Open)

> Unable to start core recovery due to timeout in ping request
> 
>
> Key: SOLR-13532
> URL: https://issues.apache.org/jira/browse/SOLR-13532
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 7.6
>Reporter: Suril Shah
>Priority: Major
> Attachments: SOLR-13532.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Discovered following issue with the core recovery:
>  * Core recovery is not being initialized and throwing following exception 
> message :
> {code:java}
> 2019-06-07 00:53:12.436 INFO  
> (recoveryExecutor-4-thread-1-processing-n::8983_solr 
> x:_shard41_replica_n2777 c: s:shard41 
> r:core_node2778) x:_shard41_replica_n2777 
> o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr 
> on recovery, try again{code}
>  * Above error occurs when ping request takes time more than a timeout period 
> which is hard-coded to one second in the Solr source code. However, in a general 
> production setting it is common to have ping times of more than one second; 
> hence, the core recovery never starts and the exception is thrown.
>  * Also the other major concern is that this exception is logged as an info 
> message, hence it is very difficult to identify the error if info logging is 
> not enabled.
>  * Please refer to following code snippet from the [source 
> code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
>  to understand the above issue.
> {code:java}
>   try (HttpSolrClient httpSolrClient = new 
> HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>   .withSocketTimeout(1000)
>   .withConnectionTimeout(1000)
>   
> .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>   .build()) {
> SolrPingResponse resp = httpSolrClient.ping();
> return leaderReplica;
>   } catch (IOException e) {
> log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
> Thread.sleep(500);
>   } catch (Exception e) {
> if (e.getCause() instanceof IOException) {
>   log.info("Failed to connect leader {} on recovery, try again", 
> leaderReplica.getBaseUrl());
>   Thread.sleep(500);
> } else {
>   return leaderReplica;
> }
>   }
> {code}
> The above issue will have high impact in production level clusters, since 
> cores not being able to recover may lead to data loss.
> Following improvements would be really helpful:
>  1. The [timeout for ping 
> request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
>  in *RecoveryStrategy.java* should be configurable and the defaults set to 
> high values like 15 seconds.
>  2. The exception message in [line 
> 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
>  and [line 
> 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
>  in *RecoveryStrategy.java* should be logged as *error* messages instead of 
> *info* messages



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13532) Unable to start core recovery due to timeout in ping request

2019-07-03 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13532:

Attachment: SOLR-13532.patch
Status: Open  (was: Open)

bq. The other alternative to this would be to update the {{RecoveryStrategy}} 
code to use something like {{cc.getConfig().getUpdateShardHandlerConfig()}} ...

Here's a variant of Suril's patch along those lines, with some refactoring to 
put the logic into a helper method.

I don't love it -- but i don't hate it either.

I'm still running tests to make sure i didn't break anything, but in the 
meantime what do folks think? ... can anyone see any problems with this 
approach?

([~surilshah]: does this patch -- and the usage of the solr.xml configured 
values instead of hardcoded magic constants -- solve the problems you're seeing?)
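
(For anyone skimming, the shape of the change is roughly the sketch below. The class and 
method names are made up, and it assumes the {{UpdateShardHandlerConfig}} distributed-update 
timeout getters are the right substitutes for the hard-coded 1000ms values; the attached 
patch is what actually counts.)

{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.update.UpdateShardHandlerConfig;

class RecoveryPingClientSketch {
  // Hypothetical helper: build the leader-ping client from the solr.xml-configured
  // distributed update timeouts instead of hard-coded 1000ms values.
  static HttpSolrClient pingClient(CoreContainer cc, Replica leaderReplica) {
    UpdateShardHandlerConfig cfg = cc.getConfig().getUpdateShardHandlerConfig();
    return new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
        .withSocketTimeout(cfg.getDistributedSocketTimeout())
        .withConnectionTimeout(cfg.getDistributedConnectionTimeout())
        .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
        .build();
  }
}
{code}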

> Unable to start core recovery due to timeout in ping request
> 
>
> Key: SOLR-13532
> URL: https://issues.apache.org/jira/browse/SOLR-13532
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 7.6
>Reporter: Suril Shah
>Priority: Major
> Attachments: SOLR-13532.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Discovered following issue with the core recovery:
>  * Core recovery is not being initialized and throwing following exception 
> message :
> {code:java}
> 2019-06-07 00:53:12.436 INFO  
> (recoveryExecutor-4-thread-1-processing-n::8983_solr 
> x:_shard41_replica_n2777 c: s:shard41 
> r:core_node2778) x:_shard41_replica_n2777 
> o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr 
> on recovery, try again{code}
>  * The above error occurs when the ping request takes longer than a timeout 
> period which is hard-coded to one second in the Solr source code. However, in 
> a general production setting it is common for ping times to exceed one second; 
> hence, core recovery never starts and the exception is thrown.
>  * The other major concern is that this exception is logged as an info 
> message, so it is very difficult to identify the error if info logging is 
> not enabled.
>  * Please refer to the following code snippet from the [source 
> code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
>  to understand the above issue.
> {code:java}
>   try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>       .withSocketTimeout(1000)
>       .withConnectionTimeout(1000)
>       .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>       .build()) {
>     SolrPingResponse resp = httpSolrClient.ping();
>     return leaderReplica;
>   } catch (IOException e) {
>     log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>     Thread.sleep(500);
>   } catch (Exception e) {
>     if (e.getCause() instanceof IOException) {
>       log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>       Thread.sleep(500);
>     } else {
>       return leaderReplica;
>     }
>   }
> {code}
> The above issue will have a high impact in production-level clusters, since 
> cores not being able to recover may lead to data loss.
> The following improvements would be really helpful:
>  1. The [timeout for ping 
> request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
>  in *RecoveryStrategy.java* should be configurable, with the default set to a 
> higher value such as 15 seconds.
>  2. The exception messages at [line 
> 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
>  and [line 
> 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
>  in *RecoveryStrategy.java* should be logged as *error* messages instead of 
> *info* messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13457) Managing Timeout values in Solr

2019-07-03 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878183#comment-16878183
 ] 

Hoss Man commented on SOLR-13457:
-

SOLR-13605 shows some more of the madness involved in how these settings are 
borked -- even if you just focus on the SolrJ APIs for specifying things 
(notably {{HttpSolrClient.Builder.withHttpClient}}) w/o even considering how 
*solr* should use those SolrJ APIs based on things like {{solr.xml}}.

> Managing Timeout values in Solr
> ---
>
> Key: SOLR-13457
> URL: https://issues.apache.org/jira/browse/SOLR-13457
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: master (9.0)
>Reporter: Gus Heck
>Priority: Major
>
> Presently, Solr has a variety of timeouts for various connections or 
> operations. These timeouts have been added, tweaked, refined, and in some 
> cases made configurable in an ad-hoc manner by the contributors of individual 
> features throughout the history of the project. This is all well and good 
> until one experiences a timeout during an otherwise valid use case and needs 
> to adjust it.
> This has also made managing timeouts in unit tests "interesting" as noted in 
> SOLR-13389.
> Probably nobody has the spare time to do a tour de force through the code and 
> coordinate every single timeout, so in this ticket I'd like to establish a 
> framework for categorizing time outs, a standard for how we make each 
> category configurable, and then add sub-tickets to address individual 
> timeouts.
> The intention is that eventually, there will be no "magic number" timeout 
> values in code, and one can predict where to find the configuration for a 
> timeout by determining its category.
> Initial strawman categories (feel free to knock down or suggest alternatives):
>  # *Feature-Instance Timeout*: Timeouts that relate to a particular 
> instantiation of a feature, for example a database connection timeout for a 
> connection to a particular database by DIH. These should be set in the 
> configuration of that instance.
>  # *Optional Feature Timeout*: A timeout that only has meaning in the context 
> of a particular feature that is not required for Solr to function... i.e. 
> something that can be turned on or off. Perhaps a timeout for communication 
> with an external LDAP server for authentication purposes. These should be 
> configured in the same configuration that enables this feature.
>  # *Global System Timeout*: A timeout that will always be an active part of 
> Solr. These should be configured in a new section of solr.xml. For 
> example, the Jetty thread idle timeout, or the default timeout for HTTP calls 
> between nodes.
>  # *Node Specific Timeout*: A timeout which may differ on different nodes. I 
> don't know of any of these, but I'll grant the possibility. These (and only 
> these) should be set by setting system properties. If we don't have any of 
> these, that's just fine :).
>  # *Client Timeout*: These are timeouts in SolrJ code that are active in code 
> running outside the server. They should be configurable via the Java API, and 
> via a config file of some sort from a single location defined in a sysprop or 
> sourced from the classpath (in that order). When run on the server, the SolrJ 
> code should look for a *Global System Timeout* setting before consulting 
> sysprops or the classpath.
> *Note that in no case is a hard-coded value the correct solution.*
> If we get a consensus on categories and their locations, then the next step 
> is to begin adding sub tickets to bring specific timeouts into compliance. 
> Every such ticket should include an update to the section of the ref guide 
> documenting the configuration to which the timeout has been added (e.g. docs 
> for solr.xml for Global System Timeouts) describing what exactly is affected 
> by the timeout, the maximum allowed value and how zero and negative numbers 
> are handled.
> It is of course true that some of these values will have the potential to 
> destroy system performance or integrity, and that should be mentioned in the 
> update to documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13532) Unable to start core recovery due to timeout in ping request

2019-07-03 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878180#comment-16878180
 ] 

Hoss Man commented on SOLR-13532:
-

My first impression on seeing this patch was that I _really_ dislike the idea 
of "fixing" a hardcoded timeout by changing it to a _different_ hardcoded 
timeout – I would much rather we use the existing {{solr.xml}}-configured 
timeouts for this sort of thing.

So then I went poking around the code to refresh my memory about how/where the 
SO & CONNECT timeout config options for intranode requests get populated, 
intending to propose an alternative patch that uses them, and realized that we 
already have an {{UpdateShardHandler.getRecoveryOnlyHttpClient()}} method that 
returns an HttpClient pre-configured with the correct timeout values ... and 
then I realized that this is already used in the code in question via 
{{withHttpClient(...)}}...
{code:java}
  // existing, pre-patch, code in RecoveryStrategy
  try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
      .withSocketTimeout(1000)
      .withConnectionTimeout(1000)
      .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
{code}
This {{UpdateShardHandler.getRecoveryOnlyHttpClient()}} concept, and the 
corresponding {{withHttpClient()}} call, was introduced *after* the original 
recovery code was written (with those hardcoded timeouts) ... In theory, if we 
just remove the {{withSocketTimeout}} and {{withConnectionTimeout}} calls 
completely from this class, then the cluster's {{solr.xml}} configuration 
options should start getting used.

But then I dug deeper and discovered that the way HttpSolrClient & its Builder 
work is really silly and frustrating: it causes the hardcoded values 
{{SolrClientBuilder.connectionTimeoutMillis = 15000}} and 
{{SolrClientBuilder.socketTimeoutMillis = 120000}} to get used at the request 
level, even when {{withHttpClient}} has been called to set an {{HttpClient}} 
that already has the settings we want ... basically defeating a huge part of 
the value in {{withHttpClient}} ... even using values of {{null}} or {{-1}} 
won't work, because of other nonsensical ways that "default" values come into 
play.

I created SOLR-13605 to track the silliness in {{HttpSolrClient.Builder}} – it's a 
bigger issue than just fixing this ping/recovery problem, and will require more 
careful consideration.

As much as it pains me to say this: I think that for now, for the purpose of 
fixing the bug in this jira, we should just remove the {{withSocketTimeout()}} 
and {{withConnectionTimeout()}} calls completely, and defer to the 
(pre-existing) hardcoded defaults in {{SolrClientBuilder}} ... at least that 
way we're reducing the number of hardcoded defaults in the code, and if/when 
SOLR-13605 gets fixed, the {{solr.xml}} settings should take effect.
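
For concreteness, a minimal sketch of what that removal would look like (same names as the 
existing snippet above; a sketch, not a patch):
{code:java}
// Sketch: drop the hardcoded 1000ms builder timeouts so the SolrClientBuilder
// defaults (and, once SOLR-13605 is fixed, the solr.xml settings) apply.
try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
    .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
    .build()) {
  SolrPingResponse resp = httpSolrClient.ping();
  return leaderReplica;
} catch (IOException e) {
  // the existing retry handling stays exactly as it is today
  log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
  Thread.sleep(500);
}
{code}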

The other alternative to this would be to update the {{RecoveryStrategy}} code 
to use something like {{cc.getConfig().getUpdateShardHandlerConfig()}} and then 
use {{UpdateShardHandlerConfig.getDistributedSocketTimeout()}} and 
{{UpdateShardHandlerConfig.getDistributedConnectionTimeout()}} to pass as the 
inputs to {{HttpSolrClient.Builder}} ... that seemed really silly and redundant 
when it first occurred to me, but the more I think about it, the more it's 
probably not that bad as a workaround for SOLR-13605 until it's fixed.

What do folks think?

> Unable to start core recovery due to timeout in ping request
> 
>
> Key: SOLR-13532
> URL: https://issues.apache.org/jira/browse/SOLR-13532
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 7.6
>Reporter: Suril Shah
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Discovered the following issue with core recovery:
>  * Core recovery is not being initialized and throws the following exception 
> message :
> {code:java}
> 2019-06-07 00:53:12.436 INFO  
> (recoveryExecutor-4-thread-1-processing-n::8983_solr 
> x:_shard41_replica_n2777 c: s:shard41 
> r:core_node2778) x:_shard41_replica_n2777 
> o.a.s.c.RecoveryStrategy Failed to connect leader http://:8983/solr 
> on recovery, try again{code}
>  * The above error occurs when the ping request takes longer than a timeout period 
> which is hard-coded to one second in the Solr source code. However, in a general 
> production setting it is common for ping times to exceed one second; 
> hence, core recovery never starts and the exception is thrown.
>  * The other major concern is that this exception is logged as an info 
> message, so it is very difficult to identify the error if info logging is 
> not enabled.
>  * Please refer to the following code snippet from the [source 
> 

[jira] [Created] (SOLR-13605) HttpSolrClient.Builder.withHttpClient() is useless for the purpose of setting client scoped so/connect timeouts

2019-07-03 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13605:
---

 Summary: HttpSolrClient.Builder.withHttpClient() is useless for 
the purpose of setting client scoped so/connect timeouts
 Key: SOLR-13605
 URL: https://issues.apache.org/jira/browse/SOLR-13605
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man


TL;DR: trying to use {{HttpSolrClient.Builder.withHttpClient}} is useless for 
the purpose of specifying an {{HttpClient}} with the default "timeouts" you 
want to use on all requests, because of how {{HttpSolrClient.Builder}} and 
{{HttpClientUtil.createDefaultRequestConfigBuilder()}} hardcode values that get 
set on every {{HttpRequest}}.

This internally affects code that uses things like 
{{UpdateShardHandler.getDefaultHttpClient()}}, 
{{UpdateShardHandler.getUpdateOnlyHttpClient()}}, 
{{UpdateShardHandler.getRecoveryOnlyHttpClient()}}, etc...

While looking into the patch in SOLR-13532, I realized that because of the way 
{{HttpSolrClient.Builder}} and its superclass {{SolrClientBuilder}} work, the 
following code doesn't do what a reasonable person would expect...
{code:java}
SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, "12345",
                                 HttpClientUtil.PROP_CONNECTION_TIMEOUT, "67890");
HttpClient httpClient = HttpClientUtil.createClient(clientParams);
HttpSolrClient solrClient = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL)
    .withHttpClient(httpClient)
    .build();
{code}
When {{solrClient}} is used to execute a request, neither of the properties 
passed to {{HttpClientUtil.createClient(...)}} will matter - the 
{{HttpSolrClient.Builder}} (via inheritance from {{SolrClientBuilder}}) has the 
following hardcoded values...
{code:java}
  // SolrClientBuilder
  protected Integer connectionTimeoutMillis = 15000;
  protected Integer socketTimeoutMillis = 120000;
{code}
...which, unless overridden by calls to {{withConnectionTimeout()}} and 
{{withSocketTimeout()}}, will get set on the {{HttpSolrClient}} object and used 
on every request...
{code:java}
// protected HttpSolrClient constructor
this.connectionTimeout = builder.connectionTimeoutMillis;
this.soTimeout = builder.socketTimeoutMillis;
{code}
It would be tempting to try and do something like this to work around the 
problem...
{code:java}
SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, "12345",
                                 HttpClientUtil.PROP_CONNECTION_TIMEOUT, "67890");
HttpClient httpClient = HttpClientUtil.createClient(clientParams);
HttpSolrClient solrClient = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL)
    .withHttpClient(httpClient)
    .withSocketTimeout(null)
    .withConnectionTimeout(null)
    .build();
{code}
...except for 2 problems:
 # In {{HttpSolrClient.executeMethod}}, if the values of 
{{this.connectionTimeout}} or {{this.soTimeout}} are null, then the values from 
{{HttpClientUtil.createDefaultRequestConfigBuilder()}} get used, which has 
its own hardcoded defaults.
 # {{withSocketTimeout}} and {{withConnectionTimeout}} take an int, not a 
(nullable) Integer.

So then maybe something like this would work? - particularly since at the 
{{HttpClient}} / {{HttpRequest}} / {{RequestConfig}} level, a "-1" set on the 
{{HttpRequest}}'s {{RequestConfig}} is supposed to mean "use the (client) 
default" ...
{code:java}
SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, "12345",
                                 HttpClientUtil.PROP_CONNECTION_TIMEOUT, "67890");
HttpClient httpClient = HttpClientUtil.createClient(clientParams);
HttpSolrClient client = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL)
    .withHttpClient(httpClient)
    .withSocketTimeout(-1)
    .withConnectionTimeout(-1)
    .build();
{code}
...except that if we do *that* we get an IllegalArgumentException...
{code:java}
  // SolrClientBuilder
  public B withConnectionTimeout(int connectionTimeoutMillis) {
    if (connectionTimeoutMillis < 0) {
      throw new IllegalArgumentException("connectionTimeoutMillis must be a non-negative integer.");
    }
{code}
This is madness, and eliminates most/all of the known value of using 
{{.withHttpClient}}.
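
In the meantime, the only way I can see to get the intended values actually applied 
per-request is to repeat them on the Builder itself -- a workaround sketch (reusing the 
same hypothetical values and constants as above), not a fix:
{code:java}
// Workaround sketch: duplicate the timeouts on the Builder so the per-request
// RequestConfig matches what the HttpClient was created with.
final int soTimeoutMillis = 12345;        // example values from above
final int connectTimeoutMillis = 67890;
SolrParams clientParams = params(HttpClientUtil.PROP_SO_TIMEOUT, String.valueOf(soTimeoutMillis),
                                 HttpClientUtil.PROP_CONNECTION_TIMEOUT, String.valueOf(connectTimeoutMillis));
HttpClient httpClient = HttpClientUtil.createClient(clientParams);
HttpSolrClient solrClient = new HttpSolrClient.Builder(ANY_BASE_SOLR_URL)
    .withHttpClient(httpClient)
    .withSocketTimeout(soTimeoutMillis)          // must be repeated, or the Builder default wins
    .withConnectionTimeout(connectTimeoutMillis) // ditto
    .build();
{code}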



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

2019-07-02 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13599:

Attachment: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
Status: Open  (was: Open)


Details of Uwe's jenkins updates...

* 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201906.mbox/%3C00b301d52918$d27b2f60$77718e20$@thetaphi.de%3E
* 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/%3C01a901d530a7$fac9d2a0$f05d77e0$@thetaphi.de%3E
* 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/raw/%3C01a901d530a7$fac9d2a0$f05d77e0$@thetaphi.de%3E/4



I'm attaching thetaphi_Lucene-Solr-master-Windows_8025.log.txt as an 
illustrative example of the failure; here are some key snippets and the 
associated lines from the test class...


{noformat}

# Previously: test individual adds, delById, and delByQ using...
#  ... rf=3 with all replicas connected,
#  ... rf=2 when one replica's proxy is closed,
#  ... rf=1 when both replica's proxies are closed

# Lines # 314-320 - "heal" the cluster (re-enable all proxies)

...
   [junit4]   2> 555732 INFO  
(TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [ ] 
o.a.s.c.AbstractFullDistribZkTestBase Found 3 replicas and leader on 
127.0.0.1:59004_ for shard1 in repfacttest_c8n_1x3
   [junit4]   2> 555732 INFO  
(TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [ ] 
o.a.s.c.AbstractFullDistribZkTestBase Took 7107.0 ms to see all replicas become 
active.
...


# Lines # 322-326 - checks that (individual) add, delById & delByQ all get rf=3

# Lines # 328-341 - checks that (batched) add, delById & delByQ all get rf=3

# Line #  344 - close a proxy port (59108) again ...

   [junit4]   2> 556060 WARN  
(TEST-ReplicationFactorTest.test-seed#[C415B4F186C6C69D]) [ ] 
o.a.s.c.s.c.SocketProxy Closing 1 connections to: http://127.0.0.1:59108/, 
target: http://127.0.0.1:59109/
{noformat}

At this point, the next thing in the test is to add a batch of documents 
(ids#15-29) while one replica is partitioned -- but I should point out that 
it's not immediately obvious to me if the {{updateExecutor-1924-thread-4}} 
logging from the leader below (complaining about {{Connection refused:}} to 
port 59108) is *because* of the update sent by the client, or independently 
because of the HTTP2 connection management detecting that the proxy was 
closed...

{noformat}
# Lines # 346-355 - send our first "batch" (id#15-29) when cluster isn't 
"healed"

   [junit4]   2> 558074 ERROR 
(updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2
 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) 
[n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 
x:repfacttest_c8n_1x3_shard1_replica_n2 ] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/
   [junit4]   2>   => java.io.IOException: java.net.ConnectException: 
Connection refused: no further information
...

# ...there are more details about suppressed exceptions
# ...this ERROR repeats many times - evidently as the leader tries to 
reconnect...

...
   [junit4]   2> 560193 ERROR 
(updateExecutor-1924-thread-4-processing-x:repfacttest_c8n_1x3_shard1_replica_n2
 r:core_node5 null n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1) 
[n:127.0.0.1:59004_ c:repfacttest_c8n_1x3 s:shard1 r:core_node5 
x:repfacttest_c8n_1x3_shard1_replica_n2 ] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/ to 
http://127.0.0.1:59108/repfacttest_c8n_1x3_shard1_replica_n3/
   [junit4]   2>   => java.io.IOException: java.net.ConnectException: 
Connection refused: no further information
...

# ... brief bit of path=/admin/metrics logging from both n:127.0.0.1:59004_ and 
n:127.0.0.1:59084_
# ... and some other MetricsHistoryHandler logging (from overseer?) about 
failing to talk to 127.0.0.1:59108
# ... but mostly lots of logging from the leader about not being able to 
connect to 127.0.0.1:59108



# live replica (port 59060) logs that it's added the 15 docs FROMLEADER, ... 
BUT...
# ... same thread then logs jetty EofException: Reset cancel_stream_error
# ... so apparently it added the docs but had a problem communicating that back 
to the leader
# ... evidently because it took 30 seconds (QTime = 30013) and the leader gave up 
(see below)

   [junit4]   2> 591364 INFO  (qtp1520091886-5884) [n:127.0.0.1:59060_ 
c:repfacttest_c8n_1x3 s:shard1 r:core_node4 
x:repfacttest_c8n_1x3_shard1_replica_n1 ] o.a.s.u.p.LogUpdateProcessorFactory 
[repfacttest_c8n_1x3_shard1_replica_n1]  webapp= path=/update 

[jira] [Updated] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

2019-07-02 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13599:

Description: 
We've started seeing some weirdly consistent (but not reliably reproducible) 
failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
machines.

The failures all seem to have started on June 22 -- when Uwe upgraded his 
Windows VMs to a newer Java version -- but happen across all versions of Java 
tested, and on both master and branch_8x.

While this test failed a total of 5 times, in different ways, on various 
jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all 
but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it 
fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds 
frequently fails anywhere from 1-4 additional times.

All of these failures occur in the exact same place, with the exact same 
assertion: that the expected replicationFactor of 2 was not achieved, and an 
rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to 
a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned 
off due to a closed proxy.

In the handful of logs I've examined closely, the 2nd "live" replica does in 
fact log that it received & processed the update, but with a QTime of over 30 
seconds, and then it immediately logs an 
{{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts 
of {{java.net.ConnectException: Connection refused: no further information}} 
regarding the replica that was partitioned off, before a second 
{{updateExecutor}} thread ultimately logs 
{{java.util.concurrent.ExecutionException: 
java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
replica.




What makes this perplexing is that this is not the first time in the test that 
documents were added to this collection while one replica was partitioned off, 
but it is the first time that all 3 of the following are true _at the same 
time_:

# the collection has recovered after some replicas were partitioned and 
re-connected
# a batch of multiple documents is being added
# one replica has been "re" partitioned.

...prior to the point when this failure happens, only individual document adds 
were tested while replicas were partitioned.  Batches of adds were only tested 
when all 3 replicas were "live" after the proxies were re-opened and the 
collection had fully recovered.  The failure also comes from the first update 
to happen after a replica's proxy port has been "closed" for the _second_ time.

While this combination of events might conceivably trigger some weird bug, 
what makes these failures _particularly_ perplexing is that:
* the failures only happen on Windows
* the failures only started after the Windows VM update on June-22.



  was:

We've started seeing some weirdly consistent (but not reliably reproducible) 
failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
machines.

The failures all seem to have started on June 22 -- when Uwe upgraded his 
Windows VMs to a newer Java version -- but happen across all versions of Java 
tested, and on both master and branch_8x.

While this test failed a total of 5 times, in different ways, on various 
jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all 
but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it 
fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds 
frequently fails anywhere from 1-4 additional times.

All of these failures occur in the exact same place, with the exact same 
assertion: that the expected replicationFactor of 2 was not achieved, and an 
rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to 
a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned 
off due to a closed proxy.

In the handful of logs I've examined closely, the 2nd "live" replica does in 
fact log that it received & processed the update, but with a QTime of over 30 
seconds, and then it immediately logs an 
{{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts 
of {{java.net.ConnectException: Connection refused: no further information}} 
regarding the replica that was partitioned off, before a second 
{{updateExecutor}} thread ultimately logs 
{{java.util.concurrent.ExecutionException: 
java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
replica.


> ReplicationFactorTest high failure rate on Windows jenkins VMs after 
> 2019-06-22 OS/java upgrades
> 
>
>

[jira] [Created] (SOLR-13599) ReplicationFactorTest high failure rate on Windows jenkins VMs after 2019-06-22 OS/java upgrades

2019-07-02 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13599:
---

 Summary: ReplicationFactorTest high failure rate on Windows 
jenkins VMs after 2019-06-22 OS/java upgrades
 Key: SOLR-13599
 URL: https://issues.apache.org/jira/browse/SOLR-13599
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man



We've started seeing some weirdly consistent (but not reliably reproducible) 
failures from ReplicationFactorTest when running on Uwe's Windows jenkins 
machines.

The failures all seem to have started on June 22 -- when Uwe upgraded his 
Windows VMs to a newer Java version -- but happen across all versions of Java 
tested, and on both master and branch_8x.

While this test failed a total of 5 times, in different ways, on various 
jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all 
but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it 
fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds 
frequently fails anywhere from 1-4 additional times.

All of these failures occur in the exact same place, with the exact same 
assertion: that the expected replicationFactor of 2 was not achieved, and an 
rf=1 (ie: only the master) was returned, when sending a _batch_ of documents to 
a collection with 1 shard, 3 replicas; while 1 of the replicas was partitioned 
off due to a closed proxy.

In the handful of logs I've examined closely, the 2nd "live" replica does in 
fact log that it received & processed the update, but with a QTime of over 30 
seconds, and then it immediately logs an 
{{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception -- 
meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts 
of {{java.net.ConnectException: Connection refused: no further information}} 
regarding the replica that was partitioned off, before a second 
{{updateExecutor}} thread ultimately logs 
{{java.util.concurrent.ExecutionException: 
java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live" 
replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-12988) Known OpenJDK >= 11 SSL (TLSv1.3) bugs can cause problems with Solr

2019-07-01 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-12988.
-
Resolution: Workaround

With the jenkins servers upgraded, and the new SSLTestConfig assumptions in 
place, I haven't seen any (obvious) signs of any other OpenJDK-related SSL bugs 
in the Solr tests ... if more are identified we can update the issue 
description to list them here.

I've also created SOLR-13594 to track the (eventual) need to enable SSL testing 
on java-13-ea once the known bugs are addressed (but fortunately, the way the 
suppression logic is implemented, it explicitly checks for "ea" builds ... so 
even if we never get a chance to proactively test on future java-13-ea builds, 
once java-13 final comes out, the tests _will_ try SSL on them automatically).
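
(For anyone curious, the suppression boils down to an assumption check along these lines -- 
a simplified sketch with an assumed version check, not the actual SSLTestConfig code:)
{code:java}
// Simplified sketch (not the real SSLTestConfig logic): skip SSL randomization on
// JVMs known to have broken TLSv1.3 by failing a test assumption instead of the test.
import org.junit.AssumptionViolatedException;

public class SslJvmAssumption {
  static void assumeSslUsableOnThisJvm() {
    final String version = System.getProperty("java.version", "");
    // hypothetical check: treat any java-13-ea build as suspect until SOLR-13594 narrows it
    if (version.startsWith("13") && version.contains("ea")) {
      throw new AssumptionViolatedException(
          "SSL testing suppressed on this JVM, see SOLR-12988 / SOLR-13594");
    }
  }
}
{code}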

> Known OpenJDK >= 11 SSL (TLSv1.3) bugs can cause problems with Solr
> ---
>
> Key: SOLR-12988
> URL: https://issues.apache.org/jira/browse/SOLR-12988
> Project: Solr
>  Issue Type: Test
>Reporter: Hoss Man
>Assignee: Cao Manh Dat
>Priority: Major
>  Labels: Java11, Java12, Java13
> Attachments: SOLR-12988.patch, SOLR-12988.patch, SOLR-12988.patch, 
> SOLR-13413.patch
>
>
> There are several OpenJDK JVM bugs (beginning with Java11, when TLS v1.3 
> support was first added) that are known to affect Solr's SSL support and 
> have caused numerous test failures -- notably on early "testing" builds of 
> OpenJDK 11, 12, & 13, as well as the officially released OpenJDK 11, 11.0.1, 
> and 11.0.2.
> From the standpoint of the Solr project, there is very little we can do to 
> mitigate these bugs; we have taken steps to ensure any code using our 
> {{SSLTestConfig}} / {{RandomizeSSL}} test-framework classes will be "SKIPped" 
> with an {{AssumptionViolatedException}} when used on JVMs that are known to 
> be problematic.
> Users who encounter any of the types of failures described below, or 
> developers who encounter test runs that "SKIP" with a message referring to 
> this issue ID, are encouraged to upgrade their JVM (or, as a last resort, try 
> disabling "TLSv1.3" in your JVM security properties).
> 
> Examples of known bugs as they have manifested in Solr tests...
> * https://bugs.openjdk.java.net/browse/JDK-8212885
> ** "TLS 1.3 resumed session does not retain peer certificate chain"
> ** affects users with {{checkPeerNames=true}} in your SSL configuration
> ** causes 100% failure rate in Solr's 
> {{TestMiniSolrCloudClusterSSL.testSslWithCheckPeerName}}
> ** can result in exceptions for SolrJ users, or in solr cloud server logs 
> when making intra-node requests, with a root cause of 
> "javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated"
> ** {noformat}
>[junit4]   2> Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer 
> not authenticated
>[junit4]   2>  at 
> java.base/sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:526)
>[junit4]   2>  at 
> org.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:464)
>[junit4]   2>  at 
> org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:397)
>[junit4]   2>  at 
> org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
>[junit4]   2>  at 
> org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
>[junit4]   2>  at 
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
>[junit4]   2>  at 
> org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
>[junit4]   2>  at 
> org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
>[junit4]   2>  at 
> org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
>[junit4]   2>  at 
> org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
>[junit4]   2>  at 
> org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
>[junit4]   2>  at 
> org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>[junit4]   2>  at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>[junit4]   2>  at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>[junit4]   2>  at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
> {noformat}
> * https://bugs.openjdk.java.net/browse/JDK-8213202
> ** "Possible race condition in TLS 1.3 session resumption"
> ** May 

[jira] [Created] (SOLR-13594) re-enable j13-ea SSL testing once known bugs are fixed

2019-07-01 Thread Hoss Man (JIRA)
Hoss Man created SOLR-13594:
---

 Summary: re-enable j13-ea SSL testing once known bugs are fixed
 Key: SOLR-13594
 URL: https://issues.apache.org/jira/browse/SOLR-13594
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man


SOLR-12988 tracks several known bugs affecting SSL usage in OpenJDK java-13-ea 
builds.

At the moment, SSLTestConfig explicitly throws AssumptionViolatedException if 
it looks like tests are being run on _any_ java-13-ea build ... once the known 
bugs are addressed in OpenJDK, and new java-13-ea builds are released that 
address those bugs, we should patch SSLTestConfig to test those new EA builds, 
and, assuming no other obvious bugs are identified:
* update the logic in SSLTestConfig to suppress SSL testing only on the 
java-13-ea build#s known to be problematic (see the sketch below)
* update the jenkins JVMs to use the new EA builds
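
A purely illustrative sketch of what narrowing the suppression to specific EA builds might 
look like -- the sysprop, pattern, and cut-off build number are all assumptions, not actual 
SSLTestConfig code:
{code:java}
// Hypothetical sketch only: assumes the full build string (e.g. "13-ea+26") is
// available via the java.runtime.version system property.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Java13EaBuildCheck {
  private static final Pattern EA_BUILD = Pattern.compile("^13-ea\\+(\\d+).*");

  /** true if this looks like a java-13-ea build at or below some known-bad build number */
  static boolean isKnownBadEaBuild(String runtimeVersion, int lastKnownBadBuild) {
    Matcher m = EA_BUILD.matcher(runtimeVersion);
    return m.matches() && Integer.parseInt(m.group(1)) <= lastKnownBadBuild;
  }

  public static void main(String[] args) {
    // "26" is just an example cut-off, not a statement about which builds are actually fixed
    System.out.println(isKnownBadEaBuild(System.getProperty("java.runtime.version", ""), 26));
  }
}
{code}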



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13580) java 13 changes to locale specific Numeric parsing rules affect ParseNumeric UpdateProcessors when using 'local' config option - notably affects French

2019-06-28 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-13580:

Description: 
Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has 
updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the definition 
of language & locale specific formatting characters – in a non-backwards 
compatible way due to "French" changes in [CLDR 
34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes]

This impacts people who use any of the "ParseNumeric" UpdateProcessors in 
conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the 
(pre-java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a 
"grouping" character (ie: between thousands and millions, between millions and 
billions, etc...). Starting with java13 the JVM expects U+202F (NARROW NO BREAK 
SPACE) in its place.

Notably: upgrading to jdk13-ea+26 caused failures in Solr's 
ParsingFieldUpdateProcessorsTest, which initially had hardcoded test data 
that used U+00A0. ParsingFieldUpdateProcessorsTest has since been updated to 
account for this discrepancy by modifying the test data used to determine the 
"expected" character for the current JVM, but there is nothing Solr or the 
ParseNumeric UpdateProcessors can do to help mitigate this change in behavior 
for end users who upgrade to java13.

Affected users with U+00A0 characters in their incoming SolrInputDocuments will 
see the ParseNumeric UpdateProcessors (configured with locale=fr...) "skip" 
these values as unparsable, most likely resulting in a failure to index into a 
numeric field since the original "String" value will be left as is.

Affected users may want to consider updating their configs to include a 
{{RegexReplaceProcessorFactory}} configured to strip out all whitespace 
characters, prior to any ParseNumeric update processors configured to expect 
French-language numbers.
  

  was:
Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has 
updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the definition 
of language & locale specific formatting characters – in a non-backwards 
compatible way due to "French" changes in [CLDR 
34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes]

This impacts people who use any of the "ParseNumeric" UpdateProcessors in 
conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the 
(pre-java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a 
"grouping" character (ie: between thousands and millions, between millions and 
billions, etc...). Starting with java13 the JVM expects U+202F (NARROW NO BREAK 
SPACE) in its place.

Notably: upgrading to jdk13-ea+26 caused failures in Solr's 
ParsingFieldUpdateProcessorsTest, which initially had hardcoded test data 
that used U+00A0. ParsingFieldUpdateProcessorsTest has since been updated to 
account for this discrepancy by modifying the test data used to determine the 
"expected" character for the current JVM, but there is nothing Solr or the 
ParseNumeric UpdateProcessors can do to help mitigate this change in behavior 
for end users who upgrade to java13.

Affected users with U+00A0 characters in their incoming SolrInputDocuments will 
see the ParseNumeric UpdateProcessors (configured with locale=fr...) "skip" 
these values as unparsable, most likely resulting in a failure to index into a 
numeric field since the original "String" value will be left as is.
  


Updated the description with a possible workaround that just occurred to me.
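
(For anyone who wants to see the underlying JVM behaviour outside of Solr, a tiny 
stand-alone illustration -- not Solr code, and the exact results depend on the JVM's 
locale data:)
{code:java}
// Stand-alone illustration of the CLDR grouping-character change described above.
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class FrenchGroupingDemo {
  public static void main(String[] args) throws ParseException {
    NumberFormat nf = NumberFormat.getInstance(Locale.FRANCE);
    // On pre-java13 JVMs the U+00A0 version parses as 1000 while the U+202F version
    // stops at the space (yielding 1); on java13 (CLDR 35.1) the results are reversed.
    System.out.println(nf.parse("1\u00A0000"));   // NO-BREAK SPACE as grouping char
    System.out.println(nf.parse("1\u202F000"));   // NARROW NO-BREAK SPACE as grouping char
  }
}
{code}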

> java 13 changes to locale specific Numeric parsing rules affect ParseNumeric 
> UpdateProcessors when using 'local' config option  - notably affects French
> 
>
> Key: SOLR-13580
> URL: https://issues.apache.org/jira/browse/SOLR-13580
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
>  Labels: Java13
> Attachments: SOLR-13580.patch
>
>
> Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has 
> updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the 
> definition of language & locale specific formatting characters – in a 
> non-backwards compatible way due to "French" changes in [CLDR 
> 34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes]
> This impacts people who use any of the "ParseNumeric" UpdateProcessors in 
> conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the 
> (pre java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a 
> 

[jira] [Resolved] (SOLR-13580) java 13 changes to locale specific Numeric parsing rules affect ParseNumeric UpdateProcessors when using 'local' config option - notably affects French

2019-06-28 Thread Hoss Man (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-13580.
-
Resolution: Not A Bug

> java 13 changes to locale specific Numeric parsing rules affect ParseNumeric 
> UpdateProcessors when using 'local' config option  - notably affects French
> 
>
> Key: SOLR-13580
> URL: https://issues.apache.org/jira/browse/SOLR-13580
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
>  Labels: Java13
> Attachments: SOLR-13580.patch
>
>
> Per [JDK-8221432|https://bugs.openjdk.java.net/browse/JDK-8221432] Java13 has 
> updated to [CLDR 35.1|http://cldr.unicode.org/] – which controls the 
> definition of language & locale specific formatting characters – in a 
> non-backwards compatible way due to "French" changes in [CLDR 
> 34|http://cldr.unicode.org/index/downloads/cldr-34#TOC-Detailed-Data-Changes]
> This impacts people who use any of the "ParseNumeric" UpdateProcessors in 
> conjunction with the "locale=fr" or "locale=fr_FR" init param and expect the 
> (pre-java13) existing behavior of treating U+00A0 (NO BREAK SPACE) as a 
> "grouping" character (ie: between thousands and millions, between millions and 
> billions, etc...). Starting with java13 the JVM expects U+202F (NARROW NO 
> BREAK SPACE) in its place.
> Notably: upgrading to jdk13-ea+26 caused failures in Solr's 
> ParsingFieldUpdateProcessorsTest, which initially had hardcoded test data 
> that used U+00A0. ParsingFieldUpdateProcessorsTest has since been updated to 
> account for this discrepancy by modifying the test data used to determine the 
> "expected" character for the current JVM, but there is nothing Solr or the 
> ParseNumeric UpdateProcessors can do to help mitigate this change in behavior 
> for end users who upgrade to java13.
> Affected users with U+00A0 characters in their incoming SolrInputDocuments 
> will see the ParseNumeric UpdateProcessors (configured with locale=fr...) 
> "skip" these values as unparsable, most likely resulting in a failure to 
> index into a numeric field since the original "String" value will be left as 
> is.
> Affected users may want to consider updating their configs to include a 
> {{RegexReplaceProcessorFactory}} configured to strip out all whitespace 
> characters, prior to any ParseNumeric update processors configured to expect 
> French-language numbers.
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


