[jira] [Updated] (SOLR-13727) V2Requests: HttpSolrClient replaces first instance of "/solr" with "/api" instead of using regex pattern

2019-09-04 Thread Yonik Seeley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13727:

Fix Version/s: 8.3
   master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> V2Requests: HttpSolrClient replaces first instance of "/solr" with "/api" 
> instead of using regex pattern
> 
>
> Key: SOLR-13727
> URL: https://issues.apache.org/jira/browse/SOLR-13727
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: clients - java, v2 API
>Affects Versions: 8.2
>Reporter: Megan Carey
>Priority: Major
>  Labels: easyfix, patch
> Fix For: master (9.0), 8.3
>
> Attachments: SOLR-13727.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When the HttpSolrClient is formatting a V2Request, it needs to change the 
> endpoint from the default "/solr/..." to "/api/...". It does so by simply 
> calling String.replace, which replaces the first instance of "/solr" in the 
> URL with "/api".
>  
> In the case where the host's address starts with "solr" and the HTTP protocol 
> is appended, this call changes the address for the request. Example:
> if baseUrl is "http://solr-host.com/8983/solr;, this call will change to 
> "http:/api-host.com:8983/solr"
>  
> We should use a regex pattern to ensure that we're replacing the correct 
> portion of the URL.
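
For illustration, a minimal sketch of an anchored fix (not necessarily the 
committed patch): rewrite only the URL's path component so the host part can 
never match.

{code}
import java.net.MalformedURLException;
import java.net.URL;

public class V2PathRewrite {
  // Sketch: swap a leading "/solr" for "/api" in the path only,
  // leaving scheme, host, and port untouched.
  static String toV2BaseUrl(String baseUrl) throws MalformedURLException {
    URL url = new URL(baseUrl);  // e.g. http://solr-host.com:8983/solr
    String path = url.getPath().replaceFirst("^/solr", "/api");
    return new URL(url.getProtocol(), url.getHost(), url.getPort(), path).toString();
  }

  public static void main(String[] args) throws MalformedURLException {
    // prints http://solr-host.com:8983/api -- the host is no longer touched
    System.out.println(toV2BaseUrl("http://solr-host.com:8983/solr"));
  }
}
{code}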



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13727) V2Requests: HttpSolrClient replaces first instance of "/solr" with "/api" instead of using regex pattern

2019-09-03 Thread Yonik Seeley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921677#comment-16921677
 ] 

Yonik Seeley commented on SOLR-13727:
-

Changes look good to me! I'll commit soon unless anyone else sees an issue with 
this approach.

> V2Requests: HttpSolrClient replaces first instance of "/solr" with "/api" 
> instead of using regex pattern
> 
>
> Key: SOLR-13727
> URL: https://issues.apache.org/jira/browse/SOLR-13727
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: clients - java, v2 API
>Affects Versions: 8.2
>Reporter: Megan Carey
>Priority: Major
>  Labels: easyfix, patch
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When the HttpSolrClient is formatting a V2Request, it needs to change the 
> endpoint from the default "/solr/..." to "/api/...". It does so by simply 
> calling String.replace, which replaces the first instance of "/solr" in the 
> URL with "/api".
>  
> In the case where the host's address starts with "solr" and the HTTP protocol 
> is appended, this call changes the address for the request. Example:
> if baseUrl is "http://solr-host.com/8983/solr;, this call will change to 
> "http:/api-host.com:8983/solr"
>  
> We should use a regex pattern to ensure that we're replacing the correct 
> portion of the URL.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13695) SPLITSHARD (link), followed by DELETESHARD of parent shard causes data loss

2019-08-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907581#comment-16907581
 ] 

Yonik Seeley commented on SOLR-13695:
-

Was the SPLITSHARD asynchronous?  I'm wondering if maybe the DELETESHARD 
happened before the SPLITSHARD completed.
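
For anyone reproducing this, a hedged SolrJ sketch of the safe ordering 
(method names approximate): wait for the async split to complete before 
deleting the parent shard.

{code}
// solrClient: an existing CloudSolrClient
String asyncId = CollectionAdminRequest.splitShard("myColl")
    .setShardName("shard1")
    .setSplitMethod("link")   // the "link" method used in this report
    .processAsync(solrClient);

// poll REQUESTSTATUS until the split finishes (timeout in seconds)
RequestStatusState state =
    CollectionAdminRequest.requestStatus(asyncId).waitFor(solrClient, 600);

if (state == RequestStatusState.COMPLETED) {
  // only delete the (now inactive) parent once the split has completed
  CollectionAdminRequest.deleteShard("myColl", "shard1").process(solrClient);
}
{code}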

> SPLITSHARD (link), followed by DELETESHARD of parent shard causes data loss
> ---
>
> Key: SOLR-13695
> URL: https://issues.apache.org/jira/browse/SOLR-13695
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Critical
>
> One of my clients experienced data loss with the following sequence of 
> operations:
> 1) SPLITSHARD with method as "link".
> 2) DELETESHARD of the parent (inactive) shard.
> 3) Query for documents in the subshards; it seems both subshards have 0 
> documents.
> Proposing a fix (after offline discussion with [~noble.paul]) based on 
> running FORCEMERGE after SPLITSHARD (such that segments are rewritten), and 
> not letting DELETESHARD delete the data directory until the FORCEMERGE 
> operations finish.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-08-08 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903418#comment-16903418
 ] 

Yonik Seeley commented on SOLR-13399:
-

Ah, yep... splitByPrefix definitely should not be defaulting to true!  It ended 
up normally doing nothing (since id_prefix was normally not populated), but 
that changed when the last commit to use the indexed "id" field was added.  
I'll fix the default to be false.

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch, SOLR-13399_useId.patch, 
> ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-08-08 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903395#comment-16903395
 ] 

Yonik Seeley commented on SOLR-13399:
-

Weird... I don't know how that commit could have caused a failure in 
ShardSplitTest, but I'll investigate.

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch, SOLR-13399_useId.patch, 
> ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13399) compositeId support for shard splitting

2019-08-03 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13399:

Attachment: SOLR-13399_useId.patch
Status: Reopened  (was: Reopened)

Here's an enhancement that uses the "id" field for histogram generation if 
there is nothing found in the "id_prefix" field.


> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch, SOLR-13399_useId.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13399) compositeId support for shard splitting

2019-07-29 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13399:

Attachment: SOLR-13399_testfix.patch
Status: Reopened  (was: Reopened)

Attaching a patch to fix the test bug by explicitly forcing the number of bits 
in the test when using tri-level ids like "foo/16!bar!doc1".

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-07-29 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895494#comment-16895494
 ] 

Yonik Seeley commented on SOLR-13399:
-

OK, figured out the issue...
It turns out that if you have "foo!", then "foo!bar!" will normally not nest under it.  
The number of bits used for the first part of the hash is dynamic depending on 
the number of levels in the composite hash ID.  That's unfortunate for a number 
of reasons.  It also breaks the initial bi-level hash that guaranteed that you 
could just add a prefix to any document id without any escaping (i.e. if your 
ID happens to contain "!", it can cause the document hash to fall outside of 
the parent hash prefix.)

It looks like this is working as designed (according to SOLR-5320), but it was 
certainly surprising since it prevents hash routing from working out of the box 
in conjunction with tri-level ids without explicitly specifying bits with the 
"/" notation.


> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-07-24 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892321#comment-16892321
 ] 

Yonik Seeley commented on SOLR-13399:
-

Thanks for the heads up, I'll investigate.

bq. Also: it's really not cool to be adding new end user features/params w/o at 
least adding a one line summary of the new param to the relevant ref-guide page.

Sure, I had planned on doing so before 8.3 (unless you mean we've generally 
moved to doing the docs as part of the initial commit?  If so, I missed that.)

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11266) V2 API returning wrong content-type

2019-07-19 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889155#comment-16889155
 ] 

Yonik Seeley commented on SOLR-11266:
-

> One can't say that we are serving valid JSON

Perhaps not a valid HTTP JSON response, but a valid text response containing 
valid JSON.  It was deliberate, still standards-conforming, and is no longer 
needed.
For more context, some of our previous tutorials embedded hyperlinks that users 
were supposed to click on to see results in their browsers (which resulted in 
a very poor experience when a browser couldn't handle the content-type by 
default).

> V2 API returning wrong content-type
> ---
>
> Key: SOLR-11266
> URL: https://issues.apache.org/jira/browse/SOLR-11266
> Project: Solr
>  Issue Type: Bug
>  Components: v2 API
>Reporter: Ishan Chattopadhyaya
>Priority: Major
> Attachments: SOLR-11266.patch
>
>
> The content-type of the returned value is wrong in many places. It should 
> return "application/json", but instead returns "text/plain".
> Here's an example:
> {code}
> [ishan@t430 ~] $ curl -v 
> "http://localhost:8983/api/collections/products/select?q=*:*&rows=0"
> *   Trying 127.0.0.1...
> * TCP_NODELAY set
> * Connected to localhost (127.0.0.1) port 8983 (#0)
> > GET /api/collections/products/select?q=*:*&rows=0 HTTP/1.1
> > Host: localhost:8983
> > User-Agent: curl/7.51.0
> > Accept: */*
> > 
> < HTTP/1.1 200 OK
> < Content-Type: text/plain;charset=utf-8
> < Content-Length: 184
> < 
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":1,
> "params":{
>   "q":"*:*",
>   "rows":"0"}},
>   "response":{"numFound":260,"start":0,"docs":[]
>   }}
> * Curl_http_done: called premature == 0
> * Connection #0 to host localhost left intact
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11266) V2 API returning wrong content-type

2019-07-19 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889096#comment-16889096
 ] 

Yonik Seeley commented on SOLR-11266:
-

> I'm not of the opinion that users looking at a response in a browser are our 
> main target audience.

More than someone trying to write a context-free Solr client, I'd say ;)
 I think most people wanted application/json out of a misguided sense of 
correctness (but it's not incorrect to have JSON-formatted text in a plain-text 
HTTP response, and I disagree that this issue should be categorized as a bug). 
Although one can argue that application/json is *more* appropriate given that 
it's more specific.

That said, I just tried out the current versions of chrome, safari, and firefox 
and they all now work when application/json is used, so I'm fine with using 
"application/json" by default going forward.  When this was previously decided, 
it was the case that no major browsers supported that content-type.


> V2 API returning wrong content-type
> ---
>
> Key: SOLR-11266
> URL: https://issues.apache.org/jira/browse/SOLR-11266
> Project: Solr
>  Issue Type: Bug
>  Components: v2 API
>Reporter: Ishan Chattopadhyaya
>Priority: Major
> Attachments: SOLR-11266.patch
>
>
> The content-type of the returned value is wrong in many places. It should 
> return "application/json", but instead returns "text/plain".
> Here's an example:
> {code}
> [ishan@t430 ~] $ curl -v 
> "http://localhost:8983/api/collections/products/select?q=*:*&rows=0"
> *   Trying 127.0.0.1...
> * TCP_NODELAY set
> * Connected to localhost (127.0.0.1) port 8983 (#0)
> > GET /api/collections/products/select?q=*:*&rows=0 HTTP/1.1
> > Host: localhost:8983
> > User-Agent: curl/7.51.0
> > Accept: */*
> > 
> < HTTP/1.1 200 OK
> < Content-Type: text/plain;charset=utf-8
> < Content-Length: 184
> < 
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":1,
> "params":{
>   "q":"*:*",
>   "rows":"0"}},
>   "response":{"numFound":260,"start":0,"docs":[]
>   }}
> * Curl_http_done: called premature == 0
> * Connection #0 to host localhost left intact
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13399) compositeId support for shard splitting

2019-07-19 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-13399.
-
   Resolution: Fixed
Fix Version/s: 8.3

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 8.3
>
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-13399) compositeId support for shard splitting

2019-07-19 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned SOLR-13399:
---

Assignee: Yonik Seeley

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

2019-07-18 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888048#comment-16888048
 ] 

Yonik Seeley commented on SOLR-13399:
-

Final patch attached; I plan on committing soon. Some implementation notes:
- this only takes into account 2-level prefix keys, not tri-level yet (that can 
be a followup JIRA)
- we currently only split into 2 ranges (again, can be extended in a followup 
JIRA)
- if "id_prefix" has no values/data then no "ranges" split recommendation is 
returned and the split proceeds as if splitByPrefix had not been specified.
  - in the future we could use the "id" field as a slower version
- Split within a prefix is only done if there are not multiple prefix buckets 
in the shard (i.e. no allowedSizeDifference implemented in this issue)
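
For example, a split request using the new parameter might look like this 
(hypothetical SolrJ usage; setter name assumed):

{code}
CollectionAdminRequest.splitShard("myColl")
    .setShardName("shard1")
    .setSplitByPrefix(true)   // overseer first asks SPLIT for recommended ranges
    .process(solrClient);
{code}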

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Priority: Major
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13399) compositeId support for shard splitting

2019-07-18 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13399:

Attachment: SOLR-13399.patch

> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Priority: Major
> Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13399) compositeId support for shard splitting

2019-07-10 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882321#comment-16882321
 ] 

Yonik Seeley edited comment on SOLR-13399 at 7/10/19 6:43 PM:
--

Here's a draft patch (no tests yet) for feedback.
This adds a parameter "splitByPrefix" to SPLITSHARD.  When the overseer sees 
this parameter, it sends an additional SPLIT request with the "getRanges" 
parameter set.  This causes SPLIT (SplitOp.java) to calculate the ranges based 
on the prefix field "id_prefix" and return the recommended split string in the 
response in the "ranges" parameter.  SPLITSHARD in the overseer then proceeds 
as if that ranges string had been passed in by the user.

"id_prefix" is currently populated via a copyField in the schema:
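A sketch of what that copyField setup might look like (field and type names 
from the text above; the tokenizer and attribute values are assumed, not taken 
from the patch):

{code}
<!-- Sketch only. Index just the compositeId prefix of each id
     (everything through the trailing "!") into id_prefix. -->
<fieldType name="id_prefix_type" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=".*!" group="0"/>
  </analyzer>
</fieldType>
<field name="id_prefix" type="id_prefix_type" indexed="true" stored="false"/>
<copyField source="id" dest="id_prefix"/>
{code}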

The prefix field is currently always "id_prefix" (convention / implicit).  Not 
sure if it adds value to make it configurable via a "field" parameter on the 
SPLITSHARD command.



was (Author: ysee...@gmail.com):
Here's a draft patch (no tests yet) for feedback.
This adds a parameter "splitByPrefix" to SPLITSHARD.  When the overseer sees 
this parameter, it sends an additional SPLIT request with the "getRanges" 
parameter set.  This causes SPLIT (SplitOp.java) to calculate the ranges based 
on the prefix field "id_prefix" and return the recommended split string in the 
response in the "ranges" parameter.  SPLITSHARD in the overseer then proceeds 
as if that ranges string had been passed in by the user.

"id_prefix" is currently populated via a copyField in the schema:
{code}

  
  
  

  

  
{code}

The field "id_prefix" is currently hard-coded.  Perhaps this should be made 
configurable via a "field" parameter on the SPLITSHARD command?


> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Priority: Major
> Attachments: SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13399) compositeId support for shard splitting

2019-07-10 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13399:

Attachment: SOLR-13399.patch
Status: Open  (was: Open)

Here's a draft patch (no tests yet) for feedback.
This adds a parameter "splitByPrefix" to SPLITSHARD.  When the overseer sees 
this parameter, it sends an additional SPLIT request with the "getRanges" 
parameter set.  This causes SPLIT (SplitOp.java) to calculate the ranges based 
on the prefix field "id_prefix" and return the recommended split string in the 
response in the "ranges" parameter.  SPLITSHARD in the overseer then proceeds 
as if that ranges string had been passed in by the user.

"id_prefix" is currently populated via a copyField in the schema:
{code}

  
  
  

  

  
{code}

The field "id_prefix" is currently hard-coded.  Perhaps this should be made 
configurable via a "field" parameter on the SPLITSHARD command?


> compositeId support for shard splitting
> ---
>
> Key: SOLR-13399
> URL: https://issues.apache.org/jira/browse/SOLR-13399
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Priority: Major
> Attachments: SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout). Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e. Lucene keeps track of 
> the doc count for each term already.)  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13350) Explore collector managers for multi-threaded search

2019-05-15 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840549#comment-16840549
 ] 

Yonik Seeley commented on SOLR-13350:
-

In general, it seems like an executor for parallel searches would be more 
useful at the CoreContainer level.  If the executor is per-searcher, then 
picking a pool size high enough for good concurrency on a single core means 
that one would get way too many threads if one has tons of cores per node (not 
that unusual).

We should also audit all Weight classes in Solr for thread safety (if it hasn't 
been done yet).  Relying on existing tests to catch stuff like that won't work 
that well for catching race conditions.
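
As a reference point, the Lucene-level mechanics under discussion (a sketch 
using Lucene 8-era names; reader and query are assumed to exist):

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

// An IndexSearcher built with an executor searches leaf slices concurrently;
// the CollectorManager creates one collector per slice and reduces them.
ExecutorService pool = Executors.newFixedThreadPool(4); // better shared, e.g. per CoreContainer
IndexSearcher searcher = new IndexSearcher(reader, pool);
TopDocs hits = searcher.search(query,
    TopScoreDocCollector.createSharedManager(10, null, Integer.MAX_VALUE));
{code}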

> Explore collector managers for multi-threaded search
> 
>
> Key: SOLR-13350
> URL: https://issues.apache.org/jira/browse/SOLR-13350
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
> Attachments: SOLR-13350.patch, SOLR-13350.patch, SOLR-13350.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> AFAICT, SolrIndexSearcher can be used only to search all the segments of an 
> index in series. However, using CollectorManagers, segments can be searched 
> concurrently and result in reduced latency. Opening this issue to explore the 
> effectiveness of using CollectorManagers in SolrIndexSearcher from latency 
> and throughput perspective.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13437) fork noggit code to Solr

2019-05-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839566#comment-16839566
 ] 

Yonik Seeley commented on SOLR-13437:
-

I'm fine with forking... I haven't had a chance to do anything with noggit 
recently.
It might make things easier to keep the same namespace though (for anyone in 
Solr who uses the noggit APIs directly).

> fork noggit code to Solr
> 
>
> Key: SOLR-13437
> URL: https://issues.apache.org/jira/browse/SOLR-13437
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrJ
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We rely on noggit for all our JSON encoding/decoding needs. The main project 
> is not actively maintained. We cannot easily switch to another parser 
> because it may cause backward incompatibility; we have advertised the 
> ability to use flexible JSON, and we also use noggit internally in many 
> classes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-05-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839437#comment-16839437
 ] 

Yonik Seeley commented on LUCENE-8753:
--

Thanks Bruno, awesome stuff!  A single FST for multiple fields is an important 
optimization.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached PDF explains the technique visually in more detail)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835597#comment-16835597
 ] 

Yonik Seeley commented on LUCENE-8796:
--

Hmmm, that looks like it's searching the whole space each time instead of 
starting that the current point?

Presumably this:
{code}
  while(bound < length && docs[bound] < target) {
{code}
Should be something like this:
{code}
  while(i+bound < length && docs[i+bound] < target) {
{code}
And also adjust the bounds of the following binary search to match as well.
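
Putting it together, a corrected advance() might look roughly like this (field 
names assumed from the snippet above):

{code}
// Sketch: gallop forward from the current index i, then binary search the
// last doubled window.  Assumes fields: int[] docs, int length, int i.
int advance(int target) {
  int bound = 1;
  while (i + bound < length && docs[i + bound] < target) {
    bound <<= 1;  // 1, 2, 4, 8, ... until we pass target or run off the end
  }
  int lo = i + (bound >> 1) + 1;             // docs[i + bound/2] is still < target
  int hi = Math.min(i + bound, length - 1);
  while (lo <= hi) {                         // first index with docs[idx] >= target
    int mid = (lo + hi) >>> 1;
    if (docs[mid] < target) lo = mid + 1; else hi = mid - 1;
  }
  i = lo;
  return i < length ? docs[i] : DocIdSetIterator.NO_MORE_DOCS;
}
{code}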


> Use exponential search in IntArrayDocIdSet advance method
> -
>
> Key: LUCENE-8796
> URL: https://issues.apache.org/jira/browse/LUCENE-8796
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>
> Chatting with [~jpountz], he suggested improving IntArrayDocIdSet by making 
> its advance method use exponential search instead of binary search. This 
> should help performance of queries including conjunctions: given that 
> ConjunctionDISI uses leap frog, it advances through doc ids in small steps, 
> hence exponential search should be faster when advancing on average compared 
> to binary search.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13320) add a param ignoreVersionConflicts=true to updates to not overwrite existing docs

2019-05-05 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833460#comment-16833460
 ] 

Yonik Seeley commented on SOLR-13320:
-

Hmmm, when I read "ignoreVersionConflicts" I assumed the wrong behavior... go 
ahead and add even if there is a version conflict.  We aren't really ignoring 
it, but rather continuing on to the next update/doc in the batch after it 
happened?

I'm not sure I can think of a better name though... thinking along the lines 
of [~gus_heck], 
maybe something like "continueOnVersionConflict" (or "continueOnError" for the 
general case)?

> add a param ignoreVersionConflicts=true to updates to not overwrite existing 
> docs
> -
>
> Key: SOLR-13320
> URL: https://issues.apache.org/jira/browse/SOLR-13320
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-13320.patch, SOLR-13320.patch
>
>
> Updates should have an option to ignore duplicate documents and drop them if 
> an option {{ignoreDuplicates=true}} is specified.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13431) Efficient updates with shared storage

2019-04-29 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13431:

Description: 
h2. Background & problem statement:

With shared storage support, data durability is handled by the storage layer 
(e.g. S3 or HDFS) and replicas are not needed for durability. This changes the 
nature of how a single update (say adding a document) must be handled. The 
local transaction log does not help... a node can go down and never come back. 
The implication is that *a commit must be done for any updates to be considered 
durable.*

The problem is also more complex than just batching updates and adding a commit 
at the end of a batch. Consider indexing documents A,B,C,D followed by a commit:
 1) documents A,B sent to leader1 and indexed
 2) leader1 fails, leader2 is elected
 3) documents C,D sent to leader2 and indexed
 4) commit
 After this sequence of events, documents A,B are actually lost because a 
commit was not done on leader1 before it failed.

Adding a commit for every single update would fix the problem of data loss, but 
would obviously be too expensive (and each commit will be more expensive). We 
can still do batches if we *disable transparent failover* for a batch (see the 
sketch after this list):
 - all updates in a batch (for a specific shard) should be indexed on the *same 
leader*... any change in leadership should result in a failure at the low level 
instead of any transparent failover or forwarding.
 - in the event of a failure, *all updates since the last commit must be 
replayed* (we can't just retry the failure itself), or the failure will need to 
be bubbled up to a higher layer to retry from the beginning.
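
A rough sketch of that batch contract (hypothetical class and method names):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical: buffer everything since the last acked commit for one shard;
// on any leader change, fail fast and replay instead of forwarding.
class ShardIndexingSession {
  private final List<SolrInputDocument> pendingSinceCommit = new ArrayList<>();

  void index(SolrInputDocument doc) {
    pendingSinceCommit.add(doc);
    sendToPinnedLeader(doc);        // no transparent failover allowed
  }

  void commit() {
    commitOnPinnedLeader();
    pendingSinceCommit.clear();     // durable on shared storage; safe to forget
  }

  void onLeaderChange(String newLeaderUrl) {
    // wait for the new leader election, then replay since the last commit
    for (SolrInputDocument doc : pendingSinceCommit) send(newLeaderUrl, doc);
  }

  // transport helpers elided; all hypothetical
  private void sendToPinnedLeader(SolrInputDocument doc) {}
  private void commitOnPinnedLeader() {}
  private void send(String url, SolrInputDocument doc) {}
}
{code}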

h2. Indexing scenario 1: CSV upload

If SolrCloud is loading a large CSV file, the receiving Solr node will forward 
updates to the correct leaders. This happens in the DistributedUpdateProcessor 
via SolrCmdDistributor, which ends up using a ConcurrentUpdateHttp2SolrClient 
subclass.

Fixing this scenario for shared storage in the simplest way would entail adding 
a commit to every update, which would be way too slow.

The forward-to-replica use case here is quite different from the 
forward-to-correct-leader case (the latter has the current Solr node acting 
much more like an external client).  To simplify development, we may want to 
separate these cases and continue using the existing code for 
forward-to-replica. 

h2. Indexing scenario 2: SolrJ bulk indexing

In this scenario, a client is trying to do a large amount of indexing and can 
use batches or streaming. For this scenario, we could just require that a 
commit be added for each batch and then fail a batch on any leader change. This 
is problematic for a couple of reasons:
 - larger batches add latency to build, hurting throughput
 - doesn't scale well - as a collection grows, the number of shards grow and 
the chance that any shard leader goes down (or the shard is split) goes up. 
Requiring that the entire batch (all shards) be replayed when this happens is 
wasteful and gets worse with collection growth.

h2. Proposed Solution: a SolrJ cloud aware streaming client
 - something like ConcurrentUpdateHttp2SolrClient that can stream and know 
about cloud layout
 - track when last commit happened for each shard leader
 - buffer updates per-shard since the last commit happened
 -- doesn't have to be exact... assume idempotent updates here, so overlap is 
fine
 -- buffering would also be triggered by the replica type of the collection (so 
this class could be used for both shared storage and normal NRT replicas) 
 - a parameter would be passed that would disallow any forwarding (since we're 
handling buffering/failover at this level)
 - on a failure because of a leader going down or loss of leadership, wait 
until a new leader has been elected and then replay updates since the last 
commit
 - insert commits where necessary to prevent buffers from growing too large
 -- inserted commits should be able to proceed in parallel... we shouldn't need 
to block and wait for a commit before resuming to send documents to that leader.
 -- it would be nice if there was a way we could get notified if a commit 
happened via some other mechanism (like an autoCommit being triggered)
  --- assuming we can't get this, perhaps we should pass a flag that disables 
triggering auto-commits for these batch updates?
 - handle splits (not only can a shard leader change, but a shard could 
split... buffered updates may need to be re-slotted)
 - need to handle a leader "bounce" like a change in leadership (assuming we're 
skipping using the transaction log)
 - multi-threaded - all updates to a leader regardless of thread are managed as 
a single update stream
 -- this perhaps provides a way to coalesce incremental/realtime updates
 - OPTIONAL: ability to have multiple channels to a single leader?
 -- we would need to avoid reordering updates to the same ID
 -- an 

[jira] [Updated] (SOLR-13431) Efficient updates with shared storage

2019-04-29 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13431:

Description: 
h2. Background & problem statement:

With shared storage support, data durability is handled by the storage layer 
(e.g. S3 or HDFS) and replicas are not needed for durability. This changes the 
nature of how a single update (say adding a document) must be handled. The 
local transaction log does not help... a node can go down and never come back. 
The implication is that *a commit must be done for any updates to be considered 
durable.*

The problem is also more complex than just batching updates and adding a commit 
at the end of a batch. Consider indexing documents A,B,C,D followed by a commit:
 1) documents A,B sent to leader1 and indexed
 2) leader1 fails, leader2 is elected
 3) documents C,D sent to leader2 and indexed
 4) commit
 After this sequence of events, documents A,B are actually lost because a 
commit was not done on leader1 before it failed.

Adding a commit for every single update would fix the problem of data loss, but 
would obviously be too expensive (and each commit will be more expensive). We can 
still do batches if we *disable transparent failover* for a batch:
 - all updates in a batch (for a specific shard) should be indexed on the *same 
leader*... any change in leadership should result in a failure at the low level 
instead of any transparent failover or forwarding.
 - in the event of a failure, *all updates since the last commit must be 
replayed* (we can't just retry the failure itself), or the failure will need to 
be bubbled up to a higher layer to retry from the beginning.

h2. Indexing scenario 1: CSV upload

If SolrCloud is loading a large CSV file, the receiving Solr node will forward 
updates to the correct leaders. This happens in the DistributedUpdateProcessor 
via SolrCmdDistributor, which ends up using a ConcurrentUpdateHttp2SolrClient 
subclass.

The forward-to-replica use case here is quite different from the 
forward-to-correct-leader case (the latter has the current Solr node acting much 
more like an external client). To simplify development, we may want to 
separate these cases and continue using the existing code for 
forward-to-replica. 

h2. Indexing scenario 2: SolrJ bulk indexing

In this scenario, a client is trying to do a large amount of indexing and can 
use batches or streaming. For this scenario, we could just require that a 
commit be added for each batch and then fail a batch on any leader change. This 
is problematic for a couple of reasons:
 - larger batches add latency to build, hurting throughput
 - doesn't scale well - as a collection grows, the number of shards grows and 
the chance that any shard leader goes down (or the shard is split) goes up. 
Requiring that the entire batch (all shards) be replayed when this happens is 
wasteful and gets worse with collection growth.

h2. Proposed Solution: a SolrJ cloud aware streaming client
 - something like ConcurrentUpdateHttp2SolrClient that can stream and know 
about cloud layout
 - track when last commit happened for each shard leader
 - buffer updates per-shard since the last commit happened
 -- doesn't have to be exact... assume idempotent updates here, so overlap is 
fine
 -- buffering would also be triggered by the replica type of the collection (so 
this class could be used for both shared storage and normal NRT replicas) 
 - a parameter would be passed that would disallow any forwarding (since we're 
handling buffering/failover at this level)
 - on a failure because of a leader going down or loss of leadership, wait 
until a new leader has been elected and then replay updates since the last 
commit
 - insert commits where necessary to prevent buffers from growing too large
 -- inserted commits should be able to proceed in parallel... we shouldn't need 
to block and wait for a commit before resuming sending documents to that leader.
 -- it would be nice if there was a way we could get notified if a commit 
happened via some other mechanism (like an autoCommit being triggered)
  --- assuming we can't get this, perhaps we should pass a flag that disables 
triggering auto-commits for these batch updates?
 - handle splits (not only can a shard leader change, but a shard could 
split... buffered updates may need to be re-slotted)
 - need to handle a leader "bounce" like a change in leadership (assuming we're 
skipping using the transaction log)
 - multi-threaded - all updates to a leader regardless of thread are managed as 
a single update stream
 -- this perhaps provides a way to coalesce incremental/realtime updates
 - OPTIONAL: ability to have multiple channels to a single leader?
 -- we would need to avoid reordering updates to the same ID
 -- an alternative to attempting to create more parallelism-per-shard on the 
client side is to do it on the server side.
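
To make the buffering/replay idea above concrete, here is a minimal sketch of 
the per-shard bookkeeping (class and method names are hypothetical, not an 
existing SolrJ API; real code would also handle routing, inserted commits, and 
shard splits):

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

// Hypothetical per-shard buffer: everything sent since the last confirmed
// commit is retained so it can be replayed after a leader change.
class ShardBuffer {
  private final List<SolrInputDocument> sinceLastCommit = new ArrayList<>();

  // record a document as it is streamed to the current leader
  void record(SolrInputDocument doc) {
    sinceLastCommit.add(doc);
  }

  // a commit succeeded on this shard: buffered updates are now durable
  void onCommitSucceeded() {
    sinceLastCommit.clear();
  }

  // leader changed: these updates must be re-sent to the new leader
  // (overlap is fine since updates are assumed idempotent)
  List<SolrInputDocument> toReplay() {
    return new ArrayList<>(sinceLastCommit);
  }
}
{code}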

  was:
h2. Background & 

[jira] [Created] (SOLR-13431) Efficient updates with shared storage

2019-04-26 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-13431:
---

 Summary: Efficient updates with shared storage
 Key: SOLR-13431
 URL: https://issues.apache.org/jira/browse/SOLR-13431
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Yonik Seeley


h2. Background & problem statement:

With shared storage support, data durability is handled by the storage layer 
(e.g. S3 or HDFS) and replicas are not needed for durability. This changes the 
nature of how a single update (say adding a document) must be handled. The 
local transaction log does not help... a node can go down and never come back. 
The implication is that *a commit must be done for any updates to be considered 
durable.*

The problem is also more complex than just batching updates and adding a commit 
at the end of a batch. Consider indexing documents A,B,C,D followed by a commit:
 1) documents A,B sent to leader1 and indexed
 2) leader1 fails, leader2 is elected
 3) documents C,D sent to leader2 and indexed
 4) commit
 After this sequence of events, documents A,B are actually lost because a 
commit was not done on leader1 before it failed.

Adding a commit for every single update would fix the problem of data loss, but 
would obviously be too expensive (and each commit will be more expensive). We can 
still do batches if we *disable transparent failover* for a batch:
 - all updates in a batch (for a specific shard) should be indexed on the *same 
leader*... any change in leadership should result in a failure at the low level 
instead of any transparent failover or forwarding.
 - in the event of a failure, *all updates since the last commit must be 
replayed* (we can't just retry the failure itself), or the failure will need to 
be bubbled up to a higher layer to retry from the beginning.

h2. Indexing scenario 1: CSV upload

If SolrCloud is loading a large CSV file, the receiving Solr node will forward 
updates to the correct leaders. This happens in the DistributedUpdateProcessor 
via SolrCmdDistributor, which ends up using a ConcurrentUpdateHttp2SolrClient 
subclass.
h2. Indexing scenario 2: SolrJ bulk indexing

In this scenario, a client is trying to do a large amount of indexing and can 
use batches or streaming. For this scenario, we could just require that a 
commit be added for each batch and then fail a batch on any leader change. This 
is problematic for a couple of reasons:
 - larger batches add latency to build, hurting throughput
 - doesn't scale well - as a collection grows, the number of shards grows and 
the chance that any shard leader goes down (or the shard is split) goes up. 
Requiring that the entire batch (all shards) be replayed when this happens is 
wasteful and gets worse with collection growth.

h2. Proposed Solution: a SolrJ cloud aware streaming client
 - something like ConcurrentUpdateHttp2SolrClient that can stream and know 
about cloud layout
 - track when last commit happened for each shard leader
 - buffer updates per-shard since the last commit happened
 -- doesn't have to be exact... assume idempotent updates here, so overlap is 
fine
 -- buffering would also be triggered by the replica type of the collection (so 
this class could be used for both shared storage and normal NRT replicas) 
 - a parameter would be passed that would disallow any forwarding (since we're 
handling buffering/failover at this level)
 - on a failure because of a leader going down or loss of leadership, wait 
until a new leader has been elected and then replay updates since the last 
commit
 - insert commits where necessary to prevent buffers from growing too large
 -- inserted commits should be able to proceed in parallel... we shouldn't need 
to block and wait for a commit before resuming sending documents to that leader.
 -- it would be nice if there was a way we could get notified if a commit 
happened via some other mechanism (like an autoCommit being triggered)
  --- assuming we can't get this, perhaps we should pass a flag that disables 
triggering auto-commits for these batch updates?
 - handle splits (not only can a shard leader change, but a shard could 
split... buffered updates may need to be re-slotted)
 - need to handle a leader "bounce" like a change in leadership (assuming we're 
skipping using the transaction log)
 - multi-threaded - all updates to a leader regardless of thread are managed as 
a single update stream
 -- this perhaps provides a way to coalesce incremental/realtime updates
 - OPTIONAL: ability to have multiple channels to a single leader?
 -- we would need to avoid reordering updates to the same ID
 -- an alternative to attempting to create more parallelism-per-shard on the 
client side is to do it on the server side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Commented] (SOLR-13405) Support 1 or 0 replicas per shard

2019-04-15 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818312#comment-16818312
 ] 

Yonik Seeley commented on SOLR-13405:
-

0 replica support thoughts:
The idea of bringing up another replica if 1 replica seems down can naturally 
be extended to include 0 replica support.  The idea can be recast as requesting 
a new replica on demand if all existing replicas (including 0) seem down to a 
client.  One area where this is a little different is the indexing side... 
there would need to be code in the indexing paths that recognizes when 0 
replicas are configured and brings one up on demand.  After a certain period of inactivity, 
we'd want to return to 0 replicas.  This could probably be split off into a 
different JIRA.


> Support 1 or 0 replicas per shard
> -
>
> Key: SOLR-13405
> URL: https://issues.apache.org/jira/browse/SOLR-13405
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
>Priority: Major
>
> When multiple replicas per shard are not needed for data durability (because 
> of shared storage support on HDFS or S3, etc), other cluster configurations 
> suddenly make sense like allowing 1 or even 0 replicas per shard (primarily 
> to lower costs.)
> One big issue with a single replica per shard is that zookeeper (and thus the 
> overseer) waits for a session timeout before marking the node as down.  
> Instead of queries having to wait this long (~30 sec), if a SolrJ query 
> client detects that a node died, it can ask the overseer to quickly bring up 
> another replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13405) Support 1 or 0 replicas per shard

2019-04-15 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818162#comment-16818162
 ] 

Yonik Seeley commented on SOLR-13405:
-

Some design considerations / thoughts:
 - the node/replica should not be marked down in ZK based on client 
detection... it should only cause a temporary new replica to be quickly brought 
up for querying.
 - this will have no effect on who is the leader... hence this only helps query 
side (which is normally much more latency sensitive).
 - overseer should dedup requests since multiple clients detecting a node going 
down will all request new replicas.
 -- to aid in this deduplication, client should include in its request which 
replica it detected as down
 - Node vs Core (replica) down detection? To lessen the impact of false down 
detection, and to speed completion of the current query, only request new 
replicas for the shards that are being queried (as opposed to all shards on the 
node that went down)
 - Return to normal state - at some point, we should return to the normal 
number of replicas.  Use autoscale framework for this?
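
As a rough illustration of the query-side recovery path (collection and shard 
names are made up, and the dedup/"temporary replica" semantics described above 
would live in the overseer, not in this snippet), a client could request a new 
replica through the standard SolrJ collections API:

{code:java}
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ReplicaOnDemand {
  // Sketch: after detecting that all replicas of a shard appear down,
  // ask for a new replica to be brought up for that shard.
  static void requestReplacementReplica(CloudSolrClient client)
      throws SolrServerException, IOException {
    CollectionAdminRequest.AddReplica add =
        CollectionAdminRequest.addReplicaToShard("myCollection", "shard1");
    add.process(client); // the overseer would need to dedup such requests
  }
}
{code}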

> Support 1 or 0 replicas per shard
> -
>
> Key: SOLR-13405
> URL: https://issues.apache.org/jira/browse/SOLR-13405
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
>Priority: Major
>
> When multiple replicas per shard are not needed for data durability (because 
> of shared storage support on HDFS or S3, etc), other cluster configurations 
> suddenly make sense like allowing 1 or even 0 replicas per shard (primarily 
> to lower costs.)
> One big issue with a single replica per shard is that zookeeper (and thus the 
> overseer) waits for a session timeout before marking the node as down.  
> Instead of queries having to wait this long (~30 sec), if a SolrJ query 
> client detects that a node died, it can ask the overseer to quickly bring up 
> another replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13405) Support 1 or 0 replicas per shard

2019-04-15 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-13405:
---

 Summary: Support 1 or 0 replicas per shard
 Key: SOLR-13405
 URL: https://issues.apache.org/jira/browse/SOLR-13405
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Yonik Seeley


When multiple replicas per shard are not needed for data durability (because of 
shared storage support on HDFS or S3, etc), other cluster configurations 
suddenly make sense like allowing 1 or even 0 replicas per shard (primarily to 
lower costs.)

One big issue with a single replica per shard is that zookeeper (and thus the 
overseer) waits for a session timeout before marking the node as down.  Instead 
of queries having to wait this long (~30 sec), if a SolrJ query client detects 
that a node died, it can ask the overseer to quickly bring up another replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13399) compositeId support for shard splitting

2019-04-12 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-13399:
---

 Summary: compositeId support for shard splitting
 Key: SOLR-13399
 URL: https://issues.apache.org/jira/browse/SOLR-13399
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Yonik Seeley


Shard splitting does not currently have a way to automatically take into 
account the actual distribution (number of documents) in each hash bucket 
created by using compositeId hashing.

We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* command 
that would look at the number of docs sharing each compositeId prefix and use 
that to create roughly equal sized buckets by document count rather than just 
assuming an equal distribution across the entire hash range.

Like normal shard splitting, we should bias against splitting within hash 
buckets unless necessary (since that leads to larger query fanout). Perhaps 
this warrants a parameter that would control how much of a size mismatch is 
tolerable before resorting to splitting within a bucket: 
*allowedSizeDifference*?

To more quickly calculate the number of docs in each bucket, we could index the 
prefix in a different field.  Iterating over the terms for this field would 
quickly give us the number of docs in each (i.e. Lucene keeps track of the doc 
count for each term already).  Perhaps the implementation could be a flag on 
the *id* field... something like *indexPrefixes* and poly-fields that would 
cause the indexing to be automatically done and alleviate having to pass in an 
additional field during indexing and during the call to *SPLITSHARD*.  This 
whole part is an optimization though and could be split off into its own issue 
if desired.
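
As a rough sketch of why the term-iteration approach is cheap (the 
{{id_prefix}} field name is invented for illustration; docFreq is exact only 
for a single-valued field with no deleted docs, and {{MultiTerms}} is the 
recent-Lucene entry point):

{code:java}
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class PrefixCounts {
  // Walk the terms of a hypothetical "id_prefix" field; Lucene already
  // stores the doc count for each term, so no documents are visited.
  static Map<String, Integer> countsByPrefix(IndexReader reader) throws IOException {
    Map<String, Integer> counts = new LinkedHashMap<>();
    Terms terms = MultiTerms.getTerms(reader, "id_prefix");
    if (terms != null) {
      TermsEnum termsEnum = terms.iterator();
      for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
        counts.put(term.utf8ToString(), termsEnum.docFreq());
      }
    }
    return counts;
  }
}
{code}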

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13272) Interval facet support for JSON faceting

2019-04-10 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814833#comment-16814833
 ] 

Yonik Seeley commented on SOLR-13272:
-

bq. why is it a separate type and not just an optional property of type:range? 

I agree it would probably be nicer to just have it as part of a range facet... 
that way other range parameters like "other", "include", etc. could be 
(eventually) supported / reused.
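
Purely as an illustration of that direction (this is not an implemented 
syntax; it just folds the intervals from the request below into a 
{{type:range}} facet so that parameters like "include" could eventually be 
shared):

{code:java}
json.facet={pubyear:{type : range, field : pubyear_i,
                     intervals : [{key:"2000-2200", value:"[2000,2200]"}],
                     include : "edge"}}
{code}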

> Interval facet support for JSON faceting
> 
>
> Key: SOLR-13272
> URL: https://issues.apache.org/jira/browse/SOLR-13272
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Apoorv Bhawsar
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Interval facet is supported in classical facet component but has no support 
> in json facet requests.
>  In cases of block join and aggregations, this would be helpful
> Assuming request format -
> {code:java}
> json.facet={pubyear:{type : interval,field : 
> pubyear_i,intervals:[{key:"2000-2200",value:"[2000,2200]"}]}}
> {code}
>  
>  PR https://github.com/apache/lucene-solr/pull/597



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11

2019-04-10 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814705#comment-16814705
 ] 

Yonik Seeley commented on LUCENE-8738:
--

bq. I think the Observable/Observer is uncritical.

I agree.  Pluggable transient core cache is super-expert level (almost more 
like internals) and if anyone actually uses it they can adapt when upgrading.
I did a quick scan of the related changes and they look fine.

> Bump minimum Java version requirement to 11
> ---
>
> Key: LUCENE-8738
> URL: https://issues.apache.org/jira/browse/LUCENE-8738
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: Java11
> Fix For: master (9.0)
>
>
> See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13323) Remove org.apache.solr.internal.csv.writer.CSVWriter (and related classes)

2019-03-24 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16800307#comment-16800307
 ] 

Yonik Seeley commented on SOLR-13323:
-

bq. Is there any reason to believe from its past history (of which I know 
nothing)

A quick history is that Solr needed a non-official commons-csv release and so 
the source was copied (but apparently all of the source, not just what was 
needed).
No deprecations are necessary for removal.

> Remove org.apache.solr.internal.csv.writer.CSVWriter (and related classes)
> --
>
> Key: SOLR-13323
> URL: https://issues.apache.org/jira/browse/SOLR-13323
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (9.0)
>Reporter: Gus Heck
>Priority: Minor
>
> This class appears to only be used in the test for itself. It's also easily 
> confused with org.apache.solr.response.CSVWriter 
> I propose we remove this class entirely. Is there any reason to believe from 
> its past history (of which I know nothing) that it might be depended upon by 
> outside code and require a deprecation cycle? 
> Presently it contains a System.out.println and a eclipse generated catch 
> block that precommit won't like if we enable checking for System.out.println, 
> which is why this ticket is a sub-task. If we do need to deprecate it then I 
> propose we remove the print and simply re-throw the exception as a 
> RuntimeException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6237) An option to have only leaders write and replicas read when using a shared file system with SolrCloud.

2019-03-13 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792202#comment-16792202
 ] 

Yonik Seeley commented on SOLR-6237:


bq. Thanks for the pointers Yonik! Based on the linked presentation, there is a 
working prototype in place at SalesForce. Is there a way I can help in the 
implementation or testing?

The code/impl referenced in the presentation is only for Solr stand-alone (not 
SolrCloud).  Hopefully we'll have something (rough proof-of-concept stuff) to 
share in the coming weeks though.  In the meantime feel free to share your 
thoughts on the linked issues. 



> An option to have only leaders write and replicas read when using a shared 
> file system with SolrCloud.
> --
>
> Key: SOLR-6237
> URL: https://issues.apache.org/jira/browse/SOLR-6237
> Project: Solr
>  Issue Type: New Feature
>  Components: hdfs, SolrCloud
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Attachments: 0001-unified.patch, SOLR-6237.patch, Unified Replication 
> Design.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6237) An option to have only leaders write and replicas read when using a shared file system with SolrCloud.

2019-03-09 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788704#comment-16788704
 ] 

Yonik Seeley commented on SOLR-6237:


Hi Peter, I opened SOLR-13101 and SOLR-13102 recently... I had lost track of 
this issue until you commented on it yesterday.

> An option to have only leaders write and replicas read when using a shared 
> file system with SolrCloud.
> --
>
> Key: SOLR-6237
> URL: https://issues.apache.org/jira/browse/SOLR-6237
> Project: Solr
>  Issue Type: New Feature
>  Components: hdfs, SolrCloud
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
> Attachments: 0001-unified.patch, SOLR-6237.patch, Unified Replication 
> Design.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9682) Ability to specify a query with a parameter name (in facet filter)

2019-02-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759171#comment-16759171
 ] 

Yonik Seeley commented on SOLR-9682:


> What if someone makes a typo when attempting to filter out some explicit 
> content?

If someone adds a filter and it doesn't work, the filter (and how it's 
specified via param) will be the first thing they look at (hence a typo should 
be easy to debug).  Removing a feature to allow detection of one very specific 
typo doesn't seem like a good trade-off in this scenario.

It's a common scenario to want to apply a filter only if one is provided.  It makes it 
easier to have a request that doesn't have to be modified as much based on the 
absence/presence of other parameters.

Also, "Multi-valued parameters should be supported." was part of the objective. 
 So the parameter refers to a list of filters... and allowing "0 or more" for a 
list is more flexible than "you're not allowed to have a 0 length list".
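
For reference, the usage being defended looks roughly like this (the parameter 
name is invented; if no {{myfilters}} params are supplied, the list is simply 
empty):

{code}
{
  "query" : "*:*",
  "filter" : [ { "param" : "myfilters" } ]
}
{code}

with the actual filters passed separately as request parameters, e.g. 
{{myfilters=inStock:true&myfilters=cat:electronics}}.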


> Ability to specify a query with a parameter name (in facet filter)
> --
>
> Key: SOLR-9682
> URL: https://issues.apache.org/jira/browse/SOLR-9682
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 6.4, 7.0
>
> Attachments: SOLR-9682.patch
>
>
> Currently, "filter" only supports query strings (examples at 
> http://yonik.com/solr-json-request-api/ )
> It would be nice to be able to reference a param that would be parsed as a 
> lucene/solr query.  Multi-valued parameters should be supported.
> We should keep in mind (and leave room for) a future "JSON Query Syntax" and 
> chose labels appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13101) Shared storage support in SolrCloud

2019-01-28 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754041#comment-16754041
 ] 

Yonik Seeley commented on SOLR-13101:
-

Thinking about how to kick this off... 
At the most basic level, looking at the HDFS layout scheme we see this ("test" 
is the name of the collection):
{code}
local_file_system://.../node1/test_shard1_replica_n1/core.properties
hdfs://.../data/test/core_node2/data/
{code}
And core.properties looks like:
{code}
numShards=1
collection.configName=conf1
name=test_shard1_replica_n1
replicaType=NRT
shard=shard1
collection=test
coreNodeName=core_node2
{code}

It seems like the most basic desirable change would be to the naming scheme for 
collections with shared storage.
Instead of .../<collection>/<core_node_name>/data
it should be .../<collection>/<shard_name>/data
since there is only one canonical index per shard.



> Shared storage support in SolrCloud
> ---
>
> Key: SOLR-13101
> URL: https://issues.apache.org/jira/browse/SOLR-13101
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Yonik Seeley
>Priority: Major
>
> Solr should have first-class support for shared storage (blob/object stores 
> like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, 
> etc).
> The key component will likely be a new replica type for shared storage.  It 
> would have many of the benefits of the current "pull" replicas (not indexing 
> on all replicas, all shards identical with no shards getting out-of-sync, 
> etc), but would have additional benefits:
>  - Any shard could become leader (the blob store always has the index)
>  - Better elasticity scaling down
>- durability not linked to number of replicas... a single replica could be 
> common for write workloads
>- could drop to 0 replicas for a shard when not needed (blob store always 
> has index)
>  - Allow for higher performance write workloads by skipping the transaction 
> log
>- don't pay for what you don't need
>- a commit will be necessary to flush to stable storage (blob store)
>  - A lot of the complexity and failure modes go away
> An additional component is a Directory implementation that will work well with 
> blob stores.  We probably want one that treats local disk as a cache since 
> the latency to remote storage is so large.  I think there are still some 
> "locking" issues to be solved here (ensuring that more than one writer to the 
> same index won't corrupt it).  This should probably be pulled out into a 
> different JIRA issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13165) enabling docValues on a tdate field and searching on the field is very slow

2019-01-24 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751303#comment-16751303
 ] 

Yonik Seeley commented on SOLR-13165:
-

Are you sure that the field was indexed both times?
As long as the tdate field is indexed, that index should be used for queries, 
regardless of whether it has docValues.

> enabling docValues on a tdate field and searching on the field is very slow
> ---
>
> Key: SOLR-13165
> URL: https://issues.apache.org/jira/browse/SOLR-13165
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Sheeba Dhanaraj
>Priority: Major
>
> when we enable docValues on a tdate field and search on the field response 
> time is very slow. when we remove docValues from the field performance is 
> significantly improved. Is this by design? should we not enable docValues for 
> tdate fields



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13156) Limiting field facet with certain terms via {!terms} not taking into account sorting

2019-01-23 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749886#comment-16749886
 ] 

Yonik Seeley commented on SOLR-13156:
-

Interesting. IIRC, this wasn't a public API, and was only used internally 
for facet refinement (hence no need for sorting).
It looks like at some point it got documented as a public API, so I guess it is 
now.

> Limiting field facet with certain terms via {!terms} not taking into account 
> sorting
> 
>
> Key: SOLR-13156
> URL: https://issues.apache.org/jira/browse/SOLR-13156
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Konstantin Perikov
>Priority: Major
>
> When limiting facet keys with \{!terms}, it doesn't take sorting into 
> account.
> First query not limiting the facet keys:
> {{facet.field=title&facet.sort=count&facet=on&q=*:*}}
> Response as expected:
> {{"facet_counts":\{ "facet_queries":{}, "facet_fields":\{ "title":[ 
> "book2",3, "book1",2, "book3",1]}, "facet_ranges":{}, "facet_intervals":{}, 
> "facet_heatmaps":{}
>  
> When doing it with limiting:
> {{facet.field=\{!terms=Book3,Book2,Book1}title&facet.sort=count&facet=on&q=*:*}}
> I'm getting the exact order of how I list terms:
> {{"facet_counts":\{ "facet_queries":{}, "facet_fields":\{ "title":[ 
> "Book3",1, "Book2",3, "Book1",2]}, "facet_ranges":{}, "facet_intervals":{}, 
> "facet_heatmaps":{}
> I've looked at the code, and it's clearly an issue there:
>  
> org.apache.solr.request.SimpleFacets#getListedTermCounts
>  
> {{for (String term : terms) {}}
> {{    int count = searcher.numDocs(ft.getFieldQuery(null, sf, term), 
> parsed.docs);}}
> {{    res.add(term, count);}}
> {{}}}
>  
> it's basically just iterating over the terms and doesn't do any sorting at all. 
>  
>  
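
For what it's worth, the fix presumably needs to re-sort the collected counts 
according to {{facet.sort}} after that loop; a minimal sketch of the 
{{facet.sort=count}} case (illustrative only, not the actual patch):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CountSort {
  // Sort (term, count) pairs by descending count, the facet.sort=count order.
  static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
    return entries;
  }
}
{code}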



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13102) Shared storage Directory implementation

2019-01-02 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13102:

Description: 
We need a general strategy (and probably a general base class) that can work 
with shared storage and not corrupt indexes from multiple writers.

One strategy that is used on local disk is to use locks.  This doesn't extend 
well to remote / shared filesystems when the locking is not tied into the 
object store itself since a process can lose the lock (a long GC or whatever) 
and then immediately try to write a file and there is no way to stop it.

An alternate strategy ditches the use of locks and simply avoids overwriting 
files by some algorithmic mechanism.
One of my colleagues outlined one way to do this: 
https://www.youtube.com/watch?v=UeTFpNeJ1Fo
That strategy uses random looking filenames and then writes a "core.metadata" 
file that maps between the random names and the original names.  The problem is 
then reduced to overwriting "core.metadata" when you lose the lock.  One way to 
fix this is to version "core.metadata".  Since the new leader election code was 
implemented, each shard has a monotonically increasing "leader term", and we can 
use that as part of the filename.  When a reader goes to open an index, it can 
use the latest file from the directory listing, or even use the term obtained 
from ZK if we can't trust the directory listing to be up to date.  
Additionally, we don't need random filenames to avoid collisions... a simple 
unique prefix or suffix would work fine (such as the leader term again)
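
A small sketch of the reader side of that versioning scheme (the 
{{core.metadata.<term>}} file layout is hypothetical):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

public class MetadataFiles {
  private static final String PREFIX = "core.metadata.";

  // Pick the metadata file with the highest leader term, e.g.
  // core.metadata.17 wins over core.metadata.16.  A reader that can't
  // trust the directory listing could use the term from ZK instead.
  static Optional<Path> latestMetadata(Path dir) throws IOException {
    try (Stream<Path> files = Files.list(dir)) {
      return files
          .filter(p -> p.getFileName().toString().startsWith(PREFIX))
          .max(Comparator.comparingLong(
              p -> Long.parseLong(p.getFileName().toString().substring(PREFIX.length()))));
    }
  }
}
{code}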



  was:
We need a general strategy (and probably a general base class) that can work 
with shared storage and not corrupt indexes from multiple writers.

One strategy that is used on local disk is to use locks.  This doesn't extend 
well to remote / shared filesystems when the locking is not tied into the 
object store itself since a process can lose the lock (a long GC or whatever) 
and then immediately try to write a file and there is no way to stop it.

An alternate strategy ditches the use of locks and simply avoids overwriting 
files by some algorithmic mechanism.
One of my colleagues outlined one way to do this: 
https://www.youtube.com/watch?v=UeTFpNeJ1Fo
That strategy uses random looking filenames and then writes a "core.metadata" 
file that maps between the random names and the original names.  The problem is 
then reduced to overwriting "core.metadata" when you lose the lock.  One way to 
fix this is to version "core.metadata".  Since the new leader election code was 
implemented, each shard has a monotonically increasing "leader term", and we can 
use that as part of the filename.  When a reader goes to open an index, it can 
use the latest file from the directory listing, or even use the term obtained 
from ZK if we can't trust the directory listing to be up to date.




> Shared storage Directory implementation
> ---
>
> Key: SOLR-13102
> URL: https://issues.apache.org/jira/browse/SOLR-13102
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
>Priority: Major
>
> We need a general strategy (and probably a general base class) that can work 
> with shared storage and not corrupt indexes from multiple writers.
> One strategy that is used on local disk is to use locks.  This doesn't extend 
> well to remote / shared filesystems when the locking is not tied into the 
> object store itself since a process can lose the lock (a long GC or whatever) 
> and then immediately try to write a file and there is no way to stop it.
> An alternate strategy ditches the use of locks and simply avoids overwriting 
> files by some algorithmic mechanism.
> One of my colleagues outlined one way to do this: 
> https://www.youtube.com/watch?v=UeTFpNeJ1Fo
> That strategy uses random looking filenames and then writes a "core.metadata" 
> file that maps between the random names and the original names.  The problem 
> is then reduced to overwriting "core.metadata" when you lose the lock.  One 
> way to fix this is to version "core.metadata".  Since the new leader election 
> code was implemented, each shard has a monotonically increasing "leader term", 
> and we can use that as part of the filename.  When a reader goes to open an 
> index, it can use the latest file from the directory listing, or even use the 
> term obtained from ZK if we can't trust the directory listing to be up to 
> date.  Additionally, we don't need random filenames to avoid collisions... a 
> simple unique prefix or suffix would work fine (such as the leader term again)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Created] (SOLR-13102) Shared storage Directory implementation

2019-01-02 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-13102:
---

 Summary: Shared storage Directory implementation
 Key: SOLR-13102
 URL: https://issues.apache.org/jira/browse/SOLR-13102
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Yonik Seeley


We need a general strategy (and probably a general base class) that can work 
with shared storage and not corrupt indexes from multiple writers.

One strategy that is used on local disk is to use locks.  This doesn't extend 
well to remote / shared filesystems when the locking is not tied into the 
object store itself since a process can lose the lock (a long GC or whatever) 
and then immediately try to write a file and there is no way to stop it.

An alternate strategy ditches the use of locks and simply avoids overwriting 
files by some algorithmic mechanism.
One of my colleagues outlined one way to do this: 
https://www.youtube.com/watch?v=UeTFpNeJ1Fo
That strategy uses random looking filenames and then writes a "core.metadata" 
file that maps between the random names and the original names.  The problem is 
then reduced to overwriting "core.metadata" when you lose the lock.  One way to 
fix this is to version "core.metadata".  Since the new leader election code was 
implemented, each shard has a monotonically increasing "leader term", and we can 
use that as part of the filename.  When a reader goes to open an index, it can 
use the latest file from the directory listing, or even use the term obtained 
from ZK if we can't trust the directory listing to be up to date.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13101) Shared storage support in SolrCloud

2019-01-02 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-13101:
---

 Summary: Shared storage support in SolrCloud
 Key: SOLR-13101
 URL: https://issues.apache.org/jira/browse/SOLR-13101
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Reporter: Yonik Seeley


Solr should have first-class support for shared storage (blob/object stores 
like S3, google cloud storage, etc. and shared filesystems like HDFS, NFS, etc).

The key component will likely be a new replica type for shared storage.  It 
would have many of the benefits of the current "pull" replicas (not indexing on 
all replicas, all shards identical with no shards getting out-of-sync, etc), 
but would have additional benefits:
 - Any shard could become leader (the blob store always has the index)
 - Better elasticity scaling down
   - durability not linked to number of replicas... a single replica could be 
common for write workloads
   - could drop to 0 replicas for a shard when not needed (blob store always 
has index)
 - Allow for higher performance write workloads by skipping the transaction log
   - don't pay for what you don't need
   - a commit will be necessary to flush to stable storage (blob store)
 - A lot of the complexity and failure modes go away

An additional component is a Directory implementation that will work well with 
blob stores.  We probably want one that treats local disk as a cache since the 
latency to remote storage is so large.  I think there are still some "locking" 
issues to be solved here (ensuring that more than one writer to the same index 
won't corrupt it).  This should probably be pulled out into a different JIRA 
issue.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13040) Harden TestSQLHandler.

2018-12-12 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719139#comment-16719139
 ] 

Yonik Seeley commented on SOLR-13040:
-

It's pretty strange... that error message "can not sort on a field..." is from 
a schema check and has nothing to do with what is in the index.
I tried looping the test overnight but couldn't reproduce it.
If I were to guess, it might be an issue in the test framework occasionally 
picking up the wrong schema or something?

> Harden TestSQLHandler.
> --
>
> Key: SOLR-13040
> URL: https://issues.apache.org/jira/browse/SOLR-13040
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Joel Bernstein
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8374) Reduce reads for sparse DocValues

2018-12-04 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708972#comment-16708972
 ] 

Yonik Seeley commented on LUCENE-8374:
--

bq. as for turning on optionally, then it was part of my first patch as a 
static global switch

That sounds like a good compromise... just make it expert/experimental so it 
can be removed later.
One nice thing about search-time is that it doesn't introduce any index format 
back-compat issues - it can be evolved or removed partially or entirely when 
the index format improves.

> Reduce reads for sparse DocValues
> -
>
> Key: LUCENE-8374
> URL: https://issues.apache.org/jira/browse/LUCENE-8374
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 7.5, master (8.0)
>Reporter: Toke Eskildsen
>Priority: Major
>  Labels: performance
> Attachments: LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, 
> LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, LUCENE-8374.patch, 
> LUCENE-8374_branch_7_3.patch, LUCENE-8374_branch_7_3.patch.20181005, 
> LUCENE-8374_branch_7_4.patch, LUCENE-8374_branch_7_5.patch, 
> LUCENE-8374_part_1.patch, LUCENE-8374_part_2.patch, LUCENE-8374_part_3.patch, 
> LUCENE-8374_part_4.patch, entire_index_logs.txt, 
> image-2018-10-24-07-30-06-663.png, image-2018-10-24-07-30-56-962.png, 
> single_vehicle_logs.txt, 
> start-2018-10-24-1_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png,
>  
> start-2018-10-24_snapshot___Users_tim_Snapshots__-_YourKit_Java_Profiler_2017_02-b75_-_64-bit.png
>
>
> The {{Lucene70DocValuesProducer}} has the internal classes 
> {{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path), 
> which again uses {{IndexedDISI}} to handle the docID -> value-ordinal lookup. 
> The value-ordinal is the index of the docID assuming an abstract tightly 
> packed monotonically increasing list of docIDs: If the docIDs with 
> corresponding values are {{[0, 4, 1432]}}, their value-ordinals will be {{[0, 
> 1, 2]}}.
> h2. Outer blocks
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values 
> (65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16 
> values) or {{SPARSE}} (< 2^12 values ~= 6%). Consequently blocks vary quite a 
> lot in size and ordinal resolving strategy.
> When a sparse Numeric DocValue is needed, the code first locates the block 
> containing the wanted docID flag. It does so by iterating blocks one-by-one 
> until it reaches the needed one, where each iteration requires a lookup in 
> the underlying {{IndexSlice}}. For a common memory mapped index, this 
> translates to either a cached request or a read operation. If a segment has 
> 6M documents, worst-case is 91 lookups. In our web archive, our segments have 
> ~300M values: A worst-case of 4577 lookups!
> One obvious solution is to use a lookup-table for blocks: A long[]-array with 
> an entry for each block. For 6M documents, that is < 1KB and would allow for 
> direct jumping (a single lookup) in all instances. Unfortunately this 
> lookup-table cannot be generated upfront when the writing of values is purely 
> streaming. It can be appended to the end of the stream before it is closed, 
> but without knowing the position of the lookup-table the reader cannot seek 
> to it.
> One strategy for creating such a lookup-table would be to generate it during 
> reads and cache it for next lookup. This does not fit directly into how 
> {{IndexedDISI}} currently works (it is created anew for each invocation), but 
> could probably be added with a little work. An advantage to this is that this 
> does not change the underlying format and thus could be used with existing 
> indexes.
> h2. The lookup structure inside each block
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the 
> ordinal is simply the requested docID with some modulo and multiply math. 
> Nothing to improve there.
> If the block is {{DENSE}} (2^12 to 2^16 values are defined), a bitmap is used 
> and the number of set bits up to the wanted index (the docID modulo the block 
> origo) are counted. That bitmap is a long[1024], meaning that worst case is 
> to lookup and count all set bits for 1024 longs!
> One known solution to this is to use a [rank 
> structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I 
> [implemented 
> it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java]
>  for a related project and with that, the rank-overhead for a {{DENSE}} 
> block would be long[32] and would ensure a maximum of 9 lookups. It is not 
> trivial to build the rank-structure and caching it (assuming all blocks are 
> dense) for 6M 

[jira] [Commented] (SOLR-12839) add a 'resort' option to JSON faceting

2018-11-30 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705178#comment-16705178
 ] 

Yonik Seeley commented on SOLR-12839:
-

Yeah, I think this is OK - my main objection was going to be the name 
"approximate", which strongly suggests that an estimate is fine. "prelim_sort" 
seems fine.
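
To make the accepted naming concrete, a request would look something like this 
(field and facet names are made up; {{relatedness($fore,$back)}} is the skg 
aggregation mentioned below): buckets are first ranked by the cheap 
{{prelim_sort}}, and only the top {{limit}} buckets are then re-sorted by the 
expensive stat:

{code:java}
json.facet={
  top_cats : {
    type : terms,
    field : cat_s,
    limit : 10,
    prelim_sort : "count desc",
    sort : { skg : desc },
    facet : { skg : "relatedness($fore,$back)" }
  }
}
{code}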

> add a 'resort' option to JSON faceting
> --
>
> Key: SOLR-12839
> URL: https://issues.apache.org/jira/browse/SOLR-12839
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-12839.patch, SOLR-12839.patch, SOLR-12839.patch, 
> SOLR-12839.patch
>
>
> As discusssed in SOLR-9480 ...
> bq. Similar to how the {{rerank}} request param allows people to collect & 
> score documents using a "cheap" query, and then re-score the top N using a 
> more expensive query, I think it would be handy if JSON Facets supported a 
> {{resort}} option that could be used on any FacetRequestSorted instance right 
> alongside the {{sort}} param, using the same JSON syntax, so that clients 
> could have Solr internally sort all the facet buckets by something simple 
> (like count) and then "Re-Sort" the top N=limit (or maybe 
> N=limit+overrequest?) using a more expensive function like skg()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-13024) ValueSourceAugmenter - avoid creating new FunctionValues per doc

2018-11-29 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-13024:

Summary: ValueSourceAugmenter - avoid creating new FunctionValues per doc   
(was: ValueSourceAugmenter )

> ValueSourceAugmenter - avoid creating new FunctionValues per doc 
> -
>
> Key: SOLR-13024
> URL: https://issues.apache.org/jira/browse/SOLR-13024
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: 7.0
>Reporter: Yonik Seeley
>Priority: Major
>
> The cutover to iterators in LUCENE-7407 meant that ValueSourceAugmenter (which 
> handles functions in the "fl" param alongside other fields) re-retrieves 
> FunctionValues for every document.
> Caching could cut that in half, but we should really retrieve a window at a 
> time in order for best performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13024) ValueSourceAugmenter

2018-11-29 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704054#comment-16704054
 ] 

Yonik Seeley commented on SOLR-13024:
-

The change from  LUCENE-7407:
{code}
git show f7aa200d40  
./solr/core/src/java/org/apache/solr/response/transform/ValueSourceAugmenter.java
commit f7aa200d406dbd05a35d6116198302d90b92cb29
Author: Mike McCandless 
Date:   Wed Sep 21 09:41:41 2016 -0400

LUCENE-7407: switch doc values usage to an iterator API, based on 
DocIdSetIterator, instead of random acces, freeing codecs for future

diff --git 
a/solr/core/src/java/org/apache/solr/response/transform/ValueSourceAugmenter.java
 b/solr/core/src/java/org/apache/solr/response
index 9edf826e2c..c37dd80bfb 100644
--- 
a/solr/core/src/java/org/apache/solr/response/transform/ValueSourceAugmenter.java
+++ 
b/solr/core/src/java/org/apache/solr/response/transform/ValueSourceAugmenter.java
@@ -65,7 +65,6 @@ public class ValueSourceAugmenter extends DocTransformer
 try {
   searcher = context.getSearcher();
   readerContexts = searcher.getIndexReader().leaves();
-  docValuesArr = new FunctionValues[readerContexts.size()];
   fcontext = ValueSource.newContext(searcher);
   this.valueSource.createWeight(fcontext, searcher);
 } catch (IOException e) {
@@ -76,7 +75,6 @@ public class ValueSourceAugmenter extends DocTransformer
   Map fcontext;
   SolrIndexSearcher searcher;
   List readerContexts;
-  FunctionValues docValuesArr[];

   @Override
   public void transform(SolrDocument doc, int docid, float score) {
@@ -87,11 +85,7 @@ public class ValueSourceAugmenter extends DocTransformer
   // TODO: calculate this stuff just once across diff functions
   int idx = ReaderUtil.subIndex(docid, readerContexts);
   LeafReaderContext rcontext = readerContexts.get(idx);
-  FunctionValues values = docValuesArr[idx];
-  if (values == null) {
-docValuesArr[idx] = values = valueSource.getValues(fcontext, rcontext);
-  }
-
+  FunctionValues values = valueSource.getValues(fcontext, rcontext);
   int localId = docid - rcontext.docBase;
   setValue(doc,values.objectVal(localId));
{code}
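
Restoring something like the removed per-leaf cache would avoid recreating 
FunctionValues per document (sketch below, reusing the names from the diff; as 
the summary says, fetching a window of documents at a time would be better 
still):

{code:java}
// inside transform(): cache FunctionValues per leaf reader context, as the
// pre-LUCENE-7407 code did, instead of calling getValues() for every document
FunctionValues values = docValuesArr[idx];
if (values == null) {
  docValuesArr[idx] = values = valueSource.getValues(fcontext, rcontext);
}
int localId = docid - rcontext.docBase;
setValue(doc, values.objectVal(localId));
{code}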

> ValueSourceAugmenter 
> -
>
> Key: SOLR-13024
> URL: https://issues.apache.org/jira/browse/SOLR-13024
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Affects Versions: 7.0
>Reporter: Yonik Seeley
>Priority: Major
>
> The cutover to iterators in LUCENE-7407 meant that ValueSourceAugmenter (which 
> handles functions in the "fl" param alongside other fields) re-retrieves 
> FunctionValues for every document.
> Caching could cut that in half, but we should really retrieve a window at a 
> time in order for best performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13024) ValueSourceAugmenter

2018-11-29 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-13024:
---

 Summary: ValueSourceAugmenter 
 Key: SOLR-13024
 URL: https://issues.apache.org/jira/browse/SOLR-13024
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: search
Affects Versions: 7.0
Reporter: Yonik Seeley


The cutover to iterators in LUCENE-7407 meant that ValueSourceAugmenter (which 
handles functions in the "fl" param alongside other fields) re-retrieves 
FunctionValues for every document.

Caching could cut that in half, but we should really retrieve a window at a 
time in order for best performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-26 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699102#comment-16699102
 ] 

Yonik Seeley commented on SOLR-13013:
-

bq. Are you thinking about making something generic? Maybe a bulk request 
wrapper for doc values, that temporarily re-sorts internally?

Yep.  Something that collects out-of-order docids along with other value 
sources that should be internally retrieved mostly in-order.
It shouldn't hold up this issue though. I just bring it up to get it on other 
people's radar (it's been on my TODO list for years...) and because it's 
related to this issue.
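
A rough sketch of that idea for a single leaf and a single numeric DocValues 
field (names are invented; real code would group docids by segment and handle 
multiple value sources at once):

{code:java}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

public class OrderedDVFetch {
  // Visit out-of-order docids in increasing docID order (as the iterator
  // API requires), then hand values back in the caller's original order.
  static long[] fetchInDocIdOrder(LeafReader leaf, String field, int[] docs)
      throws IOException {
    Integer[] slots = new Integer[docs.length];
    for (int i = 0; i < slots.length; i++) slots[i] = i;
    Arrays.sort(slots, Comparator.comparingInt(s -> docs[s]));

    NumericDocValues dv = leaf.getNumericDocValues(field);
    long[] values = new long[docs.length];
    for (Integer slot : slots) {
      if (dv != null && dv.advanceExact(docs[slot])) {
        values[slot] = dv.longValue();
      }
    }
    return values;
  }
}
{code}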

> Change export to extract DocValues in docID order
> -
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Export Writer
>Affects Versions: 7.5, master (8.0)
>Reporter: Toke Eskildsen
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch, 
> SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for 
> paging through the result set in a given sort order. Each time a window has 
> been calculated, the values for the export fields are retrieved from the 
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support 
> random access. The current export implementation bypasses this by creating a 
> new DocValues-iterator for each individual value to retrieve. This slows down 
> export as the iterator has to seek to the given docID from start for each 
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
> alternative is to extract the DocValues in docID-order, with re-use of 
> DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if 
> subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
> # Deliver the values
> One big difference from the current export code is of course the need to hold 
> the whole sliding window scaled result set in memory. This might well be a 
> showstopper as there is no real limit to how large this partial result set 
> can be. Maybe such an optimization could be requested explicitly if the user 
> knows that there is enough memory?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-25 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698258#comment-16698258
 ] 

Yonik Seeley commented on SOLR-13013:
-

Great results!

Retrieving results in order in batches has also been a TODO for augmenters 
(specifically, the ability to retrieve function query results alongside field 
results) ever since they were added to Solr, because some function queries need 
to be accessed in order to be efficient.  With the changes to iterators for 
docvalues, and the ability to retrieve stored fields using document values, 
this becomes even more important.


> Change export to extract DocValues in docID order
> -
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Export Writer
>Affects Versions: 7.5, master (8.0)
>Reporter: Toke Eskildsen
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for 
> paging through the result set in a given sort order. Each time a window has 
> been calculated, the values for the export fields are retrieved from the 
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support 
> random access. The current export implementation bypasses this by creating a 
> new DocValues-iterator for each individual value to retrieve. This slows down 
> export as the iterator has to seek to the given docID from start for each 
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
> alternative is to extract the DocValues in docID-order, with re-use of 
> DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if 
> subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
>  # Deliver the values
> One big difference from the current export code is of course the need to hold 
> the whole sliding-window-sized result set in memory. This might well be a 
> showstopper as there is no real limit to how large this partial result set 
> can be. Maybe such an optimization could be requested explicitly if the user 
> knows that there is enough memory?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12074) Add numeric typed equivalents to StrField

2018-11-18 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691097#comment-16691097
 ] 

Yonik Seeley commented on SOLR-12074:
-

bq. It'd be nifty if PointField could additionally have a Terms index for these 
full-precision terms instead of requiring a separate field in the schema.

+1, it's important for it to be the same field in the schema, both for 
usability and so that Solr knows how to optimize single-valued lookups.
If we could turn back time, I'd argue for keeping "indexed=true" in the schema 
to mean normal full-text index, and then use another name for the BKD structure 
(rangeIndexed=true? pointIndexed=true?), but I guess that ship has sailed.

So what should the name of the new flag for the schema be?
valueIndexed?
termIndexed?
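
To make the question concrete, the flag might look like this in a schema 
(purely illustrative; neither proposed name exists today):

{code:xml}
<!-- Hypothetical: "termIndexed" is one of the names being floated, not a real
     flag. The BKD/points structure would keep serving range queries, while the
     extra terms index makes single-value field:value lookups cheap again. -->
<field name="popularity" type="plong" indexed="true" termIndexed="true"
       docValues="true" stored="true"/>
{code}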




> Add numeric typed equivalents to StrField
> -
>
> Key: SOLR-12074
> URL: https://issues.apache.org/jira/browse/SOLR-12074
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: David Smiley
>Priority: Major
>  Labels: newdev, numeric-tries-to-points
>
> There ought to be numeric typed equivalents to StrField in the schema.  The 
> TrieField types can be configured to do this with precisionStep=0, but the 
> TrieFields are deprecated and slated for removal in 8.0.  PointFields may be 
> adequate for some use cases but, unlike TrieField, it's not as efficient for 
> simple field:value lookup queries.  They probably should use the same 
> internal sortable full-precision term format that TrieField uses (details 
> currently in {{LegacyNumericUtils}} (which are used by the deprecated Trie 
> fields).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12632) Completely remove Trie fields

2018-11-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686838#comment-16686838
 ] 

Yonik Seeley commented on SOLR-12632:
-

If docValues are enabled, hopefully current point fields aren't slower for 
things like statistics.  But I could see them being slower for faceting (which 
uses single-value lookups for things like refinement, or calculating the domain 
for a sub-facet).

> Completely remove Trie fields
> -
>
> Key: SOLR-12632
> URL: https://issues.apache.org/jira/browse/SOLR-12632
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
>Priority: Blocker
>  Labels: numeric-tries-to-points
> Fix For: master (8.0)
>
>
> Trie fields were deprecated in Solr 7.0.  We should remove them completely 
> before we release Solr 8.0.
> Unresolved points-related issues: 
> [https://jira.apache.org/jira/issues/?jql=project=SOLR+AND+labels=numeric-tries-to-points+AND+resolution=unresolved]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12632) Completely remove Trie fields

2018-11-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686768#comment-16686768
 ] 

Yonik Seeley commented on SOLR-12632:
-

The performance hit seems more important than exactly when deprecated 
functionality is removed.
We should have a superior single numeric field that is better at both range 
queries and single value matches before we remove the existing field (trie) 
that can do both well.

> Completely remove Trie fields
> -
>
> Key: SOLR-12632
> URL: https://issues.apache.org/jira/browse/SOLR-12632
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
>Priority: Blocker
>  Labels: numeric-tries-to-points
> Fix For: master (8.0)
>
>
> Trie fields were deprecated in Solr 7.0.  We should remove them completely 
> before we release Solr 8.0.
> Unresolved points-related issues: 
> [https://jira.apache.org/jira/issues/?jql=project=SOLR+AND+labels=numeric-tries-to-points+AND+resolution=unresolved]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12638) Support atomic updates of nested/child documents for nested-enabled schema

2018-10-23 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661581#comment-16661581
 ] 

Yonik Seeley commented on SOLR-12638:
-

Somewhat related: perhaps it should be best practice to include the parent 
document id in the child document id (with a "!" separator).  Things should 
then just work for anyone following this convention with the default 
compositeId router.  For example, "id:mybook!myreview".  The ability to specify 
_route_ explicitly should always be there of course.
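
As a concrete (illustrative) sketch of the convention, with made-up field names:

{code:json}
{ "id": "mybook",
  "type_s": "book",
  "reviews": [
    { "id": "mybook!myreview",
      "stars_i": 4 }
  ]
}
{code}

With the default compositeId router, the "mybook!" prefix hashes the child to 
the same shard as its parent, so an atomic update addressed to 
"mybook!myreview" lands on the right shard without an explicit _route_ param.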
 

> Support atomic updates of nested/child documents for nested-enabled schema
> --
>
> Key: SOLR-12638
> URL: https://issues.apache.org/jira/browse/SOLR-12638
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: mosh
>Priority: Major
> Attachments: SOLR-12638-delete-old-block-no-commit.patch, 
> SOLR-12638-nocommit.patch
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> I have been toying with the thought of using this transformer in conjunction 
> with NestedUpdateProcessor and AtomicUpdate to allow SOLR to completely 
> re-index the entire nested structure. This is just a thought; I am still 
> thinking about implementation details. Hopefully I will be able to post a 
> more concrete proposal soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7996) Should we require positive scores?

2018-10-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649548#comment-16649548
 ] 

Yonik Seeley commented on LUCENE-7996:
--

Ah, I see.  Thanks for the pointer!

> Should we require positive scores?
> --
>
> Key: LUCENE-7996
> URL: https://issues.apache.org/jira/browse/LUCENE-7996
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (8.0)
>
> Attachments: LUCENE-7996.patch, LUCENE-7996.patch, LUCENE-7996.patch
>
>
> Having worked on MAXSCORE recently, things would be simpler if we required 
> that scores are positive. Practically, this would mean 
>  - forbidding/fixing similarities that may produce negative scores (we have 
> some of them)
>  - forbidding things like negative boosts
> So I'd be curious to have opinions whether this would be a sane requirement 
> or whether we need to be able to cope with negative scores eg. because some 
> similarities that we want to support produce negative scores by design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12711) Count dominating child field values

2018-10-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649537#comment-16649537
 ] 

Yonik Seeley commented on SOLR-12711:
-

Could think of it like a block limit I guess.  One way to specify would be a 
sort and a limit (i.e. you could select the 3 latest child documents).
This could also be extended beyond blocks to buckets/domains.
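
Hand-waving a bit, a hypothetical spelling of that block limit (the 
{{blockLimit}} option is invented here purely to sketch the idea; 
{{blockChildren}} and {{uniqueBlock}} are real):

{code:json}
{ "colors": {
    "type": "terms",
    "field": "color_s",
    "domain": { "blockChildren": "type_s:product",
                "blockLimit": { "sort": "date_dt desc", "limit": 3 } },
    "facet": { "products": "uniqueBlock(_root_)" }
} }
{code}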

> Count dominating child field values
> ---
>
> Key: SOLR-12711
> URL: https://issues.apache.org/jira/browse/SOLR-12711
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Mikhail Khludnev
>Priority: Major
>
> h2. Context
> {{uniqueBlock(_root_)}}, which was introduced in SOLR-8998, allows counting 
> child-field facet hits grouped by parent, i.e. hitting every parent only once.
> h2. Problem
> How to count only the dominating child field value, i.e. if a product has 5 Red 
> SKUs and 2 Blue, it contributes {{Red(1)}}, {{Blue(0)}}.
> h2. Suggestion
> Introduce {{dominatingBlock(_root_)}}, which aggregates hits per parent, 
> chooses the dominating value, and increments only it.
> h2. Further Work
> Judge the dominating value not by the number of child hits but by a given 
> function value, e.g. pick the most popular, best-selling, or a random child 
> field value as dominating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7996) Should we require positive scores?

2018-10-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649521#comment-16649521
 ] 

Yonik Seeley commented on LUCENE-7996:
--

bq.  If we don't require non-negative scores, then we would need some way for 
scorers to tell whether they may produce negative scores 

I assumed we already had logic to disable the optimizations for certain 
scorers.  For example, isn't it true that if I embed an arbitrary function 
query today (even one with all positive scores), these optimizations are 
already disabled?

> Should we require positive scores?
> --
>
> Key: LUCENE-7996
> URL: https://issues.apache.org/jira/browse/LUCENE-7996
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (8.0)
>
> Attachments: LUCENE-7996.patch, LUCENE-7996.patch, LUCENE-7996.patch
>
>
> Having worked on MAXSCORE recently, things would be simpler if we required 
> that scores are positive. Practically, this would mean 
>  - forbidding/fixing similarities that may produce negative scores (we have 
> some of them)
>  - forbidding things like negative boosts
> So I'd be curious to have opinions whether this would be a sane requirement 
> or whether we need to be able to cope with negative scores eg. because some 
> similarities that we want to support produce negative scores by design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12839) add a 'resort' option to JSON faceting

2018-10-14 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649387#comment-16649387
 ] 

Yonik Seeley commented on SOLR-12839:
-

bq. if "foo desc" is the primary sort, and "bar asc" is the tiebreaker, then 
what is being resorted on?

 "foo desc, bar asc 50" was an example of a single sort with tiebreak and a 
limit (no resort).
If one wanted a single-string version, ";" would be the divider.  For example, 
adding a resort with a tiebreak: "foo desc, bar asc 50; baz desc, qux asc 10"

bq. why/how/when would it make sense to resort multiple times?
If there are use cases for starting with N sorted things and reducing that to K 
with a different sort, then it's just sort of recursive.  Why would there be 
use cases for one resort and not two resorts?

One use case that comes to mind are stock screens I've seen that consist of 
multiple sorting and "take top N" steps.
Example: Sort by current dividend yield and take the top 100, then sort those 
by low PE and take the top 50, then sort those by total return 1 year and take 
the top 10.

bq. or how it could work practically given the 2 phrase nature of distributed 
facet refinement.

Hmm, good point.  Over the long term I'd always imagined the number of phases 
could be variable, so it's more of a current implementation detail (albeit a 
very major one).  It would currently kill the usefulness in distributed mode 
though.
 
Anyway, we don't have to worry about multiple resorts now as long as we can 
unambiguously upgrade if desired later (i.e. whatever the resort spec looks 
like, if we can unambiguously wrap an array around it later and specify 
multiple of them, then we're good).
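
To make the stock screen example concrete (hypothetical syntax; {{resort}} 
doesn't exist yet and the final spelling may well differ):

{code:json}
{ "screen": {
    "type": "terms",
    "field": "ticker_s",
    "sort": "div_yield_d desc",
    "limit": 100,
    "resort": [
      { "sort": "pe_d asc", "limit": 50 },
      { "sort": "return_1yr_d desc", "limit": 10 }
    ]
} }
{code}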

> add a 'resort' option to JSON faceting
> --
>
> Key: SOLR-12839
> URL: https://issues.apache.org/jira/browse/SOLR-12839
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-12839.patch, SOLR-12839.patch
>
>
> As discussed in SOLR-9480 ...
> bq. Similar to how the {{rerank}} request param allows people to collect & 
> score documents using a "cheap" query, and then re-score the top N using a 
> more expensive query, I think it would be handy if JSON Facets supported a 
> {{resort}} option that could be used on any FacetRequestSorted instance right 
> alongside the {{sort}} param, using the same JSON syntax, so that clients 
> could have Solr internally sort all the facet buckets by something simple 
> (like count) and then "Re-Sort" the top N=limit (or maybe 
> N=limit+overrequest?) using a more expensive function like skg()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12839) add a 'resort' option to JSON faceting

2018-10-13 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649244#comment-16649244
 ] 

Yonik Seeley commented on SOLR-12839:
-

We should perhaps think about how to extend to N sorts instead of 2.
Also keeping in mind that sort should be able to have tiebreaks someday.

Brainstorming syntax:
Maybe just append a number to our existing sort syntax, so we would get 
something like "foo desc, bar asc 50" (bar would be a tiebreak in this case).
So two resorts in a row could be
  "field1 asc 100; field2 desc 10" or a slightly more decomposed array ["field1 
asc 100","field2 desc 10"].
Or given that this is just an extension of the sort syntax, it could even just 
go in the "sort" param itself and not bother with "resort": 
sort:"count desc 5" could be a synonym for sort:"count desc",limit:5

It's late and my slides for Activate aren't done... take it for what it's 
worth ;-)

> add a 'resort' option to JSON faceting
> --
>
> Key: SOLR-12839
> URL: https://issues.apache.org/jira/browse/SOLR-12839
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-12839.patch, SOLR-12839.patch
>
>
> As discussed in SOLR-9480 ...
> bq. Similar to how the {{rerank}} request param allows people to collect & 
> score documents using a "cheap" query, and then re-score the top N using a 
> more expensive query, I think it would be handy if JSON Facets supported a 
> {{resort}} option that could be used on any FacetRequestSorted instance right 
> alongside the {{sort}} param, using the same JSON syntax, so that clients 
> could have Solr internally sort all the facet buckets by something simple 
> (like count) and then "Re-Sort" the top N=limit (or maybe 
> N=limit+overrequest?) using a more expensive function like skg()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12325) introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet

2018-10-13 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649083#comment-16649083
 ] 

Yonik Seeley commented on SOLR-12325:
-

Yep, this should be pretty easy to do, following the same type of strategy as 
uniqueBlock.
I wish we had named parameters for the function parser already... then we could 
use uniqueBlock(parents=type:product) and avoid another function name.
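
For reference, what the twin might look like next to the existing aggregation 
(the {{uniqueBlockQuery}} spelling is just the proposal here, and the field 
names are made up):

{code:json}
{ "colors": {
    "type": "terms",
    "field": "color_s",
    "domain": { "blockChildren": "type_s:product" },
    "facet": {
      "byRoot": "uniqueBlock(_root_)",
      "byQuery": "uniqueBlockQuery(type_s:product)"
    }
} }
{code}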

> introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet
> --
>
> Key: SOLR-12325
> URL: https://issues.apache.org/jira/browse/SOLR-12325
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Mikhail Khludnev
>Priority: Major
>
> It might be a faster twin for {{uniqueBlock(\_root_)}}. Please utilise the 
> built-in query parsing method; don't invent your own.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7996) Should we require positive scores?

2018-10-10 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645147#comment-16645147
 ] 

Yonik Seeley commented on LUCENE-7996:
--

bq. WAND and other optimizations were the reason why I opened this issue and 
moved it forward

I understand why we wouldn't want to produce negative scores by default, as 
that would complicate or prevent such optimizations by default.
What I don't understand is what we gain by prohibiting negative scores across 
the board.  We can only do these optimizations in certain cases anyway, so we 
don't gain anything by prohibiting a function query (for example) from 
producing negative values.  This would seem to limit the use cases without any 
corresponding gain in optimization opportunities.


> Should we require positive scores?
> --
>
> Key: LUCENE-7996
> URL: https://issues.apache.org/jira/browse/LUCENE-7996
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (8.0)
>
> Attachments: LUCENE-7996.patch, LUCENE-7996.patch, LUCENE-7996.patch
>
>
> Having worked on MAXSCORE recently, things would be simpler if we required 
> that scores are positive. Practically, this would mean 
>  - forbidding/fixing similarities that may produce negative scores (we have 
> some of them)
>  - forbidding things like negative boosts
> So I'd be curious to have opinions whether this would be a sane requirement 
> or whether we need to be able to cope with negative scores eg. because some 
> similarities that we want to support produce negative scores by design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7996) Should we require positive scores?

2018-10-07 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641096#comment-16641096
 ] 

Yonik Seeley commented on LUCENE-7996:
--

{quote}Agreed some users are going to be annoyed by the impact of this change. 
I wouldn't have considered it if it wasn't a requirement to get speedups in the 
order of what we are observing on LUCENE-4100 and LUCENE-7993.
{quote}

But maxscore/impact optimizations can only be used in certain circumstances 
anyway, right?  Given that we need a fallback to score-all for things that 
aren't supported, falling back rather than prohibiting negative scores would 
avoid the back-compat breaks.

> Should we require positive scores?
> --
>
> Key: LUCENE-7996
> URL: https://issues.apache.org/jira/browse/LUCENE-7996
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (8.0)
>
> Attachments: LUCENE-7996.patch, LUCENE-7996.patch, LUCENE-7996.patch
>
>
> Having worked on MAXSCORE recently, things would be simpler if we required 
> that scores are positive. Practically, this would mean 
>  - forbidding/fixing similarities that may produce negative scores (we have 
> some of them)
>  - forbidding things like negative boosts
> So I'd be curious to have opinions whether this would be a sane requirement 
> or whether we need to be able to cope with negative scores eg. because some 
> similarities that we want to support produce negative scores by design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12820) Auto pick method:dvhash based on thresholds

2018-10-03 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636921#comment-16636921
 ] 

Yonik Seeley commented on SOLR-12820:
-

bq. // Trying to find the cardinality for the matchingDocs would be expensive.

The heuristic I had in mind would just use the cardinality of the whole field 
in conjunction with fcontext.base.size().
For example, if one is faceting on US states (50 values), you're pretty much 
always going to want to use the array approach.  Comparing to maxDoc isn't too 
meaningful here.

Even though it may not be implemented yet, we should also keep multi-valued 
fields in mind when thinking about the API access/control for this.
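
A rough sketch of that heuristic (illustrative only; the terms-dictionary 
cardinality lookup and the factor of 4 are placeholder assumptions, not a 
patch):

{code:java}
// Illustrative only: prefer DVHASH when the field's global term cardinality
// is much larger than the number of matching docs (the base DocSet size).
private boolean preferDvHash(FacetContext fcontext, SchemaField sf) throws IOException {
  Terms terms = fcontext.searcher.getSlowAtomicReader().terms(sf.getName());
  long fieldCardinality = terms == null ? -1 : terms.size(); // -1 when unknown
  int matchingDocs = fcontext.base.size();
  // Faceting on US states: 50 values vs a large DocSet, so the counting-array
  // approach wins no matter what maxDoc is.
  return fieldCardinality > 0 && fieldCardinality > 4L * matchingDocs;
}
{code}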

> Auto pick method:dvhash based on thresholds
> ---
>
> Key: SOLR-12820
> URL: https://issues.apache.org/jira/browse/SOLR-12820
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Varun Thacker
>Priority: Major
>
> I've worked with two users last week where explicitly using method:dvhash 
> improved the faceting speeds drastically.
> The common theme in both use-cases was: one collection hosting data for 
> multiple users.  We always filter documents for one user (thereby limiting 
> the number of documents drastically) and then perform a complex nested 
> JSON facet.
> Both use-cases fit perfectly into the criteria that [~yo...@apache.org] 
> mentioned on SOLR-9142:
> {quote}faceting on a string field with a high cardinality compared to its 
> domain is less efficient than it could be.
> {quote}
> And DVHASH was the perfect optimization for these use-cases.
> We are using the facet stream expression in one of the use-cases, which 
> doesn't expose the method param. We could expose the method param to facet 
> stream, but I feel the better approach to solve this problem would be to 
> address this TODO in the code within the JSON Facet Module:
> {code:java}
>   if (mincount > 0 && prefix == null && (ntype != null || method == 
> FacetMethod.DVHASH)) {
>     // TODO can we auto-pick for strings when term cardinality is much 
> greater than DocSet cardinality?
>     //   or if we don't know cardinality but DocSet size is very small
>     return new FacetFieldProcessorByHashDV(fcontext, this, sf);{code}
> I thought about this a little, and this is the approach I am currently 
> thinking of to tackle this problem:
> {code:java}
> int matchingDocs = fcontext.base.size();
> int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
> //if matchingDocs is close to the totalDocs then we aren't filtering many 
> documents.
> //that means the array approach would probably be better than the dvhash 
> approach
> //Trying to find the cardinality for the matchingDocs would be expensive.
> //Also for totalDocs we don't have a global cardinality present at index time 
> but we have a per segment cardinality
> //So using the number of matches as an alternate heuristic would do the job 
> here?{code}
> Any thoughts on whether this approach makes sense? It could be that I'm 
> thinking of this approach just because both the users I worked with last week 
> fell into this category.
>  
> cc [~dsmiley] [~joel.bernstein]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8335) HdfsLockFactory does not allow core to come up after a node was killed

2018-09-21 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624023#comment-16624023
 ] 

Yonik Seeley commented on SOLR-8335:


OK, so for this attached patch, it looks like keeping the lock requires 
touching it periodically (like a lease).
I'm not enough of an expert on HDFS intricacies to know if this is the best 
approach, but this patch has gone a year w/ no feedback.  Anyone have anything 
to add on whether this is the right approach or not?

 It's probably best not to introduce new dependencies (hamcrest) along with a 
patch unless they are really necessary though.
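
For anyone skimming, the lease idea boils down to something like this (an 
illustrative sketch, not the attached patch; {{fs}} and {{Path}} are Hadoop's 
FileSystem API, and the interval is made up):

{code:java}
// Sketch of a lock "lease": refresh the lock file's mtime periodically so a
// crashed node's lock goes stale and can be safely broken once its age
// exceeds the lease interval.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
Path lock = new Path("hdfs://localhost:9000/solr/test/data/index/write.lock");
scheduler.scheduleAtFixedRate(() -> {
  try {
    long now = System.currentTimeMillis();
    fs.setTimes(lock, now, now); // touch the lock to renew the lease
  } catch (IOException e) {
    // a real implementation should surface a failed lease renewal loudly
  }
}, 5, 5, TimeUnit.SECONDS);
{code}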

> HdfsLockFactory does not allow core to come up after a node was killed
> --
>
> Key: SOLR-8335
> URL: https://issues.apache.org/jira/browse/SOLR-8335
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 5.0, 5.1, 5.2, 5.2.1, 5.3, 5.3.1
>Reporter: Varun Thacker
>Assignee: Mark Miller
>Priority: Major
> Attachments: SOLR-8335.patch
>
>
> When using HdfsLockFactory, if a node gets killed instead of a graceful 
> shutdown, the write.lock file remains in HDFS. The next time you start the 
> node, the core doesn't load up because of LockObtainFailedException.
> I was able to reproduce this in all 5.x versions of Solr. The problem wasn't 
> there when I tested it in 4.10.4.
> Steps to reproduce this on 5.x
> 1. Create directory in HDFS : {{bin/hdfs dfs -mkdir /solr}}
> 2. Start Solr: {{bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory 
> -Dsolr.lock.type=hdfs -Dsolr.data.dir=hdfs://localhost:9000/solr 
> -Dsolr.updatelog=hdfs://localhost:9000/solr}}
> 3. Create core: {{./bin/solr create -c test -n data_driven}}
> 4. Kill solr
> 5. The lock file is there in HDFS and is called {{write.lock}}
> 6. Start Solr again and you get a stack trace like this:
> {code}
> 2015-11-23 13:28:04.287 ERROR (coreLoadExecutor-6-thread-1) [   x:test] 
> o.a.s.c.CoreContainer Error creating core [test]: Index locked for write for 
> core 'test'. Solr now longer supports forceful unlocking via 
> 'unlockOnStartup'. Please verify locks manually!
> org.apache.solr.common.SolrException: Index locked for write for core 'test'. 
> Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please 
> verify locks manually!
> at org.apache.solr.core.SolrCore.(SolrCore.java:820)
> at org.apache.solr.core.SolrCore.(SolrCore.java:659)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:723)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:443)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:434)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.lucene.store.LockObtainFailedException: Index locked 
> for write for core 'test'. Solr now longer supports forceful unlocking via 
> 'unlockOnStartup'. Please verify locks manually!
> at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:528)
> at org.apache.solr.core.SolrCore.(SolrCore.java:761)
> ... 9 more
> 2015-11-23 13:28:04.289 ERROR (coreContainerWorkExecutor-2-thread-1) [   ] 
> o.a.s.c.CoreContainer Error waiting for SolrCore to be created
> java.util.concurrent.ExecutionException: 
> org.apache.solr.common.SolrException: Unable to create core [test]
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.solr.core.CoreContainer$2.run(CoreContainer.java:472)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.common.SolrException: Unable to create core [test]
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:737)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:443)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:434)
> ... 5 more
> Caused by: 

[jira] [Commented] (LUCENE-8511) MultiFields.getIndexedFields can be optimized to not use getMergedFieldInfos

2018-09-20 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622786#comment-16622786
 ] 

Yonik Seeley commented on LUCENE-8511:
--

Looks good, +1 to avoiding getMergedFieldInfos() here!

> MultiFields.getIndexedFields can be optimized to not use getMergedFieldInfos
> 
>
> Key: LUCENE-8511
> URL: https://issues.apache.org/jira/browse/LUCENE-8511
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-8511.patch, LUCENE-8511.patch
>
>
> MultiFields.getIndexedFields calls getMergedFieldInfos.  But 
> getMergedFieldInfos is kinda heavy, doing all sorts of stuff that 
> getIndexedFields doesn't care about.  It can simply loop the leaf readers and 
> collect the results into a Set.  Java 8 streams should make easy work of this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11836) Use -1 in bucketSizeLimit to get all facets, analogous to the JSON facet API

2018-09-11 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611275#comment-16611275
 ] 

Yonik Seeley commented on SOLR-11836:
-

limit:-1 should work fine for JSON Facets.

bq. Also when I sent -1 directly to the JSON facet API it didn't return 
results. I'll need to dig into why.
Perhaps other code in the middle (i.e. before it gets to the JSON Facet code) 
manipulates that value and messes it up?

TestJsonFacets randomly specifies limit:-1 so this should be well tested too:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/search/facet/TestJsonFacets.java#L935
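
For reference, the JSON Facet form that returns all buckets (field name 
illustrative):

{code:json}
{ "categories": { "type": "terms", "field": "cat_s", "limit": -1 } }
{code}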


> Use -1 in bucketSizeLimit to get all facets, analogous to the JSON facet API
> 
>
> Key: SOLR-11836
> URL: https://issues.apache.org/jira/browse/SOLR-11836
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Alfonso Muñoz-Pomer Fuentes
>Priority: Major
>  Labels: facet, streaming
> Attachments: SOLR-11836.patch
>
>
> Currently, to retrieve all buckets using the streaming expressions facet 
> function, the {{bucketSizeLimit}} parameter must have a high enough value so 
> that all results will be included. Compare this with the JSON facet API, 
> where you can use {{"limit": -1}} to achieve this. It would help if such a 
> possibility existed.
> [Issue 11236|https://issues.apache.org/jira/browse/SOLR-11236] is related.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11598) Export Writer needs to support more than 4 Sort fields - Say 10, ideally it should not be bound at all, but 4 seems to really short sell the StreamRollup capabilities.

2018-07-17 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546851#comment-16546851
 ] 

Yonik Seeley commented on SOLR-11598:
-

In general, we shouldn't have limits at all on stuff like this.  If the 
performance degradation and memory use is linear, there is no trap waiting to 
bite someone (except for the arbitrary limit itself).


> Export Writer needs to support more than 4 Sort fields - Say 10, ideally it 
> should not be bound at all, but 4 seems to really short sell the StreamRollup 
> capabilities.
> ---
>
> Key: SOLR-11598
> URL: https://issues.apache.org/jira/browse/SOLR-11598
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Affects Versions: 6.6.1, 7.0
>Reporter: Aroop
>Assignee: Varun Thacker
>Priority: Major
>  Labels: patch
> Attachments: SOLR-11598-6_6-streamtests, SOLR-11598-6_6.patch, 
> SOLR-11598-master.patch, SOLR-11598.patch, SOLR-11598.patch, 
> SOLR-11598.patch, SOLR-11598.patch, SOLR-11598.patch, SOLR-11598.patch, 
> streaming-export reports.xlsx
>
>
> I am a user of Streaming and I am currently trying to use rollups on a 
> 10-dimensional document.
> I am unable to get correct results on this query as I am bounded by the 
> limitation of the export handler which supports only 4 sort fields.
> I do not see why this needs to be the case, as it could very well be 10 or 20.
> My current needs would be satisfied with 10, but one would want to ask why 
> can't it be any decent integer n, beyond which we know performance degrades, 
> but even then it should be caveat emptor.
> [~varunthacker] 
> Code Link:
> https://github.com/apache/lucene-solr/blob/19db1df81a18e6eb2cce5be973bf2305d606a9f8/solr/core/src/java/org/apache/solr/handler/ExportWriter.java#L455
> Error
> null:java.io.IOException: A max of 4 sorts can be specified
>   at 
> org.apache.solr.handler.ExportWriter.getSortDoc(ExportWriter.java:452)
>   at org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:228)
>   at 
> org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:219)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeIterator(JavaBinCodec.java:664)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:333)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:223)
>   at org.apache.solr.common.util.JavaBinCodec$1.put(JavaBinCodec.java:394)
>   at 
> org.apache.solr.handler.ExportWriter.lambda$null$2(ExportWriter.java:219)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeMap(JavaBinCodec.java:437)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:354)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:223)
>   at org.apache.solr.common.util.JavaBinCodec$1.put(JavaBinCodec.java:394)
>   at 
> org.apache.solr.handler.ExportWriter.lambda$write$3(ExportWriter.java:217)
>   at 
> org.apache.solr.common.util.JavaBinCodec.writeMap(JavaBinCodec.java:437)
>   at org.apache.solr.handler.ExportWriter.write(ExportWriter.java:215)
>   at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2601)
>   at 
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
>   at 
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:809)
>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:538)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
>   at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>   at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> 

[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-09 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537904#comment-16537904
 ] 

Yonik Seeley commented on SOLR-12343:
-

Looks good, thanks for tracking that down!

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> 
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, 
> SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but *not* returned at all by shard2, because these terms both have very 
> high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cutoff" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the accumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-08 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536095#comment-16536095
 ] 

Yonik Seeley commented on SOLR-12343:
-

I'm occasionally getting a failure in 
testSortedFacetRefinementPushingNonRefinedBucketBackIntoTopN.
I haven't tried digging into it yet though.

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> 
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, 
> SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but *not* returned at all by shard2, because these terms both have very 
> high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cutoff" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the accumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-05 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534377#comment-16534377
 ] 

Yonik Seeley commented on SOLR-12343:
-

bq. it will stop returning the facet range "other" buckets completely since 
currently no code refines them at all

Hmmm, so the patch I attached seems like it would only remove incomplete 
buckets in field facets under "other" buckets (i.e. if they don't actually need 
refining to be complete, they won't be removed by the current patch).  But this 
could still be worse in some cases (missing vs incomplete when refinement is 
requested), so I agree this can wait until  SOLR-12516 is done. 

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> 
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch, 
> SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but *not* returned at all by shard2, because these terms both have very 
> high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cutoff" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the accumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-05 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned SOLR-12343:
---

Assignee: Yonik Seeley

> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> 
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but *not* returned at all by shard2, because these terms both have very 
> high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cutoff" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a 
> significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the accumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-12533) Collection creation fails if metrics are called during core creation

2018-07-03 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-12533.
-
   Resolution: Fixed
Fix Version/s: 7.5

> Collection creation fails if metrics are called during core creation
> 
>
> Key: SOLR-12533
> URL: https://issues.apache.org/jira/browse/SOLR-12533
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.0
>Reporter: Peter Cseh
>Priority: Major
> Fix For: 7.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a race condition in SolrCore's constructor:
> - the metrics.indexSize call implicitly creates a data/index folder for that 
> core
> - if the data/index folder exists, no segments file will be created
> - the searcher won't start up if there are no segments file in the data/index 
> folder
> This is probably the root cause for SOLR-2130 and SOLR-2801 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8378) Add DocIdSetIterator.range method

2018-07-03 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532010#comment-16532010
 ] 

Yonik Seeley edited comment on LUCENE-8378 at 7/3/18 10:18 PM:
---

I assume it's a bug that minDoc is always returned?
edit: oops, sorry, I missed the "static" in the method signature.  I thought 
this was providing a slice of another iterator for a minute.


was (Author: ysee...@gmail.com):
I assume it's a bug that minDoc is always returned?


> Add DocIdSetIterator.range method
> -
>
> Key: LUCENE-8378
> URL: https://issues.apache.org/jira/browse/LUCENE-8378
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8378.patch, LUCENE-8378.patch
>
>
> We already have {{DocIdSetIterator.all}} and {{DocIdSetIterator.empty}} but 
> I'd like to also add a {{range}} method to match a specified range of docids.
> E.g. this can be useful if you sort your index by a key, and then create a 
> custom query to match documents by values for that key, or by range 
> (LUCENE-7714).
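
A quick sketch of the expected behavior (assuming the patch follows the 
half-open [minDoc, maxDoc) convention of the existing iterators):

{code:java}
// matches exactly the docids in [42, 100)
DocIdSetIterator it = DocIdSetIterator.range(42, 100);
assert it.advance(50) == 50;                              // inside the range
assert it.nextDoc() == 51;                                // dense within the range
assert it.advance(100) == DocIdSetIterator.NO_MORE_DOCS;  // past maxDoc
{code}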



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8378) Add DocIdSetIterator.range method

2018-07-03 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532010#comment-16532010
 ] 

Yonik Seeley commented on LUCENE-8378:
--

I assume it's a bug that minDoc is always returned?


> Add DocIdSetIterator.range method
> -
>
> Key: LUCENE-8378
> URL: https://issues.apache.org/jira/browse/LUCENE-8378
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8378.patch, LUCENE-8378.patch
>
>
> We already have {{DocIdSetIterator.all}} and {{DocIdSetIterator.empty}} but 
> I'd like to also add a {{range}} method to match a specified range of docids.
> E.g. this can be useful if you sort your index by a key, and then create a 
> custom query to match documents by values for that key, or by range 
> (LUCENE-7714).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

2018-07-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530657#comment-16530657
 ] 

Yonik Seeley commented on SOLR-12343:
-

I think some of what I just worked on for SOLR-12326 is related to (or can be 
used by) this issue.
FacetRequestSortedMerger now has a "BitSet shardHasMoreBuckets" to help deal 
with the fact that complete buckets do not need participation from every shard. 
 That info in conjunction with Context.sawShard should be enough to tell if a 
bucket is already "complete".
For every bucket that isn't complete, we can either refine it, or drop it.
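
Roughly, the completeness test sketched in code (the real merger types are
simplified to stand-ins here, and the sawShard predicate stands in for
however Context.sawShard is consulted):

{code:java}
import java.util.BitSet;
import java.util.function.BiPredicate;

// A bucket is "complete" when every shard either contributed to it, or
// returned all of its buckets and so has nothing further to contribute.
static boolean isComplete(Object bucketVal, int numShards,
                          BitSet shardHasMoreBuckets,
                          BiPredicate<Integer, Object> sawShard) {
  for (int shard = 0; shard < numShards; shard++) {
    if (!sawShard.test(shard, bucketVal) && shardHasMoreBuckets.get(shard)) {
      return false; // this shard may still hold counts for the bucket
    }
  }
  return true; // safe to skip refinement for this bucket
}
{code}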


> JSON Field Facet refinement can return incorrect counts/stats for sorted 
> buckets
> 
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement 
> can cause _refined_ buckets to be "bumped out" of the topN based on the 
> refined counts/stats depending on the sort - causing _unrefined_ buckets 
> originally discounted in phase#2 to bubble up into the topN and be returned 
> to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a 
> {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low 
> shard1 counts
>  ** but they are *not* returned at all by shard2, because these terms both 
> have very high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of 
> phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets 
> than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete 
> count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have 
> a significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the 
> client have counts/stats that are the accumulation of all shards, but termY 
> only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. 
> Additional overrequest just increases the number of "extra" terms needed in 
> the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during 
> refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) 
> asc|desc}} , etc...
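
For concreteness, a minimal request of the vulnerable shape (collection,
field, and facet names are illustrative):

{code}
curl http://localhost:8983/solr/collection1/query -d '{
  "query": "*:*",
  "facet": {
    "categories": {
      "type": "terms",
      "field": "cat_s",
      "limit": 2,
      "sort": "count asc",
      "refine": true
    }
  }
}'
{code}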



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12533) Collection creation fails if metrics are called during core creation

2018-07-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530356#comment-16530356
 ] 

Yonik Seeley commented on SOLR-12533:
-

These changes look good to me.  I plan on committing after unit tests finish 
running.

> Collection creation fails if metrics are called during core creation
> 
>
> Key: SOLR-12533
> URL: https://issues.apache.org/jira/browse/SOLR-12533
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.0
>Reporter: Peter Cseh
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a race condition in SolrCore's constructor:
> - the metrics.indexSize call implicitly creates a data/index folder for that 
> core
> - if the data/index folder exists, no segments file will be created
> - the searcher won't start up if there are no segments file in the data/index 
> folder
> This is probably the root cause for SOLR-2130 and SOLR-2801 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-12326) Unnecessary refinement requests

2018-06-30 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-12326.
-
   Resolution: Fixed
Fix Version/s: 7.5

> Unnecessary refinement requests
> ---
>
> Key: SOLR-12326
> URL: https://issues.apache.org/jira/browse/SOLR-12326
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.5
>
> Attachments: SOLR-12326.patch, SOLR-12326.patch
>
>
> TestJsonFacets.testStatsDistrib() appears to result in more refinement 
> requests than would otherwise be expected.  Those tests were developed before 
> refinement was implemented and hence do not need refinement to generate 
> correct results due to limited numbers of buckets.  This should be detectable 
> by refinement code in the majority of cases to prevent extra work from being 
> done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-12326) Unnecessary refinement requests

2018-06-28 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned SOLR-12326:
---

  Assignee: Yonik Seeley
Attachment: SOLR-12326.patch

Draft patch attached.  TestJsonFacetRefinement still fails, I assume because 
not all field faceting implementations return "more" yet.  More tests to be 
added as well.

> Unnecessary refinement requests
> ---
>
> Key: SOLR-12326
> URL: https://issues.apache.org/jira/browse/SOLR-12326
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12326.patch
>
>
> TestJsonFacets.testStatsDistrib() appears to result in more refinement 
> requests than would otherwise be expected.  Those tests were developed before 
> refinement was implemented and hence do not need refinement to generate 
> correct results due to limited numbers of buckets.  This should be detectable 
> by refinement code in the majority of cases to prevent extra work from being 
> done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12326) Unnecessary refinement requests

2018-06-21 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519493#comment-16519493
 ] 

Yonik Seeley commented on SOLR-12326:
-

One part of the solution is for the request merger to know if a shard has more 
buckets.  If it knows the exact amount of over-request used, then it can figure 
it out.  This is a little more fragile though, and I could envision future 
optimizations that dynamically change the amount of over-request based on 
things like heuristics, field statistics on that shard, and results of previous 
requests.   For that reason, I'm planning on just passing back more:true for 
field facets that have more values.
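
In other words, a shard-level facet response would carry a flag roughly like
the following, where more:true means the shard had additional buckets beyond
the top-N it returned (placement and naming sketched from this comment, not
a final wire format):

{code}
"cat_s": {
  "buckets": [
    { "val": "electronics", "count": 12 },
    { "val": "memory", "count": 7 }
  ],
  "more": true
}
{code}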

> Unnecessary refinement requests
> ---
>
> Key: SOLR-12326
> URL: https://issues.apache.org/jira/browse/SOLR-12326
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Reporter: Yonik Seeley
>Priority: Major
>
> TestJsonFacets.testStatsDistrib() appears to result in more refinement 
> requests than would otherwise be expected.  Those tests were developed before 
> refinement was implemented and hence do not need refinement to generate 
> correct results due to limited numbers of buckets.  This should be detectable 
> by refinement code in the majority of cases to prevent extra work from being 
> done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8359) Extend ToParentBlockJoinQuery with 'minimum matched children' functionality

2018-06-15 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513889#comment-16513889
 ] 

Yonik Seeley commented on LUCENE-8359:
--

I haven't had a chance to look at the patch, but +1 for the idea of adding the 
high level functionality!

> Extend ToParentBlockJoinQuery with 'minimum matched children' functionality 
> 
>
> Key: LUCENE-8359
> URL: https://issues.apache.org/jira/browse/LUCENE-8359
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Andrey Kudryavtsev
>Priority: Minor
>  Labels: lucene
> Attachments: LUCENE-8359
>
>
> I have hierarchical data in the index and requirements like 'match parent 
> only if at least {{n}} of its children were matched'.  
> I used to solve this with a combination of lucene / solr tricks like 
> 'frange' filtration by the sum of the matched children's scores, so it's 
> doable out of the box with some effort right now. But it could also be 
> solved by a \{{ToParentBlockJoinQuery}} extension with a new numeric 
> parameter; I tried to do that in the attached patch. 
> Not sure if this should be in the main branch; just putting it here in case 
> someone has similar problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9685) tag a query in JSON syntax

2018-06-13 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511238#comment-16511238
 ] 

Yonik Seeley commented on SOLR-9685:


bq. I'm confused about what's happening here as this was resolved again without 
the docs being updated
I had reopened the issue to fix the bug that was found (not for the docs), and 
resolved again after the fix was committed.

> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685-doc.patch, 
> SOLR-9685.patch, SOLR-9685.patch, SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}
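
The syntax settled on elsewhere in this thread uses a "#" prefix on an
object key. A typical end-to-end use, sketched here with illustrative
collection/field names, tags a filter and then excludes it from a facet
domain:

{code}
curl http://localhost:8983/solr/techproducts/query -d '{
  "query": "*:*",
  "filter": [ { "#COLOR": "color:blue" } ],
  "facet": {
    "colors": {
      "type": "terms",
      "field": "color",
      "domain": { "excludeTags": "COLOR" }
    }
  }
}'
{code}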



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-9685) tag a query in JSON syntax

2018-06-12 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-9685.

Resolution: Fixed

OK, I also modified a test to test for the {"#tag":{"lucene" case.  
Right now, excludeTags only works on top-level filters, so we can only test 
that the syntax works for now on these sub-queries I think.
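
Reconstructing the truncated snippet above, the tested case is presumably a
tag wrapped around a named query parser, along these lines (the exact
nesting is an assumption):

{code}
{ "query": { "#TOP": { "lucene": { "df": "text", "query": "memory" } } } }
{code}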

> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685-doc.patch, 
> SOLR-9685.patch, SOLR-9685.patch, SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9685) tag a query in JSON syntax

2018-06-12 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509660#comment-16509660
 ] 

Yonik Seeley commented on SOLR-9685:


Attached draft patch to fix the issue of tagged queries on sub-parsers.


> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685.patch, SOLR-9685.patch, 
> SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-9685) tag a query in JSON syntax

2018-06-12 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-9685:
---
Attachment: SOLR-9685.patch

> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685.patch, SOLR-9685.patch, 
> SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11779) Basic long-term collection of aggregated metrics

2018-06-12 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509609#comment-16509609
 ] 

Yonik Seeley commented on SOLR-11779:
-

I'd consider it a minor bug for our default configs to be throwing exceptions 
by default when nothing is wrong.
I'd suggest that this should not be a WARN level message (and definitely 
shouldn't log an exception).
The text of the log message could be changed to remove the word Error as well, 
since it's not an error case.

Perhaps "No .system collection, keeping metrics history in memory"
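
Concretely, something along these lines (a sketch of the suggestion,
assuming the handler's usual slf4j logger; not a committed change):

{code:java}
// Downgrade from WARN-with-stacktrace to a plain informational message,
// since a missing .system collection is a normal out-of-the-box state.
log.info("No .system collection, keeping metrics history in memory");
{code}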


> Basic long-term collection of aggregated metrics
> 
>
> Key: SOLR-11779
> URL: https://issues.apache.org/jira/browse/SOLR-11779
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 7.3, master (8.0)
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-11779.patch, SOLR-11779.patch, SOLR-11779.patch, 
> SOLR-11779.patch, c1.png, c2.png, core.json, d1.png, d2.png, d3.png, 
> jvm-list.json, jvm-string.json, jvm.json, o1.png, u1.png
>
>
> Tracking the key metrics over time is very helpful in understanding the 
> cluster and user behavior.
> Currently even basic metrics tracking requires setting up an external system 
> and either polling {{/admin/metrics}} or using {{SolrMetricReporter}}-s. The 
> advantage of this setup is that these external tools usually provide a lot of 
> sophisticated functionality. The downside is that they don't ship out of the 
> box with Solr and require additional admin effort to set up.
> Solr could collect some of the key metrics and keep their historical values 
> in a round-robin database (eg. using RRD4j) to keep the size of the historic 
> data constant (eg. ~64kB per metric), but at the same providing out of the 
> box useful insights into the basic system behavior over time. This data could 
> be persisted to the {{.system}} collection as blobs, and it could be also 
> presented in the Admin UI as graphs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11779) Basic long-term collection of aggregated metrics

2018-06-11 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509136#comment-16509136
 ] 

Yonik Seeley commented on SOLR-11779:
-

I don't know if it's this issue or a related issue, but all basic tests as well 
as "bin/solr start" now throw the following exception:
{code}
2018-06-12 03:45:57.146 WARN  (main) [   ] o.a.s.h.a.MetricsHistoryHandler 
Error querying .system collection, keeping metrics history in memory
org.apache.solr.common.SolrException: No such core: .system
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:161)
 ~[solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194) 
~[solr-solrj-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:12]
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942) 
~[solr-solrj-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:12]
at 
org.apache.solr.handler.admin.MetricsHistoryHandler.checkSystemCollection(MetricsHistoryHandler.java:282)
 [solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at 
org.apache.solr.handler.admin.MetricsHistoryHandler.(MetricsHistoryHandler.java:235)
 [solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at 
org.apache.solr.core.CoreContainer.createMetricsHistoryHandler(CoreContainer.java:780)
 [solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:578) 
[solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at 
org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(SolrDispatchFilter.java:252)
 [solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:172) 
[solr-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT 
7773bf67643a152e1d12bed253345a40ef14f0e9 - yonik - 2018-06-11 20:14:07]
at 
org.eclipse.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139) 
[jetty-servlet-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:741) 
[jetty-servlet-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:374)
 [jetty-servlet-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1497) 
[jetty-webapp-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1459) 
[jetty-webapp-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:785)
 [jetty-server-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:287)
 [jetty-servlet-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:545) 
[jetty-webapp-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
 [jetty-util-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:46)
 [jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:192) 
[jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:505)
 [jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:151) 
[jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:180)
 [jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.providers.WebAppProvider.fileAdded(WebAppProvider.java:453)
 [jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at 
org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:64)
 [jetty-deploy-9.4.10.v20180503.jar:9.4.10.v20180503]
at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:610) 

[jira] [Commented] (SOLR-9685) tag a query in JSON syntax

2018-06-11 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509052#comment-16509052
 ] 

Yonik Seeley commented on SOLR-9685:


Stepping through with the debugger, it looks like this is the type of 
local-params string being built:
{code}
{!bool should={!tag=MYTAG}id:1 should=$_tt0 }
{code}

So we need to use variables for parameters here as well.
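
That is, instead of inlining the tagged sub-query (whose embedded
local-params break the outer parse), every sub-query would become a
parameter reference (the generated parameter names here are illustrative):

{code}
q={!bool should=$_tt1 should=$_tt0}
_tt1={!tag=MYTAG}id:1
_tt0=id:2
{code}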

> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685.patch, SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9685) tag a query in JSON syntax

2018-06-11 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509002#comment-16509002
 ] 

Yonik Seeley commented on SOLR-9685:


Here's one of the simplest examples of a query that fails to parse:
{code}
curl http://localhost:8983/solr/techproducts/query -d ' {
  query:{bool:{
must:{"#TOP" : "text:memory"}
  }}
}'
{code}

{code}
{
  "responseHeader":{
"status":400,
"QTime":8,
"params":{
  "json":" {\n  query:{bool:{\nmust:{\"#TOP\" : \"text:memory\"}\n  
}}\n}"}},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.search.SyntaxError"],
"msg":"org.apache.solr.search.SyntaxError: Missing end to unquoted value 
starting at 6 str='{!tag=TOP'",
"code":400}}
{code}

> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685.patch, SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Reopened] (SOLR-9685) tag a query in JSON syntax

2018-06-11 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reopened SOLR-9685:


> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685.patch, SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9685) tag a query in JSON syntax

2018-06-11 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508753#comment-16508753
 ] 

Yonik Seeley commented on SOLR-9685:


Looks like escaping bugs when producing the local-params variant from the JSON 
one.
If possible, this should be fixed for 7.4.

> tag a query in JSON syntax
> --
>
> Key: SOLR-9685
> URL: https://issues.apache.org/jira/browse/SOLR-9685
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module, JSON Request API
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-9685-doc.patch, SOLR-9685.patch, SOLR-9685.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There should be a way to tag a query/filter in JSON syntax.
> Perhaps these two forms could be equivalent:
> {code}
> "{!tag=COLOR}color:blue"
> { tagged : { COLOR : "color:blue" }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5211) updating parent as childless makes old children orphans

2018-06-03 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499542#comment-16499542
 ] 

Yonik Seeley commented on SOLR-5211:


If we check for \_root\_ in the index, everything could be back compat (and 
avoid the need for schema update + reindex).

If parent-child docs are being used, then updates could use 2 update terms (one 
for id and one for \_root\_)
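
At the Lucene level, the "2 update terms" idea might look like this sketch
(field names follow the discussion; this is not the committed Solr change):

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Delete any prior version addressed by either key, so children rooted at
// this id are removed even when the replacement block arrives childless.
static void updateBlock(IndexWriter writer, String id, List<Document> block)
    throws IOException {
  writer.deleteDocuments(new Term("id", id), new Term("_root_", id));
  writer.addDocuments(block); // re-add the parent (and any new children)
}
{code}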

> updating parent as childless makes old children orphans
> ---
>
> Key: SOLR-5211
> URL: https://issues.apache.org/jira/browse/SOLR-5211
> Project: Solr
>  Issue Type: Sub-task
>  Components: update
>Affects Versions: 4.5, 6.0
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: SOLR-5211.patch, SOLR-5211.patch
>
>
> if I have parent with children in the index, I can send update omitting 
> children. as a result old children become orphaned. 
> I suppose separate \_root_ fields makes much trouble. I propose to extend 
> notion of uniqueKey, and let it spans across blocks that makes updates 
> unambiguous.  
> WDYT? Do you like to see a test proves this issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12366) Avoid SlowAtomicReader.getLiveDocs -- it's slow

2018-06-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499147#comment-16499147
 ] 

Yonik Seeley commented on SOLR-12366:
-

Nice catch, this stuff has been broken forever!
 Looking back, I think not enough was exposed to be able to work per-segment, 
so Lucene's MultiReader.isDeleted(int doc) did a binary search each time. Once 
we gained the ability to operate per-segment, some code wasn't converted.
{quote}IMO some callers of SolrIndexSearcher.getSlowAtomicReader should change 
to use MultiFields to avoid the temptation to have a LeafReader that has many 
slow methods.
{quote}
MultiFields has slow methods as well, and if you look at the histories, many 
places used MultiFields.getDeletedDocs even before (and were replaced with the 
equivalent?)
 For example, commit 6ffc159b40 changed getFirstMatch to use 
MultiFields.getDeletedDocs (which may not have been a bug since it probably was 
equivalent at the time?)

Anyway, I think perhaps we should throw an exception for any place in 
SlowCompositeReaderWrapper that exposes code that does a binary search. We 
don't need a full Reader implementation here I think.

A variable name change for "SolrIndexSearcher.leafReader" would really be 
welcome too... it's a bad name.  We've been bit by the naming before as well: 
SOLR-9592
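
For contrast, the per-segment pattern that avoids the binary search entirely
(standard Lucene API, shown here in isolation):

{code:java}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.Bits;

// Walk live docs segment by segment; each leaf exposes its own liveDocs,
// so there is no per-lookup MultiBits binary search.
static void visitLiveDocs(IndexReader reader) {
  for (LeafReaderContext ctx : reader.leaves()) {
    LeafReader leaf = ctx.reader();
    Bits liveDocs = leaf.getLiveDocs(); // null: segment has no deletions
    for (int doc = 0; doc < leaf.maxDoc(); doc++) {
      if (liveDocs == null || liveDocs.get(doc)) {
        int globalDoc = ctx.docBase + doc; // process the live doc here
      }
    }
  }
}
{code}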

> Avoid SlowAtomicReader.getLiveDocs -- it's slow
> ---
>
> Key: SOLR-12366
> URL: https://issues.apache.org/jira/browse/SOLR-12366
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12366.patch, SOLR-12366.patch, SOLR-12366.patch, 
> SOLR-12366.patch
>
>
> SlowAtomicReader is of course slow, and it's getLiveDocs (based on MultiBits) 
> is slow as it uses a binary search for each lookup.  There are various places 
> in Solr that use SolrIndexSearcher.getSlowAtomicReader and then get the 
> liveDocs.  Most of these places ought to work with SolrIndexSearcher's 
> getLiveDocs method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5211) updating parent as childless makes old children orphans

2018-06-02 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499072#comment-16499072
 ] 

Yonik Seeley commented on SOLR-5211:


bq. Or maybe your comment is how do we handle an existing index before this 
rule existed?

More as an alternative direction that would not require the rule (that every 
document have root), only those with children (as is done today).
We constantly get dinged on usability because of things that require static 
configuration, and this is yet another (that would require reindexing even)


> updating parent as childless makes old children orphans
> ---
>
> Key: SOLR-5211
> URL: https://issues.apache.org/jira/browse/SOLR-5211
> Project: Solr
>  Issue Type: Sub-task
>  Components: update
>Affects Versions: 4.5, 6.0
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: SOLR-5211.patch
>
>
> if I have parent with children in the index, I can send update omitting 
> children. as a result old children become orphaned. 
> I suppose separate \_root_ fields makes much trouble. I propose to extend 
> notion of uniqueKey, and let it spans across blocks that makes updates 
> unambiguous.  
> WDYT? Do you like to see a test proves this issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5211) updating parent as childless makes old children orphans

2018-06-01 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498627#comment-16498627
 ] 

Yonik Seeley commented on SOLR-5211:


It should be relatively trivial to know if the \_root\_ field exists in the 
index (i.e. when any parent/child groups exist) and do the right thing based on 
that.
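
One way such a check might look (a sketch; the real change would live
wherever the update path decides how many delete terms to use):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

// True if any segment has indexed the _root_ field, i.e. the index
// contains (or once contained) parent/child document blocks.
static boolean hasBlockJoinDocs(IndexReader reader) throws IOException {
  for (LeafReaderContext ctx : reader.leaves()) {
    if (ctx.reader().terms("_root_") != null) {
      return true;
    }
  }
  return false;
}
{code}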



> updating parent as childless makes old children orphans
> ---
>
> Key: SOLR-5211
> URL: https://issues.apache.org/jira/browse/SOLR-5211
> Project: Solr
>  Issue Type: Sub-task
>  Components: update
>Affects Versions: 4.5, 6.0
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Attachments: SOLR-5211.patch
>
>
> if I have parent with children in the index, I can send update omitting 
> children. as a result old children become orphaned. 
> I suppose separate \_root_ fields makes much trouble. I propose to extend 
> notion of uniqueKey, and let it spans across blocks that makes updates 
> unambiguous.  
> WDYT? Do you like to see a test proves this issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12374) Add SolrCore.withSearcher(lambda accepting SolrIndexSearcher)

2018-05-30 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495231#comment-16495231
 ] 

Yonik Seeley commented on SOLR-12374:
-

The CHANGES for 7.4 has:
* SOLR-12374: SnapShooter.getIndexCommit can forget to decref the searcher; 
though it's not clear in practice when.
 (David Smiley)

But it's missing on the master branch...

> Add SolrCore.withSearcher(lambda accepting SolrIndexSearcher)
> -
>
> Key: SOLR-12374
> URL: https://issues.apache.org/jira/browse/SOLR-12374
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Fix For: 7.4
>
> Attachments: SOLR-12374.patch
>
>
> I propose adding the following to SolrCore:
> {code:java}
>   /**
>* Executes the lambda with the {@link SolrIndexSearcher}.  This is more 
> convenient than using
>* {@link #getSearcher()} since there is no ref-counting business to worry 
> about.
>* Example:
>* <pre>
>*   IndexReader reader = 
> h.getCore().withSearcher(SolrIndexSearcher::getIndexReader);
>* </pre>
>*/
>   @SuppressWarnings("unchecked")
>   public <R> R withSearcher(Function<SolrIndexSearcher, R> lambda) {
> final RefCounted<SolrIndexSearcher> refCounted = getSearcher();
> try {
>   return lambda.apply(refCounted.get());
> } finally {
>   refCounted.decref();
> }
>   }
> {code}
> This is a nice tight convenience method, avoiding the clumsy RefCounted API 
> which is easy to accidentally incorrectly use – see 
> https://issues.apache.org/jira/browse/SOLR-11616?focusedCommentId=16477719=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16477719
> I guess my only (small) concern is if hypothetically you might make the 
> lambda short because it's easy to do that (see the one-liner example above) 
> but the object you return that you're interested in  (say IndexReader) could 
> potentially become invalid if the SolrIndexSearcher closes.  But I think/hope 
> that's impossible normally based on when this getSearcher() used?  I could at 
> least add a warning to the docs.
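
Usage, per the javadoc example above (assuming a {{SolrCore}} in hand named
{{core}}):

{code:java}
// Ref-counting is handled inside withSearcher; the lambda's return value
// is passed straight through.
IndexReader reader = core.withSearcher(SolrIndexSearcher::getIndexReader);
int maxDoc = core.withSearcher(SolrIndexSearcher::maxDoc);
{code}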



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-12417) velocity response writer v.json should enforce valid function name

2018-05-29 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned SOLR-12417:
---

Assignee: Yonik Seeley

> velocity response writer v.json should enforce valid function name
> --
>
> Key: SOLR-12417
> URL: https://issues.apache.org/jira/browse/SOLR-12417
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
> Environment: VelocityResponseWriter should enforce that v.json 
> parameter is just a function name
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12417.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12417) velocity response writer v.json should enforce valid function name

2018-05-29 Thread Yonik Seeley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-12417:

Attachment: SOLR-12417.patch

> velocity response writer v.json should enforce valid function name
> --
>
> Key: SOLR-12417
> URL: https://issues.apache.org/jira/browse/SOLR-12417
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
> Environment: VelocityResponseWriter should enforce that v.json 
> parameter is just a function name
>Reporter: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12417.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-12417) velocity response writer v.json should enforce valid function name

2018-05-29 Thread Yonik Seeley (JIRA)
Yonik Seeley created SOLR-12417:
---

 Summary: velocity response writer v.json should enforce valid 
function name
 Key: SOLR-12417
 URL: https://issues.apache.org/jira/browse/SOLR-12417
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
 Environment: VelocityResponseWriter should enforce that v.json 
parameter is just a function name
Reporter: Yonik Seeley






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-12328) Adding graph json facet domain change

2018-05-27 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-12328.
-
   Resolution: Fixed
Fix Version/s: 7.4

> Adding graph json facet domain change
> -
>
> Key: SOLR-12328
> URL: https://issues.apache.org/jira/browse/SOLR-12328
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.3
>Reporter: Daniel Meehl
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12328.patch
>
>
> Json facets now support join queries via domain change. I've made a 
> relatively small enhancement to add graph to the mix. I'll attach a patch for 
> your viewing. I'm hoping this can be merged into solr proper. Please let me 
> know if there are any problems/changes/requirements. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12328) Adding graph json facet domain change

2018-05-27 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492204#comment-16492204
 ] 

Yonik Seeley commented on SOLR-12328:
-

I fixed up the null traversal filter noted, consolidated the tests, and 
committed.  Thanks!

> Adding graph json facet domain change
> -
>
> Key: SOLR-12328
> URL: https://issues.apache.org/jira/browse/SOLR-12328
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.3
>Reporter: Daniel Meehl
>Assignee: Yonik Seeley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12328.patch
>
>
> Json facets now support join queries via domain change. I've made a 
> relatively small enhancement to add graph to the mix. I'll attach a patch for 
> your viewing. I'm hoping this can be merged into solr proper. Please let me 
> know if there are any problems/changes/requirements. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-12328) Adding graph json facet domain change

2018-05-27 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned SOLR-12328:
---

Assignee: Yonik Seeley

> Adding graph json facet domain change
> -
>
> Key: SOLR-12328
> URL: https://issues.apache.org/jira/browse/SOLR-12328
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Facet Module
>Affects Versions: 7.3
>Reporter: Daniel Meehl
>Assignee: Yonik Seeley
>Priority: Major
> Attachments: SOLR-12328.patch
>
>
> Json facets now support join queries via domain change. I've made a 
> relatively small enhancement to add graph to the mix. I'll attach a patch for 
> your viewing. I'm hoping this can be merged into solr proper. Please let me 
> know if there are any problems/changes/requirements. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


