Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
Happy to help! If I'm reading the block of code linked earlier in this thread
correctly, "dvhash" is silently ignored for multi-valued fields. So probably
not much performance difference there ;-)
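
Pulling the thread together, here is a toy model of the behaviour as described in these messages. It is not Solr's code: the function name and parameters are made up for illustration, and the rules are only my reading of the linked FacetField.java block, so treat it as a sketch rather than a specification.

```python
def dvhash_used(requested_method, multi_valued, numeric, has_prefix=False, mincount=1):
    """Return True if, per this thread's reading, dvhash would actually be used.

    Hypothetical helper: it models the rules discussed in the thread and
    does not call or reproduce Solr itself.
    """
    # Multi-valued fields: a dvhash request is silently ignored.
    if multi_valued:
        return False
    # dvhash is described as not working with prefixes or mincount == 0.
    if has_prefix or mincount == 0:
        return False
    # Single-valued numeric fields use dvhash, even by default.
    if numeric:
        return True
    # Single-valued string-like fields: "smart" does not pick dvhash,
    # so it has to be requested explicitly.
    return requested_method == "dvhash"


# Single-valued string field, default method: dvhash is not picked.
print(dvhash_used("smart", multi_valued=False, numeric=False))
# Same field with dvhash requested explicitly.
print(dvhash_used("dvhash", multi_valued=False, numeric=False))
# Multi-valued string field with dvhash requested.
print(dvhash_used("dvhash", multi_valued=True, numeric=False))
```

The multi-valued case is the one raised above: the request succeeds, but in this reading the method hint has no effect.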



RE: Clarification on term facet method dvhash

2021-02-05 Thread ufuk yılmaz
This is a huge help, Mr. Gibney, thank you!

One thing I can add: I tried dvhash with a multi-valued string field. The
request worked and didn’t throw any error, but I don’t know whether dvhash was
silently ignored or actually used.




Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
Correction: regarding "dvhash" and numeric types, it looks like I had it
exactly backwards! Single-valued numeric types _do_ use (and even default to)
"dvhash". Sorry about that! I stand by the rest of the previous message though,
which applies at a minimum to string-like fields.



Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
> Performance and resource usage are still affected by 30M unique values of T,
> right?
Yes. The main performance issue would be the per-request allocation of a
30M-element `long[]` for "dv" or "uif" methods (which are by far the most
common methods in practice). With low enough request volume and large
enough heap you might not actually perceive a difference in performance;
but if you encounter problems for the use case you describe, this array
allocation would likely be the cause. (Also note that the relevant field
cardinality is the _per-shard_ cardinality, so in a multi-shard collection
the allocated arrays may be somewhat smaller than the overall field
cardinality would suggest.)
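
To put a rough number on that allocation, here is a back-of-the-envelope sketch. It assumes 8 bytes per `long` element and ignores the array header and any other per-request structures. The 4-shard layout and the even split of distinct values across shards are illustrative assumptions, not measurements.

```python
# Rough per-request size of a count array with one slot per distinct value.
# Assumes 8 bytes per Java long; ignores the array header and other overhead.
BYTES_PER_LONG = 8


def count_array_bytes(distinct_values_on_shard: int) -> int:
    """Approximate bytes for a long[] with one slot per distinct value on a shard."""
    return distinct_values_on_shard * BYTES_PER_LONG


# Single shard holding all 30M distinct values of T.
single_shard = count_array_bytes(30_000_000)

# Hypothetical 4-shard collection where each shard happens to hold about a
# quarter of the distinct values (an assumption; real distributions vary).
per_shard = count_array_bytes(30_000_000 // 4)

print(f"single shard: {single_shard / 1e6:.0f} MB per request")
print(f"4 shards:     {per_shard / 1e6:.0f} MB per request per shard")
```

Under these assumptions the array costs a few hundred megabytes per request on a single shard, which is why request volume and heap size decide whether it is noticeable.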

I'm reasonably sure that "dvhash" is _not_ auto-picked by "smart" at the
moment, but rather must be specified explicitly:
https://github.com/apache/lucene-solr/blob/6ff4a9b395a68d9b0d9e259537e3f5daf0278d51/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L124-L128

The code snippet above indicates some other restrictions that you're
probably already aware of (doesn't work with prefixes or mincount==0, or
for multi-valued or numeric types); otherwise though (for a non-numeric
single-valued field) I think the situation you describe (high-cardinality
field, known low-cardinality for the particular domain) sounds like a
perfect use-case for dvhash.
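
A minimal sketch of what such a request could look like, built as a Python dict and serialized with the standard `json` module. The facet name `t_values` and the query string are placeholders, and the helper function is hypothetical. The overall body shape follows the JSON Facet API as I understand it, so check it against the reference guide for your Solr version.

```python
import json


def terms_facet_request(query: str, field: str, limit: int = 15) -> dict:
    """Build a JSON request body with a terms facet that asks for dvhash."""
    return {
        "query": query,
        "facet": {
            # "t_values" is an arbitrary, made-up label for this facet.
            "t_values": {
                "type": "terms",
                "field": field,
                "limit": limit,
                # Explicit method hint, since "smart" reportedly does not pick it.
                "method": "dvhash",
            }
        },
    }


body = terms_facet_request("some query", "T")
print(json.dumps(body, indent=2))
```

Only the `method` entry differs from a default terms facet, so it is easy to toggle when comparing timings and heap usage with and without the hint.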

Michael



Clarification on term facet method dvhash

2021-02-05 Thread ufuk yılmaz
Hello,

I’m using Solr 8.4 and am very excited about the performance improvements in 8.8:
http://joelsolr.blogspot.com/2021/01/optimizations-coming-to-solr.html

As I understand it, the main determinant of performance and RAM usage for a
terms facet is the cardinality of the field in the whole collection, not the
cardinality of the field in the query result.

I have a collection with 100M docs, and field T has 30M unique values across
the entire collection. But my query result only contains docs with 2 different
T values:

{
  "q": "some query", // whose result has only 2 different T values
  "facet": {
    "type": "terms",
    "field": "T",
    "limit": 15
  }
}

Performance and resource usage are still affected by 30M unique values of T,
right?

If this is correct, can "method": "dvhash" help in this case, and if so, how?
If yes, does the default method "smart" take this into account and use
dvhash, so that I don’t have to set it explicitly?

Have a nice weekend
~ufuk