Re: RAM estimate for docvalues is incorrect

2020-05-27 Thread David Smiley
John: you may benefit from merging small segments more eagerly on commit.
At Salesforce we have a *ton* of indexes, and we cut the segment count to
half the default.  The large number of fields made this an especially
desirable trade-off.  You might look at the recent issue
https://issues.apache.org/jira/browse/LUCENE-8962, which isn't released yet,
but in it I show (with PRs to the code) how to accomplish this without
hacking on Lucene itself.  You may also find this conference presentation I
gave with my colleagues interesting; it touches on the topic:
https://youtu.be/hqeYAnsxPH8?t=855
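
The simplest form of this is plain TieredMergePolicy tuning; a minimal
sketch (illustrative values, not exactly what we run; the LUCENE-8962
merge-on-commit hook is the more targeted follow-up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    // Halving segmentsPerTier from its default of 10 roughly halves the
    // steady-state segment count, at the cost of extra merge I/O.
    mergePolicy.setSegmentsPerTier(5.0);
    config.setMergePolicy(mergePolicy);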

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley




Re: RAM estimate for docvalues is incorrect

2020-05-27 Thread John Wang
Thanks Adrien!

It is surprising to learn that this is considered an invalid use case and
that Lucene may get rid of memory accounting...

There are indeed many fields in our test: 1,000 numeric doc values fields
and 5 million docs in a single segment. (We will have many segments in our
production use case.)
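
The shape of that test, as a sketch (not our actual harness):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    // Build one segment with 5M docs, each carrying 1000 numeric doc values.
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/dv-test")),
        new IndexWriterConfig(new StandardAnalyzer()))) {
      for (int docId = 0; docId < 5_000_000; docId++) {
        Document doc = new Document();
        for (int field = 0; field < 1000; field++) {
          doc.add(new NumericDocValuesField("field_" + field, docId));
        }
        writer.addDocument(doc);
      }
      writer.forceMerge(1); // a single segment, as in the numbers below
    }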

Accounting for the elements in the maps reports 363,456 bytes, versus
59,216 bytes with the default behavior: roughly a 6x difference.

We have deployments with much more than 1000 fields, so I don't think that
is extreme.

Our use case:

We will have many segments/readers, and we found that opening them at query
time is expensive, so we are caching them.

Since we don't know the data ahead of time, we use each reader's accounted
memory as its size in the cache.
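
For concreteness, the pattern is roughly the following (a hypothetical
sketch, not our actual cache; closing/decRef of evicted readers omitted):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.index.SegmentReader;

    // An LRU cache bounded by the readers' self-reported RAM usage. If
    // ramBytesUsed() undercounts, the heap actually held by the cache can
    // be several times maxRamBytes.
    class ReaderCache {
      private final long maxRamBytes;
      private long ramBytes = 0;
      private final Map<String, SegmentReader> readers =
          new LinkedHashMap<>(16, 0.75f, true); // access order = LRU

      ReaderCache(long maxRamBytes) { this.maxRamBytes = maxRamBytes; }

      synchronized void put(String segmentName, SegmentReader reader) {
        readers.put(segmentName, reader);
        ramBytes += reader.ramBytesUsed();
        while (ramBytes > maxRamBytes && readers.size() > 1) {
          Map.Entry<String, SegmentReader> eldest =
              readers.entrySet().iterator().next();
          ramBytes -= eldest.getValue().ramBytesUsed();
          readers.remove(eldest.getKey());
        }
      }
    }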

We found the readers' accounting to be unreliable, dug into it, and found
this.

If we should not be using this, what would be the correct way to handle
this?

Thank you

-John




Re: RAM estimate for docvalues is incorrect

2020-05-27 Thread Adrien Grand
A couple of major versions ago, Lucene required tons of heap memory to keep a
reader open, e.g. norms were on heap and so on. To my knowledge, the only
thing that is now kept in memory and is a function of maxDoc is live docs;
all other codec components require very little memory. I'm actually
wondering whether we should remove memory accounting on readers. When Lucene
used tons of memory we could focus on the main contributors to memory usage
and be mostly correct. But now, given how little memory Lucene is using, it's
quite hard to figure out what the main contributing factors to memory usage
are. And it's probably not that useful either: why is it important to know
how much memory something is using if it's not much?

So I'd be curious to know more about your use case for reader caching.
Would we break your use case if we removed memory accounting on readers?
Given the lines that you are pointing out, I believe you must have either
many fields or many segments if these maps are using lots of memory?




-- 
Adrien


RAM estimate for docvalues is incorrect

2020-05-27 Thread John Wang
Hello,

We have a reader cache that depends on the memory usage of each reader. We
found the calculation of reader size for doc values to be undercounting.

See line:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69

It looks like the memory estimate uses only the shallow size of the class
and does not include the objects stored in the maps:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55

We made a local patch and saw a significant difference in reported size.
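
Roughly, the patch makes ramBytesUsed() include the per-field entry maps;
a sketch (map field names follow the linked file, and this assumes
RamUsageEstimator.sizeOfMap from lucene-core):

    import org.apache.lucene.util.RamUsageEstimator;

    @Override
    public long ramBytesUsed() {
      return ramBytesUsed // shallow instance size, as reported today
          + RamUsageEstimator.sizeOfMap(numerics)
          + RamUsageEstimator.sizeOfMap(binaries)
          + RamUsageEstimator.sizeOfMap(sorted)
          + RamUsageEstimator.sizeOfMap(sortedSets)
          + RamUsageEstimator.sizeOfMap(sortedNumerics);
    }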

Please let us know if this is the right thing to do; we are happy to
contribute our patch.

Thanks

-John


Re: Skip indexing facet drill down terms

2020-05-27 Thread Michael McCandless
Hi Ankur,

Indeed, I don't think this is an option in FacetsConfig today.

I think it makes sense to add one ... it should be fairly simple.  Just
follow how the "foo" level (dimension only) drilldown option was added in
LUCENE-8367?
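
For illustration only, the option might look something like this (the setter
name is invented; nothing like it exists in FacetsConfig yet):

    import org.apache.lucene.facet.FacetsConfig;

    FacetsConfig config = new FacetsConfig();
    // Hypothetical setter, modeled on the LUCENE-8367 option that skips the
    // dimension-only term: index no drill-down terms at all for "foo", so
    // neither 'foo' nor 'foo/bar' produces a StringField.
    config.setSkipDrillDownTerms("foo", true);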

Could you open an issue?

Thanks!

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 26, 2020 at 4:10 PM Ankur Goel  wrote:

> (forgot to put subject in previous e-mail,  resending)
>
> Hello Folks,
> We have been using the faceted search feature of Lucene in our search
> application without the feature that lets you drill down on a facet
> dimension.
> With https://issues.apache.org/jira/browse/LUCENE-8367 you get the
> ability to skip indexing drill-down terms for the top-level dimension,
> i.e. '*foo*' for path '*foo/bar*'.
>
> However, a StringField still gets created for path '*foo/bar*', which will
> never be queried in our application, wasting index space.
> Ideally we would like `o.a.l.f.FacetsConfig` to provide an option to skip
> indexing drill-down terms altogether.
>
> I wonder if the community has other ideas on this.
>
> Thanks
> -Ankur
>


Re: BadApple report

2020-05-27 Thread Jason Gerlowski
> Hoss’s rollups are here: 
> http://fucit.org/solr-jenkins-reports/failure-report.html which show the 
> rates, but not where they came from.

If I click on a particular test entry on "failure-report.html", I'm
presented with a dialog containing links for each failure.  Clicking such a
link takes me to a file-listing page (e.g.
http://fucit.org/solr-jenkins-reports/job-data/apache/Lucene-Solr-Tests-8.x/1569/),
with Jenkins logs, etc. for that particular failure.  Notably, it also
has a file called "url.txt" with a link to the actual failure in
Jenkins (e.g. 
http://fucit.org/solr-jenkins-reports/job-data/apache/Lucene-Solr-Tests-8.x/1569/url.txt).

That's just what I've seen with the few failures I've clicked on; the
rollups might not have that for all failures, or for all the different
source Jenkins instances.  But you can get back to the Jenkins job in at
least _some_ cases with a bit of clicking.

On Mon, May 25, 2020 at 1:27 PM Ilan Ginzburg  wrote:
>
> Thanks that helps. I'll try to have a look at some of the failures related to 
> areas I know.
>
> Ilan
>
> On Mon, May 25, 2020 at 7:07 PM Erick Erickson  
> wrote:
>>
>> Ilan:
>>
>> That’s, unfortunately, not an easy question. Hoss’s rollups are here: 
>> http://fucit.org/solr-jenkins-reports/failure-report.html which show the 
>> rates, but not where they came from.
>>
>> Here’s an example of a failure from Jenkins; if you follow the link you can 
>> see the full output (click “console output”, then “full log”): 
>> https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Linux/3181/. I usually see 
>> the individual failures go by, by subscribing to “bui...@lucene.apache.org”.
>>
>> Otherwise, what I often do is use Mark Miller’s “beasting” script to see if 
>> I can get it to reproduce locally and go from there:
>>
>> https://gist.github.com/markrmiller/dbdb792216dc98b018ad
>>
>> It’s all complicated by the fact that the failures are intermittent.
>>
>> Best,
>> Erick
>>
>> > On May 25, 2020, at 11:22 AM, Ilan Ginzburg  wrote:
>> >
>> > Where are the test failure details?
>> >
>> > On Mon, May 25, 2020 at 4:47 PM Erick Erickson  
>> > wrote:
>> > Here’s the summary:
>> >
>> > Raw fail count by week totals, most recent week first (corresponds to 
>> > bits):
>> > Week: 0  had  113 failures
>> > Week: 1  had  103 failures
>> > Week: 2  had  102 failures
>> > Week: 3  had  343 failures
>> >
>> >
>> > Failures in Hoss' reports for the last 4 rollups.
>> >
>> > There were 511 unannotated tests that failed in Hoss' rollups, ordered by
>> > the date I downloaded the rollup file, newest->oldest. See above for the
>> > dates the files were collected.
>> > These tests were NOT BadApple'd or AwaitsFix'd
>> >
>> > Failures in the last 4 reports..
>> > Report  Pct   runs  fails  test
>> >  0123   0.7   1593   40   BasicDistributedZkTest.test
>> >  0123   2.1   1518   28   MultiThreadedOCPTest.test
>> >  0123   0.7   1613   14   RollingRestartTest.test
>> >  0123   7.1   1635   44   ScheduledTriggerIntegrationTest.testScheduledTrigger
>> >  0123   2.4   1614   17   SearchRateTriggerTest.testWaitForElapsed
>> >  0123   0.2   1614    6   ShardSplitTest.testSplitShardWithRuleLink
>> >  0123   0.5   1577    5   SolrCloudReportersTest.testExplicitConfiguration
>> >  0123   0.7   1560   19   TestInPlaceUpdatesDistrib.test
>> >  0123   1.0   1566   17   TestPackages.testPluginLoading
>> >  0123   0.8   1598    7   TestQueryingOnDownCollection.testQueryToDownCollectionShouldFailFast
>> >  0123   0.7   1598    8   TestSimScenario.testAutoAddReplicas
>> > 
>> >
>> >
>> > Full report:
>> >
>>
>>
>>




Re: Solr Java-API Question

2020-05-27 Thread Ruscheinski, Johannes
Treating them as strings worked, thanks!


Johannes

--
Dr. Johannes Ruscheinski
Universitätsbibliothek Tübingen - IT-Abteilung -
Wilhelmstr. 32, 72074 Tübingen

Tel: +49 7071 29-72820
FAX: +49 7071 29-5069
Email: johannes.ruschein...@uni-tuebingen.de

 The Sophisticate:  "The world isn't black and white.  No one does pure good or 
pure bad. It's all gray.  Therefore, no one is better than anyone else."
The Zetet:  "Knowing only gray, you conclude that all grays are the same 
shade.  You mock the simplicity of the two-color view, yet you replace it with 
a one-color view..."
  —Marc Stiegler, David's Sling



From: David Smiley 
Sent: 26 May 2020 22:34:45
To: Solr/Lucene Dev
Subject: Re: Solr Java-API Question

I don't know about SolrMARC.  But with respect to input/output to Solr of
these fields, treat them as Strings encoded as described in the docs.
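
For example, a rough SolrJ sketch (core, field, and id names invented); each
range is just a string in the documented syntax, so the List in your
getDateRanges signature can simply be a List<String>:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/biblio").build();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "rec-1");
    // A DateRangeField value is sent as a plain string in range syntax:
    doc.addField("date_range", "[1990-01-01 TO 1995-12-31]");
    client.add(doc);    // exception handling omitted for brevity
    client.commit();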
~ David


On Tue, May 26, 2020 at 5:56 AM Ruscheinski, Johannes wrote:

Hi David,


I am stuck again.  In particular, I don't know how to initialize my 
DateRangeFields in Java.  I am trying to implement a function with a signature 
as follows:


public static List getDateRanges(final Record record, final String rangeFieldTag) {


My problem is what to use as the type parameter for List<>.  We're using
SolrMARC, and it is not immediately obvious from the documentation how to do
that.  "record" here is a MARC record containing fields with ranges that I
need to convert to a list of something I can use to populate my
DateRangeField instances.


Johannes




From: Ruscheinski, Johannes
Sent: 26 May 2020 10:55:41
To: dev@lucene.apache.org
Subject: Re: Solr Java-API Question


Hi David,


I just came back after a few days off and wanted to thank you for your help!  
I'll be following your suggestion and will be using DateRangeField.

Johannes



From: David Smiley
Sent: 20 May 2020 16:20:15
To: Solr/Lucene Dev
Subject: Re: Solr Java-API Question

I hope this helps:
https://lucene.apache.org/solr/guide/8_5/working-with-dates.html

No, LongPointField only does single points, not point-ranges.  Some day we need 
a dedicated LongRangeField and similar for other primitives.
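
For the intersection queries you asked about earlier, a filter query via the
field query parser works from Java too; a SolrJ sketch (field name invented,
op values per the docs above):

    import org.apache.solr.client.solrj.SolrQuery;

    SolrQuery query = new SolrQuery("*:*");
    // op can be Intersects (the default), Contains, or Within.
    query.addFilterQuery("{!field f=date_range op=Intersects}[1990 TO 1995-12-31]");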

~ David


On Wed, May 20, 2020 at 3:25 AM Ruscheinski, Johannes wrote:

Hi David,


thanks for the advice.  I hope I get the necessary resolution this way.  As I
understand it, the recommendation is to use seconds.  I think we need 9
decimal digits; that's up to 10^9 s, which is not quite 32 years.  Also, how
do I issue my intersection queries at the Java API level once I have
populated the DateRangeField instances?


Johannes




From: David Smiley
Sent: 19 May 2020 16:17:38
To: Solr/Lucene Dev
Subject: Re: Solr Java-API Question

Hi,

There's a wiki page on this: