Re: Lucene (unexpected ) fsync on existing segments

2021-03-27 Thread Rahul Goswami
Hello,
Opened the below JIRA for this issue. I will work on this and try to submit
a patch.
[LUCENE-9889] Lucene (unexpected ) fsync on existing segments - ASF JIRA
(apache.org) 

Thanks,
Rahul

On Fri, Mar 26, 2021 at 9:56 AM Rahul Goswami  wrote:

> Mike,
>
>  >> "But, I believe you (system locks up with MMapDirectory for your
> use-case), so there is a bug somewhere!  And I wish we could get to the
> bottom of that, and fix it."
>
> Yes that's true for Windows for sure. I haven't tested it on Unix-like
> systems to that scale, so don't have any observations to report there.
>
> >> "Also, this (system locks up when using MMapDirectory) sounds different
> from the "Lucene fsyncs files that it doesn't need to" bug, right?"
>
> That's correct, they are separate issues. I just brought up the
> system-freezing-up-on-Windows point in response to Uwe's explanation
> earlier.
>
> I know I had taken it upon myself to open up a Jira for the fsync issue,
> but it got delayed from my side as I got occupied with other things
> in my day job. Will open up one later today.
>
> Thanks,
> Rahul
>
>
> On Wed, Mar 24, 2021 at 12:58 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> MMapDirectory really should be (is supposed to be) better than
>> SimpleFSDirectory for your usage case.
>>
>> Memory mapped pages do not have to fit into your 64 GB physical space,
>> but the "hot" pages (parts of the index that you are actively querying)
>> ideally would fit mostly in free RAM on your box to have OK search
>> performance.  Run with as small a JVM heap as possible so the OS has the
>> most RAM to keep such pages hot.  Since you are getting OK performance with
>> SimpleFSDirectory it sounds like you do have enough free RAM for the parts
>> of the index you are searching...
>>
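The point about mapped pages can be illustrated with a minimal sketch that uses only the JDK's java.nio API. This is not Lucene's own code: the class name MmapSketch, the helper readByteAt, and the temporary file are made up for illustration. Mapping a file reserves virtual address space; the OS only pages data into RAM when a mapped offset is actually touched.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {

    /** Reads one byte at the given offset through a read-only memory mapping. */
    static byte readByteAt(Path file, int offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // map() does not copy the file into the Java heap; pages are
            // faulted in by the OS on first access to each region.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(offset); // absolute read, touches a single page
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative 1 MiB file; a real index would be far larger.
        Path tmp = Files.createTempFile("mmap-sketch", ".bin");
        tmp.toFile().deleteOnExit();
        byte[] data = new byte[1 << 20];
        data[123_456] = 42;
        Files.write(tmp, data);
        System.out.println(readByteAt(tmp, 123_456));
    }
}
```

Because the mapped buffer lives outside the Java heap, a smaller heap leaves more physical RAM for the OS to keep the frequently read pages resident.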
>> But, I believe you (system locks up with MMapDirectory for your use-case),
>> so there is a bug somewhere!  And I wish we could get to the bottom of
>> that, and fix it.
>>
>> Also, this (system locks up when using MMapDirectory) sounds different
>> from the "Lucene fsyncs files that it doesn't need to" bug, right?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Mar 15, 2021 at 4:28 PM Rahul Goswami 
>> wrote:
>>
>>> Uwe,
>>> I understand that mmap would only map *a part* of the index from virtual
>>> address space to physical memory as and when the pages are requested.
>>> However the limitation on our side is that in most cases, we cannot ask for
>>> more than 128 GB RAM (and unfortunately even that would be a stretch) for
>>> the Solr machine.
>>>
>>> I have read and re-read the article you referenced in the past :) It's
>>> brilliantly written and did help clarify quite a few things for me I must
>>> say. However, at the end of the day, there is only so much the OS (at least
>>> Windows) can do before it starts to swap different pages in a 2-3 TB index
>>> into 64 GB of physical space, isn't that right? The CPU usage spikes to
>>> 100% at such times and the machine becomes totally unresponsive. Turning on
>>> SimpleFSDirectory at such times does rid us of this issue. I understand
>>> that we are losing out on performance by an order of magnitude compared to
>>> mmap, but I don't know any alternate solution. Also, since most of our use
>>> cases are more write-heavy than read-heavy, we can afford to compromise on
>>> the search performance due to SimpleFS.
>>>
>>> Please let me know still, if there is anything about my explanation that
>>> doesn't sound right to you.
>>>
>>> Thanks,
>>> Rahul
>>>
>>> On Mon, Mar 15, 2021 at 3:54 PM Uwe Schindler  wrote:
>>>
>>>> This is not true. Memory mapping does not need to load the index into
>>>> RAM, so you don't need so much physical memory. Paging is done only between
>>>> index files and RAM; that's what memory mapping is about.
>>>>
>>>> Please read the blog post:
>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>
>>>> Uwe
>>>>
>>>> On March 15, 2021 7:43:29 PM UTC, Rahul Goswami <
>>>> rahul196...@gmail.com> wrote:
>>>>>
>>>>> Mike,
>>>>> Yes I am using a 64 bit JVM on Windows. I haven't tried reproducing
>>>>> the issue on Linux yet. In the past we have had problems with mmap on
>>>>> Windows with the machine freezing. The rationale I gave to myself is the
>>>>> amount of disk and CPU activity for paging in and out must be intense for
>>>>> the OS while trying to map an index that large into 64 GB of heap. Also
>>>>> since it's an on-premise deployment, we can't expect the customers of the
>>>>> product to provide nodes with > 400 GB RAM which is what *I think* would be
>>>>> required to get a decent performance with mmap. Hence we had to switch to
>>>>> SimpleFSDirectory.
>>>>>
>>>>> As for the fsync behavior, you are right. I tried with
>>>>> NRTCachingDirectoryFactory as well which defaults to using mmap underneath
>>>>> and still makes fsync calls for already existing 

Re: 9.0 release

2021-03-27 Thread Jan Høydahl
Hi,

Where are we at with the Lucene 9.0 release planning?

The git split is largely done. Not sure about the build.
Let's update the umbrella issue
https://issues.apache.org/jira/browse/LUCENE-9375
for known remaining cleanup tasks.
The one on that list is releaseWizard, but as Adrien says there are also other 
scripts that need updating.

Jan

> On 13 Jan 2021, at 15:10, Adrien Grand wrote:
> 
> +1 to start planning 9.0.
> 
> Since you mentioned the Gradle build, I believe that we still need to migrate 
> some of the release tooling from Ant to Gradle, e.g. 
> dev-tools/scripts/addBackcompatIndexes.py. These scripts are not easy to test 
> without actually doing a release so the 9.0 RM might have some debugging to 
> do.
> 
> 
> On Mon, Dec 28, 2020 at 7:17 PM Michael Sokolov wrote:
> Hi everyone, as we head into a new year full of optimism, is it time
> to start discussing the next major release? We released 8.0 on Mar 14,
> 2019, over 18 months ago. Since then we've switched to a Gradle-based
> build. We have added vector-valued fields and an HNSW neighbor search
> algorithm for them.  At the same time Solr has been getting a major
> overhaul which should justify a release, I think? IIRC there was talk
> of making 9.0 be the first release of Solr as its own TLP. Is it time
> to start planning for that now?
> 
> -Mike
> 
> 
> 
> -- 
> Adrien



Re: Bugfix release Lucene/Solr 8.8.2

2021-03-27 Thread Mike Drob
Ishan,

Thank you for bringing this up. I’m comfortable delaying an extra week to
accommodate the multitude of holidays (Holi, Passover, others) coming up.

I will adjust my schedule to start the vote Tuesday, Apr 6.

Please make sure that all backports are appropriately marked with
fixVersion in Jira and have corresponding CHANGES entries.

Mike

On Fri, Mar 26, 2021 at 11:11 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Hi Mike,
>
> I wish to get https://issues.apache.org/jira/browse/SOLR-15288 in, but
> will likely be able to wrap up by 2 April or so (on vacation right now due
> to the festival of Holi)
>
> Regards,
> Ishan
>
> On Sat, 27 Mar, 2021, 7:41 am Mike Drob,  wrote:
>
>> I am now preparing for a bugfix release from branch branch_8_8
>>
>> I plan to have the RC built and vote started on Tuesday, Mar 30. If you
>> have small, low risk bug fixes to backport before then, please do so using
>> your best judgement.
>>
>> Please observe the normal rules for committing to this branch:
>>
>> * Before committing to the branch, reply to this thread and argue
>>   why the fix needs backporting and how long it will take.
>> * All issues accepted for backporting should be marked with 8.8.2
>>   in JIRA, and issues that should delay the release must be marked as
>>   Blocker
>> * All patches that are intended for the branch should first be committed
>>   to the unstable branch, merged into the stable branch, and then into
>>   the current release branch.
>> * Only Jira issues with Fix version 8.8.2 and priority "Blocker" will
>>   delay a release candidate build.
>>
>> Thanks,
>> Mike
>>
>


Re: Questions about the new vector API

2021-03-27 Thread Dmitry Kan
Michael,

I have some interest in this area and have been doing a comparative study of
different KNN implementations and blogging about it.

Did you use nmslib for HNSW implementation or something else?

On Tue, 16 Mar 2021 at 22:47, Michael Sokolov  wrote:

> Yeah, HNSW is problematic in a few ways: (1) merging is costly due to
> the need to completely recreate the graph. (2) searching across a
> segmented index sacrifices much of the performance benefit of HNSW
> since the cost of searching HNSW graphs scales ~logarithmically with
> the size of the graph, so splitting into multiple graphs and then
> merge sorting results is pretty expensive. I guess the random access /
> scan forward dynamic is another problematic area.
>
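Point (2) above can be made concrete with a rough sketch, assuming each segment has already been searched separately and has returned its own list of hits. The names SegmentTopK, Hit, and mergeTopK are hypothetical and are not part of Lucene's API; the sketch only shows the final merge step, which keeps the k best hits across all segments with a bounded min-heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SegmentTopK {

    /** A hit from one segment (hypothetical type for this sketch). */
    static final class Hit {
        final int segment;
        final int doc;
        final float score;

        Hit(int segment, int doc, float score) {
            this.segment = segment;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Merges per-segment hit lists into the k best hits overall, best first. */
    static List<Hit> mergeTopK(List<List<Hit>> perSegment, int k) {
        // Min-heap on score: the weakest of the current top k sits at the head.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (List<Hit> hits : perSegment) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // drop the current weakest hit
                }
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(b.score, a.score)); // highest score first
        return out;
    }

    public static void main(String[] args) {
        List<Hit> seg0 = List.of(new Hit(0, 1, 0.9f), new Hit(0, 2, 0.4f));
        List<Hit> seg1 = List.of(new Hit(1, 7, 0.8f), new Hit(1, 3, 0.1f));
        for (Hit h : mergeTopK(List.of(seg0, seg1), 2)) {
            System.out.println("segment=" + h.segment + " doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

The merge itself is cheap; the cost described above comes from running one graph search per segment before this step, since each of those searches pays its own roughly logarithmic traversal.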
> On Tue, Mar 16, 2021 at 1:28 PM Robert Muir  wrote:
> >
> > Maybe that is so, but we should factor in everything: such as large
> > scale indexing, not requiring whole data set to be in RAM, etc. Hey, it's
> > Lucene!
> >
> > Because HNSW has dominated the nightly benchmarks, I have been digging
> > through stacktraces and trying to figure out ways to make it work
> > efficiently, and I'm not sure what to do.
> > Especially merge is painful: it seems to cause a storm of page
> > faults/random accesses due to how it works, and I don't know yet how to
> > make it better.
> > It seems to rebuild the entire graph, spraying random accesses across a
> > "slow-wrapper" that binary searches each sub on every access.
> > I don't see any way to even amortize the pain with some kind of bulk
> > merge trick.
> >
> > So if we find algorithms that scale better, I think we should lend a
> > preference towards them. For example, algorithms that allow
> > per-segment/sequential index and merge.
> >
> > On Tue, Mar 16, 2021 at 1:06 PM Michael Sokolov wrote:
> >>
> >> ann-benchmarks.com maintains open benchmarks of a bunch of ANN
> >> (approximate NN) algorithms. When we started this effort, HNSW was at
> >> the top of the heap in most of the benchmarks.
> >>
> >> On Tue, Mar 16, 2021 at 12:28 PM Robert Muir  wrote:
> >> >
> >> > Where are the alternative algorithms that work on sequential
> >> > iterators and don't need random access?
> >> >
> >> > Seems like these should be the ones we initially add to lucene, and
> >> > HNSW should be put aside for now? (is it a toy, or can we do it without
> >> > jazillions of random accesses?)
> >> >
> >> > On Tue, Mar 16, 2021 at 12:15 PM Michael Sokolov wrote:
> >> >>
> >> >> There's also some good discussion on
> >> >> https://issues.apache.org/jira/browse/LUCENE-9583 about random access
> >> >> vs iterator pattern that never got fully resolved. We said we would
> >> >> revisit after KNN (LUCENE-9004) landed, and now it has. The usage of
> >> >> random access is pretty well-established there, maybe we should
> >> >> abandon the iterator API since it is redundant (you can always iterate
> >> >> over a random access API if you know the size)?
> >> >>
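The remark that an iterator can always be layered on a random-access API can be sketched as follows. The interface RandomAccessVectors and the helpers ofArray and iterate are hypothetical names invented for this illustration; they are not the Lucene vector API under discussion.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class RandomAccessIteration {

    /** Hypothetical random-access view of vectors, addressed by ordinal 0..size()-1. */
    interface RandomAccessVectors {
        int size();

        float[] vectorValue(int ord);
    }

    /** Wraps an in-memory array as a random-access view (for the example only). */
    static RandomAccessVectors ofArray(float[][] data) {
        return new RandomAccessVectors() {
            @Override
            public int size() {
                return data.length;
            }

            @Override
            public float[] vectorValue(int ord) {
                return data[ord];
            }
        };
    }

    /** Builds a forward-only iterator by stepping an ordinal from 0 to size()-1. */
    static Iterator<float[]> iterate(RandomAccessVectors vectors) {
        return new Iterator<float[]>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < vectors.size();
            }

            @Override
            public float[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return vectors.vectorValue(next++);
            }
        };
    }

    public static void main(String[] args) {
        Iterator<float[]> it = iterate(ofArray(new float[][] {{1f, 2f}, {3f, 4f}}));
        while (it.hasNext()) {
            float[] v = it.next();
            System.out.println(v[0] + ", " + v[1]);
        }
    }
}
```

The reverse direction does not hold in general: a forward-only iterator cannot serve arbitrary ordinals without buffering or rewinding, which is the asymmetry behind the random-access versus iterator question.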
> >> >> On Tue, Mar 16, 2021 at 12:10 PM Michael Sokolov wrote:
> >> >> >
> >> >> > Also, Tomoko re:LUCENE-9322, did it succeed? I guess we won't know for
> >> >> > sure unless someone revives
> >> >> > https://issues.apache.org/jira/browse/LUCENE-9136 or something like
> >> >> > that
> >> >> >
> >> >> > On Tue, Mar 16, 2021 at 12:04 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >> >> > >
> >> >> > > Consistent plural naming makes sense to me. I think it ended up
> >> >> > > singular because I am biased to avoid plural names unless there is a
> >> >> > > useful distinction to be made. But consistency should trump my
> >> >> > > predilections.
> >> >> > >
> >> >> > > I think the reason we have search() on VectorValues is that we have
> >> >> > > LeafReader.getVectorValues() (by analogy to the DocValues iterators),
> >> >> > > but no way to access the VectorReader. Do you think we should also
> >> >> > > have LeafReader.getVectorReader()? Today it's only on CodecReader.
> >> >> > >
> >> >> > > Re: SearchStrategy.NONE; the idea is we support efficient access to
> >> >> > > floating point values. Using BinaryDocValues for this will always
> >> >> > > require an additional decoding step. I can see that the naming is
> >> >> > > confusing there. The intent is that you index the vector values, but
> >> >> > > no additional indexing data structure. Also: the reason HNSW is
> >> >> > > mentioned in these SearchStrategy enums is to make room for other
> >> >> > > vector indexing approaches, like LSH. There was a lot of discussion
> >> >> > > that we wanted an API that allowed for experimenting with other
> >> >> > > techniques for indexing and searching vector values.
> >> >> > >
> >> >> > > Adrien, you made an analogy to PerFieldPostingsFormat (and DocValues),
> >> >> > > but I think the situation is more akin to Points, where we have the
> >> >> > > options on IndexableField. The metadata we store there (dimension and
> >> >> > > score function) don't really result in different formats, ie code
> >> >> > > paths for