I'd also love to understand this:

> using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on
> Windows for our index sizes which commonly run north of 1 TB)

Is this a known problem on certain versions of Windows?  Normally
memory-mapped IO can scale to very large sizes (well beyond system RAM) and
the OS does the right thing (caches the frequently accessed parts of the index).

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 7, 2023 at 7:23 AM Adrien Grand <jpou...@gmail.com> wrote:

> I agree it's worth discussing. I opened
> https://github.com/apache/lucene/issues/12355 and
> https://github.com/apache/lucene/issues/12356.
>
> On Tue, Jun 6, 2023 at 9:17 PM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >
> > Thanks Adrien. I spent some time trying to understand the readByte() in
> > ReverseRandomAccessReader (through FST) and compare with 7.x.  Although I
> > don't understand ALL of the details and reasoning for always loading the
> > FST (and in turn the term index) off-heap (as discussed in
> > https://github.com/apache/lucene/issues/10297 ) I understand that this is
> > essentially causing disk access for every single byte during readByte().
> >
> > Does this warrant a JIRA for regression?
> >
> > As mentioned, I am noticing a 10x slowdown in SegmentTermsEnum.seekExact()
> > affecting atomic update performance. For setups like mine that can't use
> > mmap due to large indexes this would be a legit regression, no?
> >
> > - Rahul
> >
> > On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > > Yes, this changed in 8.x:
> > >  - 8.0 moved the terms index off-heap for non-PK fields with
> > > MMapDirectory. https://github.com/apache/lucene/issues/9681
> > >  - Then in 8.6 the FST was moved off-heap all the time.
> > > https://github.com/apache/lucene/issues/10297
> > >
> > > More generally, there are a few files that are no longer loaded in heap
> > > in 8.x. It should be possible to load them back in heap by doing
> > > something like this (beware, I did not actually test this code):
> > >
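> > > import java.io.IOException;
> > > import java.nio.ByteBuffer;
> > > import java.nio.ByteOrder;
> > > import java.util.Collections;
> > > import org.apache.lucene.store.ByteBuffersDataInput;
> > > import org.apache.lucene.store.ByteBuffersIndexInput;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.FilterDirectory;
> > > import org.apache.lucene.store.IOContext;
> > > import org.apache.lucene.store.IndexInput;
> > >
> > > class MyHeapDirectory extends FilterDirectory {
> > >
> > >   MyHeapDirectory(Directory in) {
> > >     super(in);
> > >   }
> > >
> > >   @Override
> > >   public IndexInput openInput(String name, IOContext context) throws IOException {
> > >     if (context.load == false) {
> > >       // Not flagged for loading: keep reading from the wrapped directory.
> > >       return super.openInput(name, context);
> > >     } else {
> > >       // Flagged for loading: copy the whole file into a heap buffer.
> > >       try (IndexInput in = super.openInput(name, context)) {
> > >         byte[] bytes = new byte[Math.toIntExact(in.length())];
> > >         // readBytes takes (destination, offset, length).
> > >         in.readBytes(bytes, 0, bytes.length);
> > >         ByteBuffer bb =
> > >             ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asReadOnlyBuffer();
> > >         return new ByteBuffersIndexInput(
> > >             new ByteBuffersDataInput(Collections.singletonList(bb)),
> > >             "ByteBuffersIndexInput(" + name + ")");
> > >       }
> > >     }
> > >   }
> > >
> > > }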
> > >
> > > On Tue, Jun 6, 2023 at 3:41 PM Rahul Goswami <rahul196...@gmail.com>
> > > wrote:
> > > >
> > > > Thanks Adrien. Is this behavior of FST something that has changed in
> > > > Lucene 8.x (from 7.x)?
> > > > Also, is the terms index not loaded into memory anymore in 8.x?
> > > >
> > > > To your point on MMapDirectoryFactory, it is much faster as you
> > > > anticipated, but the indexes commonly being >1 TB makes the Windows
> > > > machine freeze to a point I sometimes can't even connect to the VM.
> > > > SimpleFSDirectory works well for us from that standpoint.
> > > >
> > > > To add, both NIOFS and SimpleFS have similar indexing benchmarks on
> > > > Windows. I understand it is because of the Java bug which synchronizes
> > > > internally in the native call for NIOFS.
> > > >
> > > > -Rahul
> > > >
> > > > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand <jpou...@gmail.com> wrote:
> > > >
> > > > > +Alan Woodward helped me better understand what is going on here.
> > > > > BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> > > > > doesn't play well with the fact that the FST reads bytes backwards:
> > > > > every call to readByte() triggers a refill of 1kB because it wants to
> > > > > read the byte that is just before what the buffer contains.
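
The refill behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ForwardBufferedReader and BackwardReadDemo are hypothetical names, the class is not Lucene's BufferedIndexInput, and it assumes that on a buffer miss the buffer is refilled starting at the requested position and extending forward. It tracks which byte range is buffered and counts refills; it does no real IO.

```java
/** Toy model of a forward-filling read buffer (hypothetical, not Lucene code). */
final class ForwardBufferedReader {
    private final byte[] file;     // stands in for the on-disk file
    private final int bufferSize;  // capacity of the read buffer in bytes
    private long bufferStart = 0;  // file offset of the first buffered byte
    private int bufferLength = 0;  // number of bytes currently buffered
    private int refills = 0;       // how many times the buffer was refilled

    ForwardBufferedReader(byte[] file, int bufferSize) {
        this.file = file;
        this.bufferSize = bufferSize;
    }

    byte readByteAt(long pos) {
        boolean miss = pos < bufferStart || pos >= bufferStart + bufferLength;
        if (miss) {
            // Assumption: a miss refills from the requested position forward.
            bufferStart = pos;
            bufferLength = (int) Math.min(bufferSize, file.length - pos);
            refills++;
        }
        // Only the buffered range is modeled; the byte comes from the array.
        return file[(int) pos];
    }

    int refills() {
        return refills;
    }
}

final class BackwardReadDemo {
    /** Reads every byte from first to last and returns the refill count. */
    static int refillsForward(int fileSize, int bufferSize) {
        ForwardBufferedReader r = new ForwardBufferedReader(new byte[fileSize], bufferSize);
        for (long pos = 0; pos < fileSize; pos++) {
            r.readByteAt(pos);
        }
        return r.refills();
    }

    /** Reads every byte from last to first and returns the refill count. */
    static int refillsBackward(int fileSize, int bufferSize) {
        ForwardBufferedReader r = new ForwardBufferedReader(new byte[fileSize], bufferSize);
        for (long pos = fileSize - 1; pos >= 0; pos--) {
            r.readByteAt(pos);
        }
        return r.refills();
    }

    public static void main(String[] args) {
        System.out.println("forward refills:  " + refillsForward(4096, 1024));
        System.out.println("backward refills: " + refillsBackward(4096, 1024));
    }
}
```

With a 4096-byte file and a 1024-byte buffer, the forward scan in this model refills 4 times and the backward scan refills 4096 times, once per byte, because each requested byte sits just before the start of the buffered range.
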
> > > > >
> > > > > On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand <jpou...@gmail.com> wrote:
> > > > > >
> > > > > > My best guess based on your description of the issue is that
> > > > > > SimpleFSDirectory doesn't like the fact that the terms index now
> > > > > > reads data directly from the directory instead of loading the terms
> > > > > > index in heap. Would you be able to run the same benchmark with
> > > > > > MMapDirectory to check if it addresses the regression?
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami <rahul196...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > > We started experiencing slowness with atomic updates in Solr after
> > > > > > > upgrading from 7.7.2 to 8.11.1. Running several tests revealed the
> > > > > > > slowness to be in RealTimeGet's SolrIndexSearcher.getFirstMatch()
> > > > > > > call, which eventually calls Lucene's SegmentTermsEnum.seekExact().
> > > > > > >
> > > > > > > In the benchmarks I ran, 8.11.1 is about 10x slower than 7.7.2.
> > > > > > > After discussion on the Solr mailing list I created the below JIRA:
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/SOLR-16838
> > > > > > >
> > > > > > > The thread dumps collected show a lot of threads stuck in the
> > > > > > > FST.findTargetArc() method.
> > > > > > >
> > > > > > > Environment details:
> > > > > > > - Java 11 on Windows server
> > > > > > > - Xms1536m Xmx3072m
> > > > > > > - Indexing client code running 15 parallel threads indexing in
> > > > > > >   batches of 1000 on a standalone core.
> > > > > > > - using SimpleFSDirectoryFactory (since Mmap doesn't quite work
> > > > > > >   well on Windows for our index sizes which commonly run north of
> > > > > > >   1 TB)
> > > > > > >
> > > > > > > https://drive.google.com/drive/folders/1q2DPNTYQEU6fi3NeXIKJhaoq3KPnms0h?usp=sharing
> > > > > > >
> > > > > > > Is there a known issue with slowness with TermsEnum.seekExact() in
> > > > > > > Lucene 8.x?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Rahul
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Adrien
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Adrien
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > Adrien
> > >
> > >
> > >
>
>
>
> --
> Adrien
>
>
>
