rclabo commented on issue #935:
URL: https://github.com/apache/lucenenet/issues/935#issuecomment-2174514794

   @superkelvint Thank you for providing some code to show how you are using 
Lucene.NET.  That helps a lot.  Maybe we can work together to make sense out of 
what you are seeing.
   
   One challenge in creating any kind of benchmark is having enough data to 
index so that it takes some time for Lucene.NET to complete. Lucene.NET is very 
fast, so thousands of records of data are needed.  I went on a hunt for an 
open-source data set that we could use. Ultimately I chose the [Book Dataset 
from Kaggle.com]( 
https://www.kaggle.com/datasets/saurabhbagchi/books-dataset/data), which is 
available under a CC0 public domain license.
   
   The dataset has approx. 271K records, and each record contains several 
fields (book title, author, URL to cover photo, etc.).  To allow the program to 
load data in parallel, I created 21 copies of this data, which is 5.4 million 
records total.  Because it’s sometimes nice to be able to run the program more 
quickly when tweaking settings to see whether they matter much, I also created 
21 “small” data files of about 50,000 records each, for a total of 1.05 million 
records.
   
   A single flag at the top of the program causes it to index big or small 
files. Either way, it’s indexing 21 files.
   I wrote a `DataReader` class that can read one record at a time from one of 
these data files. It uses a `Stream`, which is a performant approach in .NET. 
The implementation uses synchronous IO, but I also tried a version of the app 
with async IO (async/await) for reading these data files, and it made no 
difference from a performance perspective.
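   For whoever takes on the Java port, a record-at-a-time reader could be 
sketched roughly like this (an assumption, not the app’s actual code: the 
delimiter and field layout here are placeholders to be matched to the real 
data files):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical Java counterpart of the C# DataReader: streams one record at a
// time instead of loading the whole file into memory. One record per line and
// a ';' field delimiter are assumptions; adjust to the dataset's real format.
class DataReader implements AutoCloseable {
    private final BufferedReader reader;

    DataReader(Reader source) {
        this.reader = new BufferedReader(source);
    }

    /** Returns the next record's fields, or null at end of input. */
    String[] nextRecord() throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return null;
        }
        return line.split(";", -1); // limit -1 keeps trailing empty fields
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

As in the C# version, a buffered synchronous reader should be plenty fast here; 
the indexing work, not the file IO, is the bottleneck.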
   
   Using this console app I tested the performance of Lucene.NET 4.8 Beta 16 
vs. the current master vs. [PR#940](https://github.com/apache/lucenenet/pull/940), 
and I got more or less the same results from each.
   
   I tried different levels of parallelism to see how that affected 
performance, and also tried different values of `RAMBufferSizeMB`.  In the 
process, I came to realize that by default Lucene.NET 4.8 will use a MAX of 8 
threads for creating index segments via DWPT (DocumentsWriterPerThread) 
threads when using the IndexWriter. At first, I thought perhaps this was the 
issue and a larger value just needed to be passed via 
`IndexWriterConfig.MaxThreadStates` when the IndexWriter is created.  But that 
didn’t turn out to be true.  In fact, on my hardware I ultimately got faster 
performance by passing a SMALLER value than the default for the 
`IndexWriterConfig.MaxThreadStates` value.
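   For the Java side of the comparison, the equivalent knobs in Lucene 4.8 
would be set roughly like this (a sketch, assuming lucene-core 4.8.x on the 
classpath; the specific values shown are just the knobs to sweep, not 
recommendations):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class IndexWriterSetup {
    // Opens an IndexWriter with the two settings discussed above exposed
    // as the values to experiment with on each machine.
    static IndexWriter open(File indexDir) throws IOException {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
        cfg.setRAMBufferSizeMB(256);  // larger buffer -> fewer, larger flushes
        cfg.setMaxThreadStates(4);    // caps concurrent DWPTs; the default is 8
        Directory dir = FSDirectory.open(indexDir);
        return new IndexWriter(dir, cfg);
    }
}
```

`setMaxThreadStates` mirrors Lucene.NET’s `IndexWriterConfig.MaxThreadStates`, 
so the same smaller-than-default experiment can be repeated on the Java side.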
   
   I’m providing this full working .NET 8 console app, including the data 
used and the solution file and structure, so you can easily run it on your 
hardware and try different values for the number of threads used to read the 
data files, `MaxThreadStates`, and `RAMBufferSizeMB`. What I’m most interested 
in is having you port this code to Java and run it against the exact same 
dataset in a Java environment.  The solution with all the data was 412MB, 
which GitHub wouldn't let me upload (it has a 25MB limit), so I trimmed the 
solution down to one small data file and one big one; when the app runs, it 
creates the other 20 small data files and 20 big ones on the fly.  That got 
the app just barely under the upload limit. :-)
   
   Ideally, run the console app using the big data set, which is how I have it 
configured by default.  The app auto-deletes the LuceneIndex folder before 
each run, so you can run it multiple times in a row and you will be left with 
just the index from the last run in the LuceneIndex folder.
   
   I would love to hear back your Lucene.NET 4.8 results vs. the Java results, 
and to have you contribute back your Java port of this console app.  I'd also 
like to understand exactly which version of Java Lucene you are benchmarking 
against.
   
   On my machine, Lucene.NET can index the 5.65 million book records in 47 
seconds, which is a rate of approximately 119K records per second.  That time 
includes the final commit.
   
   To my eyes that seems pretty darn fast.  But if the Java version can really 
index these same documents much faster that would be great to know.
   
   
[LuceneIndexingPerformane.zip](https://github.com/user-attachments/files/15877758/LuceneIndexingPerformane.zip)
   
   
   

