rclabo commented on issue #935: URL: https://github.com/apache/lucenenet/issues/935#issuecomment-2174514794
@superkelvint Thank you for providing some code to show how you are using Lucene.NET. That helps a lot. Maybe we can work together to make sense of what you are seeing.

One challenge in creating any kind of benchmark is having enough data to index so that it takes a measurable amount of time for Lucene.NET to complete. Lucene.NET is very fast, so many thousands of records are needed. I went on a hunt for an open-source data set that we could use. Ultimately I chose the [Book Dataset from Kaggle.com](https://www.kaggle.com/datasets/saurabhbagchi/books-dataset/data), which is available under a CC0 public domain license. The dataset has approx. 271K records and each record contains several fields (Book Title, Author, URL to cover photo, etc.).

To allow the program to load data in parallel, I created 21 copies of this data, which is about 5.65 million records in total. Because sometimes it’s nice to be able to run the program more quickly when tweaking settings to see if they matter much, I also created 21 “small” data files, which contain about 50,000 records each, for a total of 1.05 million records. A single flag at the top of the program selects the big or small files. Either way, it’s indexing 21 files.

I wrote a `DataReader` class that can read one record at a time from one of these data files. It uses a `Stream`, which in .NET is a performant approach. The implementation uses synchronous IO, but I tried a version of the app with async IO (async/await) for reading these data files, and it made no difference from a performance perspective.

Using this console app I tested the performance of Lucene.NET Beta 16 vs. the current master vs. [PR#940](https://github.com/apache/lucenenet/pull/940), and I get more or less the same results from each. I tried different levels of parallelism to see how that affected performance and also tried different values of `RAMBufferSizeMB`.
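
For anyone reproducing this setup outside .NET, the overall shape of the harness described above (one task per data file, a fixed number of reader threads, every record handed to a shared writer, wall-clock timing around the whole run) can be sketched in plain Java. This is only an illustrative sketch, not the code in the attached app: `IndexingHarness`, `runParallel`, and `recordsPerSecond` are hypothetical names, the in-memory lists stand in for the 21 data files, and the `sink` callback stands in for whatever call adds a document to the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical harness: names and structure are assumptions for illustration.
final class IndexingHarness {

    /** Throughput of a run: records handled divided by elapsed seconds. */
    static double recordsPerSecond(long records, double seconds) {
        return records / seconds;
    }

    /**
     * Submits one task per source to a fixed pool of reader threads.
     * Each task passes every record to the sink (a stand-in for the
     * indexing call) and the method returns the total record count.
     */
    static long runParallel(List<? extends Iterable<String>> sources,
                            int readerThreads,
                            Consumer<String> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(readerThreads);
        AtomicLong total = new AtomicLong();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Iterable<String> source : sources) {
                pending.add(pool.submit(() -> {
                    for (String record : source) {
                        sink.accept(record);
                        total.incrementAndGet();
                    }
                }));
            }
            // Wait for every task so that failures surface here.
            for (Future<?> task : pending) {
                task.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return total.get();
    }

    public static void main(String[] args) {
        // 21 fake "files" of 1,000 records each, standing in for the data files.
        List<List<String>> files = new ArrayList<>();
        for (int i = 0; i < 21; i++) {
            files.add(Collections.nCopies(1_000, "title,author,cover-url"));
        }
        AtomicLong indexed = new AtomicLong();

        long start = System.nanoTime();
        long count = runParallel(files, 4, record -> indexed.incrementAndGet());
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("%d records in %.3f s (%.0f records/s)%n",
                count, seconds, recordsPerSecond(count, seconds));
    }
}
```

Keeping the reader-thread count as a plain parameter makes it easy to sweep it independently of the writer-side settings, which is the kind of comparison described here.
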
In the process, I came to realize that by default Lucene.NET 4.8 will use a MAX of 8 threads for creating index segments via DWPT (`DocumentsWriterPerThread`) threads when using the `IndexWriter`. At first, I thought perhaps this was the issue and a larger value just needed to be passed via `IndexWriterConfig.MaxThreadStates` when the `IndexWriter` is created. But that didn’t turn out to be true. In fact, on my hardware I ultimately got faster performance by passing a SMALLER value than the default for `IndexWriterConfig.MaxThreadStates`.

I’m providing this full working .NET 8 console app, including the data, solution file, and project structure, so you can easily run it on your hardware and try different values for the number of threads used to read the data files, `MaxThreadStates`, and `RAMBufferSizeMB`. What I’m most interested in is having you port this code to Java and run it using the exact same dataset in a Java environment.

The solution with all the data was 412MB, which GitHub wouldn't let me upload (it has a 25MB limit), so I trimmed the solution down to one small data file and one big one; when the app runs, it creates the other 20 small data files and 20 big ones on the fly. This got the app just barely under the upload limit. :-) Ideally, run the console app using the big data set, which is how I have the app configured by default. The app auto-deletes the LuceneIndex folder before each run, so you can run it multiple times in a row and you will be left with just the index from the last run in the LuceneIndex folder.

I would love to hear back your Lucene.NET 4.8 results vs. Java results and have you contribute back your Java port of this console app. I'd also like to understand exactly which version of Java Lucene you are benchmarking against. On my machine, Lucene.NET can index the 5.65 million book records in 47 seconds, which is a rate of approximately 119K records per second.
That average time includes the final commit time. To my eyes that seems pretty darn fast. But if the Java version can really index these same documents much faster, that would be great to know.

[LuceneIndexingPerformane.zip](https://github.com/user-attachments/files/15877758/LuceneIndexingPerformane.zip)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org