rclabo commented on issue #935:
URL: https://github.com/apache/lucenenet/issues/935#issuecomment-2174514794

   @superkelvint Thank you for providing some code to show how you are using 
Lucene.NET.  That helps a lot.  Maybe we can work together to make sense out of 
what you are seeing.
   
   One challenge in creating any kind of benchmark is having enough data to 
index so that it takes some time for Lucene.NET to complete. Lucene.NET is very 
fast, so thousands of records of data are needed.  I went on a hunt for an 
open-source data set that we could use. Ultimately I chose the [Book Dataset 
from Kaggle.com]( 
https://www.kaggle.com/datasets/saurabhbagchi/books-dataset/data), which is 
available under a CC0 public domain license.
   
   The dataset has approx. 271K records, and each record contains several 
fields (book title, author, URL to cover photo, etc.).  To allow the program to 
load data in parallel, I created 21 copies of this data, which is 5.4 million 
records total.  Because it’s sometimes nice to be able to run the program more 
quickly when tweaking settings to see whether they matter much, I also created 
21 “small” data files of about 50,000 records each, for a total of 1.05 million 
records.
   
   A single flag at the top of the program causes it to index big or small 
files. Either way, it’s indexing 21 files.
   I wrote a `DataReader` class that can read one record at a time from one of 
these data files. It uses a `Stream`, which is a performant approach in .NET. 
The implementation uses synchronous IO, but I also tried a version of the app 
with async IO (async/await) for reading these data files, and it made no 
difference from a performance perspective.
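   For whoever takes on the Java port, a record-at-a-time reader could be 
sketched roughly like this (an assumption, not the app’s actual code: the 
delimiter and field layout here are placeholders to be matched to the real 
data files):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical Java counterpart of the C# DataReader: streams one record at a
// time instead of loading the whole file into memory. One record per line and
// a ';' field delimiter are assumptions; adjust to the dataset's real format.
class DataReader implements AutoCloseable {
    private final BufferedReader reader;

    DataReader(Reader source) {
        this.reader = new BufferedReader(source);
    }

    /** Returns the next record's fields, or null at end of input. */
    String[] nextRecord() throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return null;
        }
        return line.split(";", -1); // limit -1 keeps trailing empty fields
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

As in the C# version, a buffered synchronous reader should be plenty fast here; 
the indexing work, not the file IO, is the bottleneck.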
   
   Using this console app I tested the performance of Lucene.NET 4.8 Beta 16 
vs. the current master vs. [PR#940](https://github.com/apache/lucenenet/pull/940), 
and I got more or less the same results from each.
   
   I tried different levels of parallelism to see how that affected 
performance, and also tried different values of `RAMBufferSizeMB`.  In the 
process, I came to realize that by default Lucene.NET 4.8 will use a MAX of 8 
threads for creating index segments via DWPT (DocumentsWriterPerThread) 
threads when using the IndexWriter. At first, I thought perhaps this was the 
issue and a larger value just needed to be passed via 
`IndexWriterConfig.MaxThreadStates` when the IndexWriter is created.  But that 
didn’t turn out to be true.  In fact, on my hardware I ultimately got faster 
performance by passing a SMALLER value than the default for the 
`IndexWriterConfig.MaxThreadStates` value.
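   For the Java side of the comparison, the equivalent knobs in Lucene 4.8 
would be set roughly like this (a sketch, assuming lucene-core 4.8.x on the 
classpath; the specific values shown are just the knobs to sweep, not 
recommendations):

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class IndexWriterSetup {
    // Opens an IndexWriter with the two settings discussed above exposed
    // as the values to experiment with on each machine.
    static IndexWriter open(File indexDir) throws IOException {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
        cfg.setRAMBufferSizeMB(256);  // larger buffer -> fewer, larger flushes
        cfg.setMaxThreadStates(4);    // caps concurrent DWPTs; the default is 8
        Directory dir = FSDirectory.open(indexDir);
        return new IndexWriter(dir, cfg);
    }
}
```

`setMaxThreadStates` mirrors Lucene.NET’s `IndexWriterConfig.MaxThreadStates`, 
so the same smaller-than-default experiment can be repeated on the Java side.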
   
   I’m providing this full working .NET 8 console app, including the data 
used and the solution file and structure, so you can easily run it on your 
hardware and try different values for the number of threads used to read the 
data files, `MaxThreadStates`, and `RAMBufferSizeMB`. What I’m most interested 
in is having you port this code to Java and run it against the exact same 
dataset in a Java environment.  The solution with all the data was 412MB, 
which GitHub wouldn't let me upload (it has a 25MB limit), so I trimmed the 
solution down to one small data file and one big one; when the app runs, it 
creates the other 20 small data files and 20 big ones on the fly.  That got 
the app just barely under the upload limit. :-)
   
   Ideally, run the console app using the big data set, which is how I have it 
configured by default.  The app auto-deletes the LuceneIndex folder before 
each run, so you can run it multiple times in a row and you will be left with 
just the index from the last run in the LuceneIndex folder.
   
   I would love to hear back your Lucene.NET 4.8 results vs. the Java results, 
and to have you contribute back your Java port of this console app.  I'd also 
like to understand exactly which version of Java Lucene you are benchmarking 
against.
   
   On my machine, Lucene.NET can index the 5.65 million book records in 47 
seconds, which is a rate of approximately 119K records per second.  That time 
includes the final commit.
   
   To my eyes that seems pretty darn fast.  But if the Java version can really 
index these same documents much faster that would be great to know.
   
   
[LuceneIndexingPerformane.zip](https://github.com/user-attachments/files/15877758/LuceneIndexingPerformane.zip)
   
   
   

