Hi Rob,

The statistics I shared were measured with a single indexing thread. I want to keep using only one thread and reach a data rate of up to 10 MBps (megabytes per second). I believe that should be achievable with a single thread.

Regards,
Sandeep

On Tue, Feb 23, 2016 at 12:50 PM, Rob Audenaerde <[email protected]> wrote:

> Hi Sandeep,
>
> How many threads do you use to do the indexing? The benchmarks of Lucene
> are done on >20 threads IIRC.
>
> -Rob
>
> On Tue, Feb 23, 2016 at 8:01 AM, sandeep das <[email protected]> wrote:
>
> > Hi,
> >
> > I've implemented a tool using lucene-5.2.0 to index my CSV files. The
> > tool is reading data from CSV files(residing on disk) and creating
> > indexes on local disk. It is able to process 3.5 MBps data. There are
> > overall 46 fields being added in one document. They are only of three
> > data types 1. Integer, 2. Long, 3. String.
> > All these fields are part of one CSV record and they are parsed using
> > custom CSV parser which is faster than any split method of string.
> >
> > I've configured the following parameters to create indexWriter
> > 1. setOpenMode(OpenMode.CREATE)
> > 2. setCommitOnClose(true)
> > 3. setRAMBufferSizeMB(512) // Tried 256, 312 as well but performance is
> >    almost same.
> >
> > I've read over several blogs that lucene works way faster than these
> > figures. So, I thought there are some bottlenecks in my code and
> > profiled it using jvisualvm. The application is spending most of the
> > time in DefaultIndexChain.processField i.e. 53% of total time.
> >
> > Following is the split of CPU usage in this application:
> > 1. reading data from disk is taking 5% of total duration
> > 2. adding document is taking 93% of total duration.
> >
> >    - postUpdate -> 12.8%
> >    - doAfterDocument -> 20.6%
> >    - updateDocument -> 59.8%
> >    - finishDocument -> 1.7%
> >    - finishStoreFields -> 4.8%
> >    - processFields -> 53.1%
> >
> > I'm also attaching the screen shot of call graph generated by jvisualvm.
> >
> > I've taken care of following points:
> > 1. create only one instance of indexWriter
> > 2. create only one instance of document and reuse it through out the
> >    life time of application
> > 3. There will be no update in the documents hence only addDocument is
> >    invoked.
> >    Note: After going through the code I found out that addDocument is
> >    internally calling updateDocument only. Is there any way by which we
> >    can avoid calling updateDocument and only use addDocument API?
> > 4. Using setValue APIs to set the pre created fields and reusing these
> >    fields to create indexes.
> >
> > Any tip to improve the performance will be immensely appreciated.
> >
> > Regards,
> > Sandeep
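
For reference, the gap between the measured 3.5 MBps and the 10 MBps target can be turned into a rough single-thread estimate. The sketch below is only an Amdahl-style back-of-the-envelope calculation, not anything from Lucene: the class and method names are made up for illustration, and it assumes the profiled CPU shares (93% in adding documents, the rest elsewhere) map directly onto wall-clock time and that everything outside addDocument stays as it is.

```java
// Hypothetical helper, not part of Lucene: estimates how much faster one
// stage must become for a single-threaded pipeline to hit a target rate.
public class ThroughputEstimate {

    /** Overall speedup needed to move from the current rate to the target rate. */
    static double overallSpeedup(double currentMBps, double targetMBps) {
        return targetMBps / currentMBps;
    }

    /**
     * Amdahl-style estimate of the speedup one stage needs, assuming the time
     * spent outside that stage does not change.
     *
     * @param stageFraction share of total time spent in the stage (0..1)
     * @return required speedup of the stage, or positive infinity if the target
     *         cannot be reached by speeding up that stage alone
     */
    static double requiredStageSpeedup(double currentMBps, double targetMBps,
                                       double stageFraction) {
        double overall = overallSpeedup(currentMBps, targetMBps);
        double timeBudget = 1.0 / overall;        // new total time relative to today
        double fixedTime = 1.0 - stageFraction;   // time outside the stage, unchanged
        double stageBudget = timeBudget - fixedTime;
        if (stageBudget <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return stageFraction / stageBudget;
    }

    public static void main(String[] args) {
        // Figures taken from the thread: 3.5 MBps measured, 10 MBps wanted,
        // 93% of the time spent adding documents.
        double current = 3.5;
        double target = 10.0;
        double addDocumentShare = 0.93;

        System.out.printf("overall speedup needed: %.2fx%n",
                overallSpeedup(current, target));
        System.out.printf("addDocument speedup needed: %.2fx%n",
                requiredStageSpeedup(current, target, addDocumentShare));
    }
}
```

Under those assumptions the whole pipeline has to run roughly 2.9 times faster, which means the addDocument path alone would have to become roughly 3.3 times faster. Removing the 5% spent reading from disk would not be enough by itself.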
