Hi Erick,

Here are a few statistics we gathered recently. On a 64-bit server, using the RAMDirectory implementation, it takes 11 seconds for 500 RTF documents to get indexed on the fly. It takes 25 seconds for the same scenario if each RTF is first converted to a text file using Aspose and the text files are then indexed. We then resorted to the Apache Tika toolkit, but it is failing for 1% of the RTFs we have, so we don't have confidence in that toolkit yet. (Those 1% of files were successfully parsed by Aspose.)
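For reference, a minimal sketch of the Tika extraction step (assuming Tika 1.x with tika-parsers on the classpath; an illustration rather than our exact code):

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class RtfTextExtractor {
        // The Tika facade auto-detects the file type (RTF here) and picks a parser.
        private static final Tika TIKA = new Tika();

        static {
            // parseToString truncates at 100,000 characters by default;
            // -1 disables the limit, which matters for large RTFs.
            TIKA.setMaxStringLength(-1);
        }

        // Returns the plain-text body of the given RTF file.
        public static String extract(File rtfFile) throws IOException, TikaException {
            return TIKA.parseToString(rtfFile);
        }
    }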
We did a detailed analysis of each step and observed that indexing each RTF file (i.e., adding the path and the content, read with a FileReader) happened within the same millisecond. On average it took 95 ms for each file to get indexed, and anywhere between 200 and 500 ms for each file to get converted to text using Aspose.

Thanks,
Shruthi Sethi
SR. SOFTWARE ENGINEER
iMedX
OFFICE: 033-4001-5789 ext. N/A
MOBILE: 91-9903957546
EMAIL: sse...@imedx.com
WEB: www.imedx.com

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, May 26, 2014 9:46 PM
To: java-user
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

bq: We don't want to search on the complete document store

Why not? Alexandre's comment is spot on. For 500 docs you could easily form a filter query like &fq=id1 OR id2 OR id3... (Solr-style, but easily done in Lucene; see the sketch after this message). You get these IDs from the DB search. This will still be MUCH faster than indexing on the fly. The default maxBooleanClauses of 1024 is just a configuration issue; I've seen it set at 10 times that. And you could cache the filter if that fits your use case.

Unless you _really_ can show that this solution is untenable, I think you're making this problem far too hard for yourself. If you insist on indexing these docs on the fly, you'll have to live with the performance hit. There's no real magic bullet to make your indexing sub-second.

As others have said, indexing 500 docs seems like it shouldn't take as long as you're reporting. I personally suspect that your problem is somewhere in the acquisition phase. What happens if you comment out all the code that actually does anything with Lucene and just go through the motions of getting the docs from the system of record? My bet is that if you comment out the indexing part, you'll find you spend 18 of your 20 seconds there (SWAG). If my bet is correct, there's _nothing_ you can do as far as Lucene is concerned: Lucene has nothing to do with the speed issue; it's acquiring the docs in the first place. And if I'm wrong, there's also virtually nothing you can do. Lucene is fast, very fast. You're apparently indexing things that are big/complex/whatever.

Really, please explain why indexing all the docs and using a filter of the IDs from the DB won't work. This really, really smells like an XY problem, and a flawed approach that is best scrapped.

Best,
Erick
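A minimal sketch of the ID-filter approach Erick describes (Lucene 4.x; TermsFilter is in the lucene-queries module, and the "id" field name and index directory are illustrative):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class FilteredSearch {
        // dbIds: the ~500 document ids returned by the stored procedure.
        public static TopDocs searchSubset(Query userQuery, List<String> dbIds,
                                           File indexDir) throws Exception {
            List<Term> idTerms = new ArrayList<Term>();
            for (String id : dbIds) {
                idTerms.add(new Term("id", id)); // one term per allowed document
            }
            TermsFilter onlyThese = new TermsFilter(idTerms);

            IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
            try {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Scoring is restricted to the 500 ids; no per-request index is built.
                return searcher.search(userQuery, onlyThese, 500);
            } finally {
                reader.close();
            }
        }
    }

In practice you would keep the IndexReader open across requests and only rebuild the TermsFilter per request.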
On Mon, May 26, 2014 at 6:08 AM, Alexandre Patry <alexandre.pa...@keatext.com> wrote:
> On 26/05/2014 05:40, Shruthi wrote:
>>
>> Hi All,
>>
>> Thanks for the suggestions, but there is a slight difference in the
>> requirements:
>> 1. We don't index/search the 10 million documents for a keyword; we do
>> it on only 500 documents, because the final result must come from that
>> set of 500 documents.
>> 2. We have already filtered the 500 documents out of the 10M+ documents
>> with a DB stored procedure, which has nothing to do with any kind of
>> search keywords.
>> 3. Our search algorithm plays a vital role on this new set of 500
>> documents.
>> 4. We can't avoid on-the-fly indexing, because the document set to be
>> indexed is random and ever changing.
>> Although we could index the existing 10M+ docs beforehand and keep the
>> indexes ready, we don't want to search the complete document store; we
>> only want to search the 500 documents obtained above.
>>
>> Is there any best alternative to this requirement?
>
> You could index all 10 million documents and use a custom filter [1] with
> your queries to specify which 500 documents to look at.
>
> Hope this helps,
>
> Alexandre
>
> [1] http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html
>
>> Thanks,
>>
>> Shruthi Sethi
>> SR. SOFTWARE ENGINEER
>> iMedX
>> OFFICE: 033-4001-5789 ext. N/A
>> MOBILE: 91-9903957546
>> EMAIL: sse...@imedx.com
>> WEB: www.imedx.com
>>
>> -----Original Message-----
>> From: shashi....@gmail.com [mailto:shashi....@gmail.com] On Behalf Of Shashi Kant
>> Sent: Saturday, May 24, 2014 5:55 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>
>> To second Vitaly's suggestion: you should consider using Apache Solr
>> instead - it handles such issues OOTB.
>>
>> On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:
>>>
>>> At the risk of sounding overly critical here, I would say you need to
>>> scrap your entire approach of building one small index per request,
>>> and just build your entire searchable data store in Lucene/Solr. This
>>> is the simplest and probably the most maintainable and scalable
>>> solution. Even if your index contains 10M+ documents, returning at
>>> most 500 search results should be lightning fast compared to the
>>> latencies you're seeing right now. To facilitate data export from the
>>> DB, take a look at this:
>>> http://wiki.apache.org/solr/DataImportHandler
>>>
>>> On Tue, May 20, 2014 at 7:36 AM, Shruthi <sse...@imedx.com> wrote:
>>>>
>>>> -----Original Message-----
>>>> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
>>>> Sent: Tuesday, May 20, 2014 3:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>>>
>>>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>>>
>>>> Toke:
>>>>> Is 20 seconds an acceptable response time for your users?
>>>>>
>>>>> Shruthi: It's definitely not acceptable. PFA the piece of code that
>>>>> we are using; it's taking 20 seconds. That's why I drafted this
>>>>> ticket, to see where I was going wrong.
>>>>
>>>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>>>> into account large documents, 20 seconds sounds like quite a bit.
>>>> Shruthi: I attached the code snippet in my previous mail. Do you
>>>> suspect foul play there?
>>>>
>>>>> Shruthi: Well, it's a two-stage process: the client looks at
>>>>> historical data based on parameters like names, dates, MRN, fields,
>>>>> etc., so the query gets the data set fulfilling those requirements.
>>>>>
>>>>> If the client is interested in doing a text search, he then passes
>>>>> the search phrase to run over that result set.
>>>>
>>>> So it is not possible for a client to perform a broad phrase search
>>>> to start with. And it sounds like your DB queries are all simple
>>>> matching? No complex joins and such? If so, this calls even more for
>>>> a full Lucene-index solution, which handles all aspects of the search
>>>> process.
>>>> Shruthi: We call a DB stored procedure to get the result set we work
>>>> with. We will be using the highlighter API, and I don't think
>>>> MemoryIndex can be used with the highlighter.
>>>>
>>>> - Toke Eskildsen, State and University Library, Denmark
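On the highlighter point above: as far as I can tell, the Highlighter re-analyzes whatever raw text it is handed, so it is not tied to any particular index implementation. A minimal sketch against the Lucene 4.x highlighter module (field name illustrative):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
    import org.apache.lucene.util.Version;

    public class SnippetMaker {
        // Returns the best <B>-tagged fragment of storedText for the query.
        public static String bestFragment(Query query, String fieldName,
                                          String storedText) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
            Highlighter highlighter =
                    new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
            // The text is re-tokenized here, so where the match came from
            // (FSDirectory, RAMDirectory, MemoryIndex) does not matter.
            return highlighter.getBestFragment(analyzer, fieldName, storedText);
        }
    }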
>
> --
> Alexandre Patry, Ph.D
> Chercheur / Researcher
> http://KeaText.com
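For completeness, a minimal sketch of the on-the-fly RAMDirectory indexing loop discussed in this thread (Lucene 4.x API; the field names and document shape are illustrative guesses, since the code attachment is not reproduced here):

    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class OnTheFlyIndexer {
        // texts: [id, extractedText] pairs, text already pulled out of the RTFs.
        public static RAMDirectory buildIndex(List<String[]> texts) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
            IndexWriter writer =
                    new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_48, analyzer));
            try {
                for (String[] t : texts) {
                    Document doc = new Document();
                    doc.add(new StringField("id", t[0], Field.Store.YES));    // exact-match key
                    doc.add(new TextField("content", t[1], Field.Store.YES)); // analyzed body
                    writer.addDocument(doc);
                }
            } finally {
                writer.close(); // a single commit on close is enough for 500 docs
            }
            return dir;
        }
    }

Timing this loop separately from the text-extraction step is the easiest way to see which side dominates the 11-25 seconds reported above.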