You don't need to worry about the 1024 maxBooleanClauses limit; just use a TermsFilter.

https://lucene.apache.org/core/4_8_0/queries/org/apache/lucene/queries/TermsFilter.html

I use it for a similar scenario, where a data structure outside Lucene determines a subset of 1.5 million documents. To make the search (much) faster, I convert a list of IDs (the primary key in the database) into a bunch of 'id:X' terms.
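Roughly like this (a minimal sketch against the Lucene 4.8 API; the 'id' field name and the variable names are just placeholders for whatever your setup uses):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class SubsetSearch {
        // Restrict 'query' to the documents whose "id" field matches one
        // of the primary keys returned by the database.
        static TopDocs searchSubset(IndexSearcher searcher, Query query,
                                    List<String> idsFromDb) throws IOException {
            List<Term> terms = new ArrayList<Term>();
            for (String id : idsFromDb) {
                terms.add(new Term("id", id));   // one term per primary key
            }
            // Unlike BooleanQuery, TermsFilter has no clause limit.
            return searcher.search(query, new TermsFilter(terms), 500);
        }
    }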

If you have other criteria beyond those IDs (say a category id or some other grouped selection), you could index those alongside your documents and combine TermsFilters (and/or a BooleanFilter with several other filters) to make a pretty fast subset selection, as sketched below.
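A sketch of that combination, building on the TermsFilter above (again Lucene 4.8; the "category" field is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.BooleanFilter;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.Filter;

    public class CombinedFilterExample {
        // Both clauses must match: the id subset AND a category restriction.
        static Filter idsInCategory(Filter idFilter, String categoryId) {
            BooleanFilter combined = new BooleanFilter();
            combined.add(idFilter, BooleanClause.Occur.MUST);
            combined.add(new TermsFilter(new Term("category", categoryId)),
                         BooleanClause.Occur.MUST);
            return combined;
        }
    }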

It won't be faster than having a dedicated 500-document database, but if you'd otherwise have to recreate that on the fly... I'd expect this to easily beat the total time of that procedure by a few orders of magnitude.

Best regards,

Arjen

On 26-5-2014 18:15 Erick Erickson wrote:
bq: We don’t want to search on the complete document store

Why not? Alexandre's comment is spot on. For 500 docs you could easily form a filter query like fq=id:(id1 OR id2 OR id3 ...) (Solr-style, but easily done in Lucene). You get these IDs from the DB search. This will still be MUCH faster than indexing on the fly.

The default maxBooleanClauses of 1024 is just a configuration setting; I've seen it at 10 times that.

And you could cache the filter if you wanted and that fit your use case.
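A rough sketch of all of that in raw Lucene 4.x (untested; the "id" field and the parameter names are placeholders, and note that maxClauseCount is a JVM-wide static):

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class DbSubsetSearch {
        static TopDocs search(IndexSearcher searcher, Query userQuery,
                              List<String> idsFromDb) throws IOException {
            // Raise the clause limit if you ever pass more than 1024 ids.
            if (idsFromDb.size() > BooleanQuery.getMaxClauseCount()) {
                BooleanQuery.setMaxClauseCount(idsFromDb.size());
            }
            // One SHOULD clause per id: a document matches if its id is in the list.
            BooleanQuery idQuery = new BooleanQuery();
            for (String id : idsFromDb) {
                idQuery.add(new TermQuery(new Term("id", id)), BooleanClause.Occur.SHOULD);
            }
            // CachingWrapperFilter caches the matching docs per segment, so
            // repeated searches against the same subset don't pay the cost again.
            Filter idFilter = new CachingWrapperFilter(new QueryWrapperFilter(idQuery));
            return searcher.search(userQuery, idFilter, 500);
        }
    }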

Unless you _really_ can show that this solution is untenable, I think you're making this problem far too hard for yourself.

If you insist on indexing these docs on the fly, you'll have to live with the performance hit. There's no real magic bullet to make your indexing sub-second. As others have said, indexing 500 docs seems like it shouldn't take as long as you're reporting. I personally suspect that your problem is somewhere in the acquisition phase. What happens if you comment out all the code that actually does anything with Lucene and just go through the motions of getting the docs from the system-of-record in your code? My bet is that if you comment out the indexing part, you'll find you still spend 18 of your 20 seconds (SWAG).

If my bet is correct, then there's _nothing_ you can do to make this case work as far as Lucene is concerned; Lucene has nothing to do with the speed issues, it's acquiring the docs in the first place.

And if I'm wrong, then there's also virtually nothing you can do. Lucene is fast, very fast. You're apparently indexing things that are big/complex/whatever.

Really, please explain why indexing all the docs and using a filter of the IDs from the DB won't work. This really, really smells like an XY problem, and you have a flawed approach that is best scrapped.

Best,
Erick


On Mon, May 26, 2014 at 6:08 AM, Alexandre Patry
<alexandre.pa...@keatext.com> wrote:
On 26/05/2014 05:40, Shruthi wrote:

Hi All,

Thanks for the suggestions. But there is a slight difference in the requirements.
1. We don't index/search 10 million documents for a keyword; instead we do it on only 500 documents, because we are supposed to get the final result only from that set of 500 documents.
2. We have already filtered the 500 documents out of the 10M+ documents using a DB stored procedure, which has nothing to do with any kind of search keywords.
3. Our search algorithm plays a vital role on this new set of 500 documents.
4. We can't avoid on-the-fly indexing because the document set to be indexed is random and ever-changing. Although we could index the existing 10M+ docs beforehand and keep the indexes ready, we don't want to search on the complete document store. Instead we only want to search on the 500 documents obtained above.

Is there any better alternative for this requirement?

You could index all 10 million documents and use a custom filter [1] with your queries to specify which 500 documents to look at.
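A minimal sketch of such a filter (against the Lucene 4.8 API; the class name and the "id" field are made-up names, and a TermsFilter would do the same job with less code):

    import java.io.IOException;
    import java.util.Set;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.FixedBitSet;

    // Accepts only documents whose "id" field is in the given set,
    // e.g. the 500 ids returned by the stored procedure.
    public class IdSetFilter extends Filter {
        private final Set<String> allowedIds;

        public IdSetFilter(Set<String> allowedIds) {
            this.allowedIds = allowedIds;
        }

        @Override
        public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
                throws IOException {
            Terms terms = context.reader().terms("id");
            if (terms == null) {
                return null; // no "id" field in this segment -> no matches
            }
            FixedBitSet bits = new FixedBitSet(context.reader().maxDoc());
            TermsEnum te = terms.iterator(null);
            DocsEnum de = null;
            for (String id : allowedIds) {
                if (te.seekExact(new BytesRef(id))) {
                    de = te.docs(acceptDocs, de, DocsEnum.FLAG_NONE);
                    int doc;
                    while ((doc = de.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                        bits.set(doc); // mark every doc carrying this id term
                    }
                }
            }
            return bits; // FixedBitSet extends DocIdSet in Lucene 4.x
        }
    }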

Hope this helps,

Alexandre

[1]
http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html


Thanks,

Shruthi Sethi
Sr. Software Engineer, iMedX
Office: 033-4001-5789 ext. N/A | Mobile: 91-9903957546
Email: sse...@imedx.com | Web: www.imedx.com



-----Original Message-----
From: shashi....@gmail.com [mailto:shashi....@gmail.com] On Behalf Of
Shashi Kant
Sent: Saturday, May 24, 2014 5:55 AM
To: java-user@lucene.apache.org
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server

To second Vitaly's suggestion: you should consider using Apache Solr instead - it handles such issues OOTB.


On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vfunst...@gmail.com>
wrote:

At the risk of sounding overly critical here, I would say you need to scrap your entire approach of building one small index per request, and just build your entire searchable data store in Lucene/Solr. This is the simplest and probably most maintainable and scalable solution. Even if your index contains 10M+ documents, returning at most 500 search results should be lightning fast compared to the latencies you're seeing right now. To facilitate data export from the DB, take a look at this:
http://wiki.apache.org/solr/DataImportHandler
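A minimal data-config.xml for that import might look something like this (the driver, connection URL, table, and column names are of course placeholders for your own schema):

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/yourdb" user="user" password="pass"/>
      <document>
        <!-- One Solr document per row; map DB columns to index fields. -->
        <entity name="doc" query="SELECT id, title, body FROM documents">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>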


On Tue, May 20, 2014 at 7:36 AM, Shruthi <sse...@imedx.com> wrote:




-----Original Message-----
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Tuesday, May 20, 2014 3:48 PM
To: java-user@lucene.apache.org
Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit
server

On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:

Toke:

Is 20 seconds an acceptable response time for your users?

Shruthi: It's definitely not acceptable. PFA the piece of code that we are using; it's taking 20 seconds. That's why I drafted this ticket to see where I was going wrong.

Indexing 1000 documents/sec in Lucene is quite common, so even taking into account large documents, 20 seconds sounds like quite a bit.
Shruthi: I had attached the code snippet in the previous mail. Do you suspect foul play there?

Shruthi: Well, it's a two-stage process: the client is looking at historical data based on parameters like names, dates, MRN, fields, etc. So the query actually gets the data set fulfilling the requirements.

If the client is interested in doing a text search, then he would pass the search phrase on the result set.

So it is not possible for a client to perform a broad phrase search to start with. And it sounds like your DB queries are all simple matching? No complex joins and such? If so, this calls even more for a full Lucene-index solution, which handles all aspects of the search process.
Shruthi: We call a DB stored procedure to get us the result set to work with. We will be using the highlighter API, and I don't think MemoryIndex can be used with the highlighter.

- Toke Eskildsen, State and University Library, Denmark




--
Alexandre Patry, Ph.D
Chercheur / Researcher
http://KeaText.com





