For indexing e-mail, I recommend that you tokenise the e-mail addresses into fragments and query on the fragments as whole terms rather than using wildcards.
Rather than looking for fischauto333* in (say) smtp-from, look for fischauto333 in (say) an additional field called smtp-from-fragments to match the term fischauto333 (it also contains the terms yahoo.de and yahoo and de). Having whole e-mail addresses in terms and using prefix/wildcard queries inevitably results in too many clauses. -----Original Message----- From: Joe [mailto:[EMAIL PROTECTED] Sent: 09 March 2007 12:08 To: java-user@lucene.apache.org Subject: WilcardQuery and memory Hi, Here we use lucene to index our emails, currently 500.000 Documents. When Searching the body by a WildcardQuery the problems arises. I did some profiling with JProfiler. I see the more BooleanClause instances used the more memory is required during search. Most memory is used by instances of classes SegementTermDocs and CoumpundFileRader.CSIndexInput Nearly 30 MB/1000 BooleanClause instances. So when i limit the BooleanClauses by BooleanQuery.setMaxClauseCount(5000); and more and more emails gets indexed, will this lead to a OutOfMemoryError or will the 30MB/1000 Clauses ratio still hold? What should i do to prevent OOM without reducing BooleanQuery.setMaxClauseCount() to much? ___________________________________________________________ Der frühe Vogel fängt den Wurm. Hier gelangen Sie zum neuen Yahoo! Mail: http://mail.yahoo.de --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
smime.p7s
Description: S/MIME cryptographic signature