At this point, I'd recommend running with a memory profiler, e.g.
YourKit, and posting the resulting output.
With norms only on one field, no deletions, and no field sorting, I
can't see why you're running out of memory.
If you take Hibernate out of the picture, and simply open an
IndexSearcher on the underlying index, do you still hit OOM?
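A minimal sketch of that test, assuming the Lucene 2.x API (the version in use
at the time of this thread) and a placeholder index path -- adjust both to your
setup:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class DirectSearchTest {
    public static void main(String[] args) throws Exception {
        // Open the on-disk index directly, bypassing Hibernate Search
        // entirely ("/path/to/index" is a placeholder).
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        // A simple single-term query against the tokenized field.
        TermQuery query = new TermQuery(new Term("value", "church"));
        TopDocs hits = searcher.search(query, null, 10);
        System.out.println("total hits: " + hits.totalHits);
        searcher.close();
    }
}
```

If this plain search also hits OOM, Hibernate Search is ruled out and the
problem lies in the index or Lucene configuration itself.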
Can you post the output of CheckIndex? You can run it from the
command line:
java org.apache.lucene.index.CheckIndex <pathToIndex>
(Without -fix, CheckIndex makes no changes to the index, but it's
best to run it on a copy of the index to be supremely safe.)
Mike
[email protected] wrote:
Thanks Michael,
There is no sorting on the result (adding a sort causes OOM well
before the point it runs out for the default).
There are no deleted docs - the index was created from a set of docs
and no adds or deletes have taken place.
Memory isn't being consumed elsewhere in the system. It all comes
down to the Lucene call via Hibernate Search. We decided to split
our huge index into a set of several smaller indexes. Like the
original single index, each smaller index has one field which is
tokenized and the other fields have NO_NORMS set.
The following, explicitly specifying just one index, works fine:
org.hibernate.search.FullTextQuery fullTextQuery =
    fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class );
But as soon as we start adding further indexes:
org.hibernate.search.FullTextQuery fullTextQuery =
    fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class, MarcText8.class );
We start running into OOM.
In our case the MarcText2 index has a total disk size of 5 GB (with
57589069 documents / 75491779 terms) and MarcText8 has a total size
of 6.46 GB (with 79339982 documents / 104943977 terms).
Adding all 8 indexes (the same as our original single index), either
by explicitly naming them or just with:
org.hibernate.search.FullTextQuery fullTextQuery =
    fullTextSession.createFullTextQuery( outerLuceneQuery );
results in it becoming completely unusable.
One thing I am not sure about is that in Luke it tells me for an
index (neither of the indexes mentioned above) that was created with
NO_NORMS set on all the fields:
"Index functionality: lock-less, single norms, shared doc store,
checksum, del count, omitTf"
Is this correct? I am not sure what it means by "single norms" - I
would have expected it to say "no norms".
Any further ideas on where to go from here? Your estimate of what is
loaded into memory suggests that we shouldn't really be anywhere
near running out of memory with these size indexes!
As I said in my OP, Luke also gets a heap error on searching our
original single large index which makes me wonder if it is a problem
with the construction of the index.
Quoting Michael McCandless <[email protected]>:
Lucene is trying to allocate the contiguous norms array for your
index, which should be ~273 MB (286 million docs x 1 byte / 1024 /
1024), when it hits the OOM.
Is your search sorting by field value? (That would also consume
memory.) Or is it just the default (by relevance) sort?
The only other biggish consumer of memory should be the deleted docs,
but that's a BitVector so it should need ~34 MB RAM.
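Those two estimates follow from simple arithmetic: norms cost one byte per
document per normed field, and the deleted-docs BitVector costs one bit per
document. A quick sketch, using the 286 million document count from this
thread:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long numDocs = 286_000_000L;
        // Norms: one byte per document for the single field with norms,
        // allocated as one contiguous array.
        double normsMB = numDocs * 1.0 / 1024 / 1024;
        // Deleted docs: a BitVector, one bit per document.
        double delMB = numDocs / 8.0 / 1024 / 1024;
        System.out.printf("norms: ~%.0f MB%n", normsMB);     // ~273 MB
        System.out.printf("deletions: ~%.0f MB%n", delMB);   // ~34 MB
    }
}
```

Together that is ~307 MB, well under the 1200 MB heap, which is why something
else must be consuming the rest.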
Can you run a memory profiler to see what else is consuming RAM?
Mike
[email protected] wrote:
Hello,
I am using Lucene via Hibernate Search but the following problem
is also seen using Luke. I'd appreciate any suggestions for
solving this problem.
I have a Lucene index (27 GB in size) that indexes a database
table of 286 million rows. While Lucene was able to perform this
indexing just fine (albeit very slowly), using the index has
proved to be impossible. Any searches conducted on it, either
from my Hibernate Search query or by placing the query into Luke,
give:
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
...
The type of queries are simple, of the form:
(+value:church +marcField:245 +subField:a)
which in this example should only return a few thousand results.
The JVM is already running with the maximum heap allowed for the
Java executable on Windows XP (java -Xms1200m -Xmx1200m).
The Lucene index was created using the following Hibernate Search
annotations:
@Column
@Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
@Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
store=Store.NO)
private Integer marcField;
@Column (length = 2)
@Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
@Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
store=Store.NO)
private String subField;
@Column(length = 2)
@Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
@Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
store=Store.NO)
private String indicator1;
@Column(length = 2)
@Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
@Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
store=Store.NO)
private String indicator2;
@Column(length = 10000)
@Field(index=org.hibernate.search.annotations.Index.TOKENIZED,
store=Store.NO)
private String value;
@Column
@Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
@Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
store=Store.NO)
private Integer recordId;
So all of the fields have NO_NORMS except for "value", which
contains description text that needs to be tokenised.
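For reference, a sketch of roughly what that mapping translates to at the
Lucene level, assuming the Lucene 2.x Field API (field names taken from the
annotations above; the values are invented placeholders):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MarcDocSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        // NO_NORMS fields: indexed without norms, not stored.
        doc.add(new Field("marcField", "245",
                Field.Store.NO, Field.Index.NO_NORMS));
        doc.add(new Field("subField", "a",
                Field.Store.NO, Field.Index.NO_NORMS));
        // TOKENIZED field: analyzed, and the only one carrying norms --
        // so only this field should contribute a norms array at search time.
        doc.add(new Field("value", "some description text",
                Field.Store.NO, Field.Index.TOKENIZED));
    }
}
```

With only "value" carrying norms, only one ~273 MB norms array should be
allocated per index, which is what makes the OOM surprising.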
Is there any way around this? Does Lucene really have such a low
limit for how much data it can search (and I consider 286 million
documents to be pretty small beer - we were hoping to index a
table of over a billion rows)? Or is there something I'm missing?
Thanks.