I had a look at the exception and can see that Hibernate Search is using a method called updateTopDocs to perform the search. This method uses a filter, which may add some extra memory requirements. Could you try creating a filter with the standalone Lucene version as well? Something like this before the search:

QueryFilter queryFilter = new QueryFilter(query);

and then pass the filter to the search.
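In full, something like this (a rough sketch, assuming the Lucene 2.x API your stack traces show; the index path and term are copied from your "lucky" program, and the class name LuckyFiltered is only for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class LuckyFiltered
{
    public static void main(String[] args) throws Exception
    {
        Query query = new TermQuery(new Term("value", "church"));
        IndexSearcher searcher =
            new IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");

        // Wrap the query in a filter, the way Hibernate Search does before searching.
        QueryFilter filter = new QueryFilter(query);

        // search(Query, Filter, n) returns TopDocs; as far as I can tell from the
        // QueryHits source linked below, this is roughly the call that
        // updateTopDocs() ends up making.
        TopDocs topDocs = searcher.search(query, filter, 100);
        System.out.println("hits = " + topDocs.totalHits);

        searcher.close();
    }
}

If this version also runs out of memory, the filter is a likely culprit: a filter's BitSet holds one bit per document, so on a 286-million-document index that alone is about 286,000,000 / 8 bytes ≈ 34 MB on top of the norms array.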
However, with such a huge index I think you must find a way to get some more memory for the Java heap.

Hibernate Search QueryHits source:
http://viewvc.jboss.org/cgi-bin/viewvc.cgi/hibernate/search/trunk/src/java/org/hibernate/search/query/QueryHits.java?view=markup&pathrev=15603

On Wed, Mar 11, 2009 at 4:31 PM, Michael McCandless <[email protected]> wrote:

> Unfortunately, I'm not familiar with exactly what Hibernate Search does
> with the Lucene APIs.
>
> It must be doing something beyond what your standalone Lucene test case
> does.
>
> Maybe ask this question on the Hibernate list?
>
> Mike
>
> [email protected] wrote:
>
>> Thanks for the advice.
>>
>> I haven't got around to profiling the code. Instead, I took your advice
>> and knocked Hibernate out of the equation with a small stand-alone
>> program that calls Lucene directly. I then wrote a similar stand-alone
>> program using Hibernate Search to do the same thing.
>>
>> On a small index both work fine:
>>
>> E:\>java -Xmx1200M -classpath .;lucene-core.jar lucky
>> hits = 29410
>>
>> E:\hibtest>java -Xmx1200m -classpath .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar hibtest
>> size = 29410
>>
>> Trying it on our huge index works for the straight Lucene version:
>>
>> E:\>java -Xmx1200M -classpath .;lucene-core.jar lucky
>> hits = 320500
>>
>> but fails for the Hibernate version:
>>
>> E:\hibtest>java -Xmx1200m -classpath .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar hibtest
>> Exception in thread "main" java.lang.OutOfMemoryError
>>   at java.io.RandomAccessFile.readBytes(Native Method)
>>   at java.io.RandomAccessFile.read(Unknown Source)
>>   at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
>>   at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
>>   at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
>>   at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
>>   at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
>>   at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:907)
>>   at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:352)
>>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>>   at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>>   at org.hibernate.search.query.QueryHits.updateTopDocs(QueryHits.java:100)
>>   at org.hibernate.search.query.QueryHits.<init>(QueryHits.java:61)
>>   at org.hibernate.search.query.FullTextQueryImpl.getQueryHits(FullTextQueryImpl.java:354)
>>   at org.hibernate.search.query.FullTextQueryImpl.getResultSize(FullTextQueryImpl.java:741)
>>   at hibtest.main(hibtest.java:45)
>>
>> E:\hibtest>
>>
>> I am not sure why this is occurring. Any ideas? I am calling
>> IndexSearcher.search() and so is Hibernate. Is Hibernate Search telling
>> Lucene to read the entire index into memory?
>>
>> Code for the Lucene version is:
>>
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.search.Hits;
>> import org.apache.lucene.search.IndexSearcher;
>> import org.apache.lucene.search.Query;
>> import org.apache.lucene.search.TermQuery;
>>
>> public class lucky
>> {
>>     public static void main(String[] args)
>>     {
>>         try
>>         {
>>             // simple TermQuery run directly against the index on disk
>>             Term term = new Term("value", "church");
>>             Query query = new TermQuery(term);
>>             IndexSearcher searcher =
>>                 new IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
>>             Hits hits = searcher.search(query);
>>
>>             System.out.println("hits = " + hits.length());
>>
>>             searcher.close();
>>         }
>>         catch (Exception e)
>>         {
>>             e.printStackTrace();
>>         }
>>     }
>> }
>>
>> and for the Hibernate Search version:
>>
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.search.TermQuery;
>> import org.hibernate.Session;
>> import org.hibernate.search.FullTextSession;
>> import org.hibernate.search.Search;
>>
>> public class hibtest {
>>
>>     public static void main(String[] args) {
>>         hibtest mgr = new hibtest();
>>
>>         Session session = HibernateUtil.getSessionFactory().getCurrentSession();
>>         session.beginTransaction();
>>
>>         FullTextSession fullTextSession = Search.getFullTextSession(session);
>>         TermQuery luceneQuery = new TermQuery(new Term("value", "church"));
>>
>>         org.hibernate.search.FullTextQuery fullTextQuery =
>>             fullTextSession.createFullTextQuery( luceneQuery, MarcText.class );
>>
>>         // getResultSize() is the call that ends up in
>>         // QueryHits.updateTopDocs() in the stack trace above
>>         long resultSize = fullTextQuery.getResultSize(); // this is line 45
>>
>>         System.out.println("size = " + resultSize);
>>
>>         session.getTransaction().commit();
>>
>>         HibernateUtil.getSessionFactory().close();
>>     }
>> }
>>
>> Quoting Michael McCandless <[email protected]>:
>>
>>> At this point, I'd recommend running with a memory profiler, e.g.
>>> YourKit, and posting the resulting output.
>>>
>>> With norms only on one field, no deletions, and no field sorting, I
>>> can't see why you're running out of memory.
>>>
>>> If you take Hibernate out of the picture, and simply open an
>>> IndexSearcher on the underlying index, do you still hit OOM?
>>>
>>> Can you post the output of CheckIndex? You can run it from the command
>>> line:
>>>
>>> java org.apache.lucene.index.CheckIndex <pathToIndex>
>>>
>>> (Without -fix, CheckIndex will make no changes to the index, but it's
>>> best to do this on a copy of the index to be supremely safe.)
>>>
>>> Mike
>>>
>>> [email protected] wrote:
>>>
>>>> Thanks Michael,
>>>>
>>>> There is no sorting on the result (adding a sort causes OOM well
>>>> before the point it runs out for the default).
>>>>
>>>> There are no deleted docs - the index was created from a set of docs
>>>> and no adds or deletes have taken place.
>>>>
>>>> Memory isn't being consumed elsewhere in the system. It all comes
>>>> down to the Lucene call via Hibernate Search. We decided to split our
>>>> huge index into a set of several smaller indexes. Like the original
>>>> single index, each smaller index has one field which is tokenized and
>>>> the other fields have NO_NORMS set.
>>>>
>>>> The following, explicitly specifying just one index, works fine:
>>>>
>>>> org.hibernate.search.FullTextQuery fullTextQuery =
>>>>     fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class );
>>>>
>>>> But as soon as we start adding further indexes:
>>>>
>>>> org.hibernate.search.FullTextQuery fullTextQuery =
>>>>     fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class, MarcText8.class );
>>>>
>>>> we start running into OOM.
>>>>
>>>> In our case the MarcText2 index has a total disk size of 5Gb (with
>>>> 57589069 documents / 75491779 terms) and MarcText8 has a total size of
>>>> 6.46Gb (with 79339982 documents / 104943977 terms).
>>>>
>>>> Adding all 8 indexes (the same as our original single index), either
>>>> by explicitly naming them or just with:
>>>>
>>>> org.hibernate.search.FullTextQuery fullTextQuery =
>>>>     fullTextSession.createFullTextQuery( outerLuceneQuery );
>>>>
>>>> results in it becoming completely unusable.
>>>>
>>>> One thing I am not sure about: in Luke, an index (neither of the
>>>> indexes mentioned above) that was created with NO_NORMS set on all
>>>> the fields reports:
>>>>
>>>> "Index functionality: lock-less, single norms, shared doc store,
>>>> checksum, del count, omitTf"
>>>>
>>>> Is this correct? I am not sure what it means by "single norms" - I
>>>> would have expected it to say "no norms".
>>>>
>>>> Any further ideas on where to go from here? Your estimate of what is
>>>> loaded into memory suggests that we shouldn't really be anywhere near
>>>> running out of memory with indexes of this size!
>>>>
>>>> As I said in my OP, Luke also gets a heap error searching our original
>>>> single large index, which makes me wonder if it is a problem with the
>>>> construction of the index.
>>>>
>>>> Quoting Michael McCandless <[email protected]>:
>>>>
>>>>> Lucene is trying to allocate the contiguous norms array for your
>>>>> index, which should be ~273 MB (one byte per document:
>>>>> 286,000,000 / 1024 / 1024 ≈ 273 MB), when it hits the OOM.
>>>>>
>>>>> Is your search sorting by field value? (Which'd also consume memory.)
>>>>> Or is it just the default (by relevance) sort?
>>>>>
>>>>> The only other biggish consumer of memory should be the deleted docs,
>>>>> but that's a BitVector so it should need ~34 MB RAM.
>>>>>
>>>>> Can you run a memory profiler to see what else is consuming RAM?
>>>>>
>>>>> Mike
>>>>>
>>>>> [email protected] wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am using Lucene via Hibernate Search, but the following problem is
>>>>>> also seen using Luke. I'd appreciate any suggestions for solving
>>>>>> this problem.
>>>>>>
>>>>>> I have a Lucene index (27Gb in size) that indexes a database table
>>>>>> of 286 million rows. While Lucene was able to perform this indexing
>>>>>> just fine (albeit very slowly), using the index has proved to be
>>>>>> impossible. Any searches conducted on it, either from my Hibernate
>>>>>> Search query or by placing the query into Luke, give:
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
>>>>>>   at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>>>>>>   at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
>>>>>>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>>>>>>   ...
>>>>>>
>>>>>> The queries are simple, of the form:
>>>>>>
>>>>>> (+value:church +marcField:245 +subField:a)
>>>>>>
>>>>>> which in this example should only return a few thousand results.
>>>>>>
>>>>>> The JVM is already running with the maximum heap space allowed for
>>>>>> the Java executable on Windows XP (java -Xms1200m -Xmx1200m).
>>>>>>
>>>>>> The Lucene index was created using the following Hibernate Search
>>>>>> annotations:
>>>>>>
>>>>>> @Column
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
>>>>>> private Integer marcField;
>>>>>>
>>>>>> @Column(length = 2)
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
>>>>>> private String subField;
>>>>>>
>>>>>> @Column(length = 2)
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
>>>>>> private String indicator1;
>>>>>>
>>>>>> @Column(length = 2)
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
>>>>>> private String indicator2;
>>>>>>
>>>>>> @Column(length = 10000)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.TOKENIZED, store=Store.NO)
>>>>>> private String value;
>>>>>>
>>>>>> @Column
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
>>>>>> private Integer recordId;
>>>>>>
>>>>>> So all of the fields have NO_NORMS except for "value", which
>>>>>> contains description text that needs to be tokenised.
>>>>>>
>>>>>> Is there any way around this? Does Lucene really have such a low
>>>>>> limit on how much data it can search (I consider 286 million
>>>>>> documents to be pretty small beer - we were hoping to index a table
>>>>>> of over a billion rows)? Or is there something I'm missing?
>>>>>>
>>>>>> Thanks.

--
Jokin
