Hi Grant,

I tried with IndexReader and got around a 2x boost in speed, i.e. around 200K lookups/s, compared to the HashMap's 600K lookups/s. I can't seem to reuse the Term object, which is a major bottleneck. Also, TermPositions wasn't able to give me the doc id; it did give the payload in the form of a byte array, which I have no idea how to decipher, so I stuck with TermDocs instead.
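In case it helps, here is a minimal sketch (not from the thread) of decoding such a payload back into an int with TermPositions, assuming the id was written as four big-endian bytes at index time, which is the layout contrib's PayloadHelper.encodeInt produces:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

// Looks up a term and decodes its first payload as a big-endian int.
public static int lookupViaPayload(IndexReader reader, Term t) throws IOException {
  TermPositions tp = reader.termPositions(t);
  try {
    if (!tp.next()) {
      return -1;                      // no document contains this term
    }
    tp.nextPosition();                // must advance to a position before reading the payload
    if (!tp.isPayloadAvailable()) {
      return -1;                      // term was indexed without a payload
    }
    byte[] b = tp.getPayload(new byte[4], 0);
    // Reassemble four big-endian bytes (the same thing PayloadHelper.decodeInt does).
    return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
         | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
  } finally {
    tp.close();
  }
}

Note that tp.doc() is also valid once tp.next() has returned true, so the same enumeration gives both the doc id and the payload.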
Here is the code:

private static final String WORD = "word";
private IndexWriter iwriter;
private IndexReader ireader;
private RAMDirectory idx = new RAMDirectory();
private Analyzer analyzer = new KeywordAnalyzer();
private Document doc = new Document();
private Field wordField = new Field(WORD, "", Field.Store.NO,
    Field.Index.NOT_ANALYZED_NO_NORMS);
private Term queryTerm = new Term(WORD, "");

public void readyForImport() throws Exception {
  // NoDeletionPolicy here is a custom IndexDeletionPolicy that keeps all commits.
  this.iwriter = new IndexWriter(idx, analyzer, true, new NoDeletionPolicy(),
      IndexWriter.MaxFieldLength.LIMITED);
  this.iwriter.setMaxFieldLength(200);
  this.iwriter.setMaxMergeDocs(10000000);
  this.iwriter.setUseCompoundFile(false);
  doc.add(wordField); // single reusable document/field pair
}

public void destroy() throws Exception {
  ireader.close();
  iwriter.close();
}

public void readyForRead() throws Exception {
  iwriter.optimize();
  iwriter.close();
  this.ireader = IndexReader.open(idx, true); // read-only reader
}

public void addToDictionary(String word, int id) throws IOException {
  if (id < 0) throw new IllegalArgumentException("ID cannot be negative");
  // The id itself is never stored: the Lucene doc id stands in for it,
  // so words must be added in increasing id order with no gaps.
  wordField.setValue(word);
  iwriter.addDocument(doc);
}

public int get(String word) throws IOException {
  Term t = queryTerm.createTerm(word); // reuses the interned field name
  TermDocs docs = ireader.termDocs(t);
  try {
    if (!docs.next()) return -1;
    return docs.doc(); // the doc id is the dictionary id
  } finally {
    docs.close(); // release the enumeration
  }
}

On Sun, Jan 17, 2010 at 6:36 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> On the indexing side, add in batches and reuse the document and fields.
>>
> Done. Squeezed out 5 secs there, from 30 down to 25, and further to 22
> by increasing max merge docs.
>
>> On the search side, no need for a BooleanQuery and no need for scoring,
>> so you will likely want your own Collector (dead simple to write).
>>
> Brought it down to 15 secs from 30 for 1 mil lookups, using a TermQuery
> and a Collector that is instantiated once (see the sketch below the
> quoted exchange).
>
>> It _MAY_ even be faster to simply do the indexing as a word w/ the id
>> as a payload and then use TermPositions (and no query at all) and forgo
>> searching altogether. Then you just need an IndexReader. First search
>> will always be slow, unless you "warm" it first. This should help avoid
>> the cost of going to document storage, which is almost always the most
>> expensive thing one does in Lucene due to its random nature. Might even
>> be beneficial to be able to retrieve IDs in batches (sorted
>> lexicographically, too).
>
> Since all the words have unique ids, I don't think there is any need for
> assigning ids; I will re-use the Lucene document id. Testing shows that
> this decreased index time to 13 sec and lookup time to 11 sec.
>
> But I still don't get the "not searching" part. Will take a look at
> TermPositions and how it's done.
>
>> Don't get me wrong, it will likely be slower than a hash map, but the
>> hash map won't scale, and the Lucene term dictionary is delta encoded,
>> so it will compress a fair amount. Also, as you grow, you will need to
>> use an FSDirectory.
>
> I still haven't seen the size diff for what I was doing previously, but
> after I removed the ID field I get a 1/3 saving (220MB) for a 5 million
> word dictionary as compared to a HashMap.
>
> With 5 mil words and 10 mil lookups, the HashMap is 4x faster in add and
> 6x faster in lookup. The in-memory Lucene dict gives around 100K lookups
> per second, which is like 1MB/s for 10-byte tokens, a bit far from the
> 50MB/s disk speed limit. Then again, it only needs to match the speed at
> which the Lucene Analyzer processes tokens.
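For concreteness, here is a minimal sketch of the kind of scoring-free Collector referred to above, against the Lucene 2.9/3.0 Collector API (the class name is illustrative, not from the thread):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Records the first (and, for a dictionary, only) matching doc id; no scoring.
public class FirstDocCollector extends Collector {
  private int docBase;
  private int found = -1;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    // Scores are irrelevant for a dictionary lookup, so the Scorer is ignored.
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.docBase = docBase; // offset of this segment within the composite reader
  }

  @Override
  public void collect(int doc) throws IOException {
    if (found == -1) {
      found = docBase + doc; // remember the first hit as an absolute doc id
    }
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // order is irrelevant when at most one doc matches
  }

  public int getFound() { return found; }

  public void reset() { found = -1; } // lets one instance serve many lookups
}

Usage would then be isearcher.search(new TermQuery(t), collector) with a reset() between lookups, skipping TopDocs and score computation entirely.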
>
>> -Grant
>>
>> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>>
>>> Here is my attempt at making a dictionary lookup using Lucene. Need
>>> some pointers on optimising. Currently it takes 30 secs for a million
>>> lookups using a dictionary of 500K words, about 30x that of a HashMap.
>>> But the space used is almost the same, as far as I can see from the
>>> memory sizes in the process manager.
>>>
>>> private static final String ID = "id";
>>> private static final String WORD = "word";
>>> private IndexWriter iwriter;
>>> private IndexSearcher isearcher;
>>> private RAMDirectory idx = new RAMDirectory();
>>> private Analyzer analyzer = new WhitespaceAnalyzer();
>>>
>>> public void init() throws Exception {
>>>   this.iwriter = new IndexWriter(idx, analyzer, true,
>>>       IndexWriter.MaxFieldLength.LIMITED);
>>> }
>>>
>>> public void destroy() throws Exception {
>>>   iwriter.close();
>>>   isearcher.close();
>>> }
>>>
>>> public void ready() throws Exception {
>>>   iwriter.optimize();
>>>   iwriter.close();
>>>   this.isearcher = new IndexSearcher(idx, true);
>>> }
>>>
>>> public void addToDictionary(String word, Integer id) throws IOException {
>>>   Document doc = new Document();
>>>   doc.add(new Field(WORD, word, Field.Store.NO,
>>>       Field.Index.NOT_ANALYZED));
>>>   doc.add(new Field(ID, id.toString(), Store.YES,
>>>       Field.Index.NOT_ANALYZED));
>>>   // ?? Is there a way other than storing the id as a string?
>>>   iwriter.addDocument(doc);
>>> }
>>>
>>> public Integer get(String word) throws IOException, ParseException {
>>>   BooleanQuery query = new BooleanQuery();
>>>   query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>>>   TopDocs top = isearcher.search(query, null, 1);
>>>   ScoreDoc[] hits = top.scoreDocs;
>>>   if (hits.length == 0) return null;
>>>   return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
>>> }
>>>
>>> On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>
>>>> A Lucene index, w/ no storage, positions, etc. (optionally) turned
>>>> off, will be very efficient. Plus, there is virtually no code to
>>>> write. I've seen bare-bones indexes be as little as 20% of the
>>>> original w/ very fast lookup. Furthermore, there are many options
>>>> available for controlling how much is loaded into memory, etc.
>>>> Finally, it will handle all the languages you throw at it.
>>>>
>>>> -Grant
>>>>
>>>> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>>>>
>>>>> Currently Java strings use double the space of the characters in
>>>>> them, because it's all UTF-16. A 190MB dictionary file therefore
>>>>> uses around 600MB when loaded into a HashMap<String, Integer>. Is
>>>>> there some optimization we could do in terms of storing them, while
>>>>> ensuring that Chinese, Devanagari and other characters don't get
>>>>> messed up in the process?
>>>>>
>>>>> Some options Benson suggested were: storing just the byte[] form and
>>>>> adding the option of supplying the hash function in
>>>>> OpenObjectIntHashMap, or even using a UTF-8 string.
>>>>>
>>>>> Or we could leave this alone. I currently estimate the memory
>>>>> requirement for strings using the formula
>>>>> 8 * ((int) (num_chars * 2 + 45) / 8)
>>>>> when generating the dictionary split for the vectorizer.
>>>>>
>>>>> Robin
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
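Circling back to Benson's byte[] suggestion quoted above, here is a rough sketch of how a UTF-8 byte[] key could be used with Mahout's OpenObjectIntHashMap; the Utf8Key wrapper is illustrative and not from the thread (a raw byte[] hashes by identity, so the wrapper supplies content-based equals/hashCode):

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

import org.apache.mahout.math.map.OpenObjectIntHashMap;

// Wraps the UTF-8 bytes of a word so they can serve as a hash key.
final class Utf8Key {
  private final byte[] bytes;

  Utf8Key(String word) throws UnsupportedEncodingException {
    this.bytes = word.getBytes("UTF-8"); // roughly half of UTF-16 for Latin-heavy text
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) o).bytes);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(bytes);
  }
}

Usage, assuming Mahout's map API:

  OpenObjectIntHashMap<Utf8Key> dict = new OpenObjectIntHashMap<Utf8Key>();
  dict.put(new Utf8Key("word"), 42);
  int id = dict.get(new Utf8Key("word"));

This keeps the multilingual safety asked about above, since UTF-8 round-trips Chinese and Devanagari losslessly, while roughly halving the per-character cost for ASCII-dominated dictionaries.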