On the indexing side, add documents in batches and reuse the Document and Field instances. On the search side, there is no need for a BooleanQuery and no need for scoring, so you will likely want your own Collector (dead simple to write).
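[A bare-bones Collector for this kind of lookup might look like the sketch below. It is written against the Lucene 2.9-era API this thread uses; FirstDocCollector is an illustrative name, not something from the thread.]

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;

  // Records only the first matching doc id; scoring is skipped entirely.
  public class FirstDocCollector extends Collector {

    private int docBase;
    private int firstDoc = -1;

    @Override
    public void setScorer(Scorer scorer) {
      // No-op: a dictionary lookup never needs scores.
    }

    @Override
    public void collect(int doc) {
      if (firstDoc == -1) {
        firstDoc = docBase + doc;  // rebase to a top-level doc id
      }
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
      return true;  // order is irrelevant when keeping a single hit
    }

    public int getFirstDoc() {
      return firstDoc;
    }
  }

[get(word) could then call isearcher.search(new TermQuery(new Term(WORD, word)), collector) and, when collector.getFirstDoc() is non-negative, read the stored id field from that document.]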
It _MAY_ even be faster to simply index each word as a single term w/ the id as a payload and then use TermPositions (and no query at all), forgoing searching altogether; then you just need an IndexReader (a sketch of this payload approach appears after the thread below). The first search will always be slow unless you "warm" it first. This should help avoid the cost of going to document storage, which is almost always the most expensive thing one does in Lucene due to its random-access nature. It might even be beneficial to retrieve IDs in batches (sorted lexicographically, too). Don't get me wrong, it will likely be slower than a hash map, but the hash map won't scale, and the Lucene term dictionary is delta encoded, so it will compress a fair amount. Also, as you grow, you will need to use an FSDirectory.

-Grant

On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:

> Here is my attempt at making a dictionary lookup using Lucene. I need some
> pointers on optimising it. Currently it takes 30 secs for a million lookups
> using a dictionary of 500K words, about 30x that of a HashMap. But the space
> used looks almost the same, as far as I can tell from the memory sizes in
> the process manager.
>
>   private static final String ID = "id";
>   private static final String WORD = "word";
>
>   private IndexWriter iwriter;
>   private IndexSearcher isearcher;
>   private RAMDirectory idx = new RAMDirectory();
>   private Analyzer analyzer = new WhitespaceAnalyzer();
>
>   public void init() throws Exception {
>     this.iwriter =
>         new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
>   }
>
>   public void destroy() throws Exception {
>     iwriter.close();
>     isearcher.close();
>   }
>
>   public void ready() throws Exception {
>     iwriter.optimize();
>     iwriter.close();
>     this.isearcher = new IndexSearcher(idx, true);
>   }
>
>   public void addToDictionary(String word, Integer id) throws IOException {
>     Document doc = new Document();
>     doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
>     doc.add(new Field(ID, id.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
>     // ?? Is there a way other than storing the id as a string?
>     iwriter.addDocument(doc);
>   }
>
>   public Integer get(String word) throws IOException, ParseException {
>     BooleanQuery query = new BooleanQuery();
>     query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>     TopDocs top = isearcher.search(query, null, 1);
>     ScoreDoc[] hits = top.scoreDocs;
>     if (hits.length == 0) return null;
>     return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
>   }
>
> On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
>> be very efficient. Plus, there is virtually no code to write. I've seen
>> bare-bones indexes be as little as 20% of the original, w/ very fast lookup.
>> Furthermore, there are many options available for controlling how much is
>> loaded into memory, etc. Finally, it will handle all the languages you
>> throw at it.
>>
>> -Grant
>>
>> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>>
>>> Currently Java strings use double the space of the characters in them
>>> because it's all UTF-16. A 190MB dictionary file therefore uses around
>>> 600MB when loaded into a HashMap<String, Integer>. Is there some
>>> optimization we could do in terms of storing them while ensuring that
>>> Chinese, Devanagari, and other characters don't get messed up in the
>>> process?
>>>
>>> Some options Benson suggested were: storing just the byte[] form and
>>> adding the option of supplying the hash function in OpenObjectIntHashMap,
>>> or even using a UTF-8 string.
>>>
>>> Or we could leave this alone. I currently estimate the memory requirement
>>> using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
>>> generating the dictionary split for the vectorizer.
>>>
>>> Robin
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
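[As referenced above, here is a minimal sketch of the payload approach Grant describes, again assuming the Lucene 2.9-era API; PayloadDictionary and SingleTokenStream are illustrative names. Each entry is indexed as one single-token document whose token carries the id as a four-byte payload; lookup walks the term's positions directly through an IndexReader, so no Query, Searcher, or stored field is involved.]

  import java.io.IOException;

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Payload;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  public class PayloadDictionary {

    private static final String WORD = "word";

    // Index side: one document per entry, nothing stored; the id rides
    // along as a 4-byte payload on the single token.
    public void add(IndexWriter writer, String word, int id) throws IOException {
      Document doc = new Document();
      doc.add(new Field(WORD, new SingleTokenStream(word, id)));
      writer.addDocument(doc);
    }

    // Lookup side: go straight to the term's positions and read the
    // payload back; no query and no document retrieval.
    public Integer get(IndexReader reader, String word) throws IOException {
      TermPositions tp = reader.termPositions(new Term(WORD, word));
      try {
        if (!tp.next()) {
          return null;                  // word is not in the dictionary
        }
        tp.nextPosition();              // must advance before reading the payload
        if (!tp.isPayloadAvailable()) {
          return null;
        }
        byte[] b = tp.getPayload(new byte[4], 0);
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
            | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
      } finally {
        tp.close();
      }
    }

    // Emits exactly one token per document, carrying the encoded id.
    private static final class SingleTokenStream extends TokenStream {
      private final TermAttribute termAtt = addAttribute(TermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
      private final String word;
      private final byte[] payload;
      private boolean emitted = false;

      SingleTokenStream(String word, int id) {
        this.word = word;
        this.payload = new byte[] {
            (byte) (id >>> 24), (byte) (id >>> 16), (byte) (id >>> 8), (byte) id };
      }

      @Override
      public boolean incrementToken() {
        if (emitted) {
          return false;
        }
        emitted = true;
        termAtt.setTermBuffer(word);
        payloadAtt.setPayload(new Payload(payload));
        return true;
      }
    }
  }

[This keeps every lookup inside the term dictionary and positions files, which is how it avoids the random-access cost of stored-field retrieval that Grant mentions.]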