Hi Grant,
I tried with IndexReader and got around a 2x boost in speed, i.e. around 200K
lookups/s, compared to the HashMap's 600K lookups/s.
I can't seem to reuse the Term object, which is a major bottleneck. Also,
TermPositions wasn't able to give me the doc id; it did give the payload in
the form of a byte array, which I have no idea how to decode, so I stuck with
TermDocs instead.
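
For what it's worth, if the id were written as a 4-byte big-endian int
payload (that encoding is just my assumption), a sketch of reading it back
via TermPositions might look like this:

  // Sketch only: assumes a 4-byte big-endian int payload on each term.
  public int getViaPayload(String word) throws IOException {
    TermPositions tp = ireader.termPositions(queryTerm.createTerm(word));
    try {
      if (!tp.next()) return -1;
      tp.nextPosition();                        // payloads live on positions
      byte[] buf = tp.getPayload(new byte[4], 0);
      return ((buf[0] & 0xFF) << 24) | ((buf[1] & 0xFF) << 16)
           | ((buf[2] & 0xFF) << 8) | (buf[3] & 0xFF);
    } finally {
      tp.close();
    }
  }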

Here is the code:

  private static final String WORD = "word";
  private IndexWriter iwriter;
  private IndexReader ireader;
  private RAMDirectory idx = new RAMDirectory();
  private Analyzer analyzer = new KeywordAnalyzer();
  private Document doc = new Document();
  private Field wordField = new Field(WORD, "", Field.Store.NO,
      Field.Index.NOT_ANALYZED_NO_NORMS);
  private Term queryTerm = new Term(WORD, "");

  public void readyForImport() throws Exception {
    this.iwriter =
        new IndexWriter(idx, analyzer, true, new NoDeletionPolicy(),
            IndexWriter.MaxFieldLength.LIMITED);
    this.iwriter.setMaxFieldLength(200);
    this.iwriter.setMaxMergeDocs(10000000);
    this.iwriter.setUseCompoundFile(false);
    doc.add(wordField);
  }

  public void destroy() throws Exception {
    ireader.close();
    iwriter.close();
  }

  public void readyForRead() throws Exception {
    iwriter.optimize();
    iwriter.close();
    this.ireader = IndexReader.open(idx, true);
  }

  public void addToDictionary(String word, int id) throws IOException {
    if (id < 0) throw new IllegalArgumentException("ID cannot be negative");
    // Note: id itself is never indexed; after optimize() with no deletions,
    // the Lucene doc id equals the insertion order and serves as the id.
    wordField.setValue(word);
    iwriter.addDocument(doc);
  }

  public int get(String word) throws IOException {
    Term t = queryTerm.createTerm(word); // reuses the interned field name
    TermDocs docs = ireader.termDocs(t);
    try {
      return docs.next() ? docs.doc() : -1;
    } finally {
      docs.close(); // TermDocs holds resources; don't leak it
    }
  }
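
Hypothetical usage of the above, for clarity (the wrapping class name
LuceneDictionary is made up; ids are simply Lucene doc ids in insertion
order):

  LuceneDictionary dict = new LuceneDictionary();
  dict.readyForImport();
  dict.addToDictionary("hello", 0);
  dict.addToDictionary("world", 1);
  dict.readyForRead();           // optimize, close writer, open reader
  int id = dict.get("world");    // -> 1
  dict.destroy();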

On Sun, Jan 17, 2010 at 6:36 AM, Robin Anil <robin.a...@gmail.com> wrote:

>
> On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> On the indexing side, add in batches and reuse the document and fields.
>>
> Done. Squeezed out 5 secs there, down to 25 from 30, and further to 22 by
> increasing max merge docs.
>
>>
>> On the search side, no need for a BooleanQuery and no need for scoring, so
>> you will likely want your own Collector (dead simple to write).
>>
> Brought it down to 15 secs from 30 for 1 mil lookups, using a TermQuery and
> a Collector that is instantiated once.
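>
> A minimal sketch of such a collector (against the Lucene 2.9 Collector API;
> the class name is my own):
>
>   class FirstDocCollector extends Collector {
>     int docId = -1;            // doc id of the single hit, -1 if none
>     private int docBase;
>     public void setScorer(Scorer scorer) {}   // scoring not needed
>     public void setNextReader(IndexReader reader, int docBase) {
>       this.docBase = docBase;
>     }
>     public void collect(int doc) { docId = docBase + doc; }
>     public boolean acceptsDocsOutOfOrder() { return true; }
>   }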
>
>
>>
>> It _MAY_ even be faster to simply do the indexing as a word w/ the id as a
>> payload and then use TermPositions (and no query at all), forgoing searching
>> altogether.  Then you just need an IndexReader.  First search will always
>> be slow, unless you "warm" it first.  This should help avoid the cost of
>> going to document storage, which is almost always the most expensive thing
>> one does in Lucene due to its random nature.  Might even be beneficial to be
>> able to retrieve IDs in batches (sorted lexicographically, too).
>>
>
> Since all the words have unique ids, I don't think there is any need for
> assigning ids explicitly; I will re-use the Lucene document id.
> Testing shows that this decreased index time to 13 sec and lookup time to
> 11 sec.
>
> But I still don't get the "not searching" part. I will take a look at
> TermPositions and how it's done.
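>
> If I understand the payload idea right, the indexing side would attach the
> id to the term via a single-token TokenStream, something like this sketch
> (Lucene 2.9 attribute API; the class name and 4-byte encoding are mine):
>
>   // Sketch: single-token stream carrying the id as a 4-byte payload.
>   final class WordWithIdStream extends TokenStream {
>     private final TermAttribute termAtt = addAttribute(TermAttribute.class);
>     private final PayloadAttribute payAtt =
>         addAttribute(PayloadAttribute.class);
>     private final String word;
>     private final int id;
>     private boolean done;
>     WordWithIdStream(String word, int id) { this.word = word; this.id = id; }
>     public boolean incrementToken() {
>       if (done) return false;
>       done = true;
>       termAtt.setTermBuffer(word);            // the word is the only token
>       payAtt.setPayload(new Payload(new byte[] {
>           (byte) (id >>> 24), (byte) (id >>> 16),
>           (byte) (id >>> 8), (byte) id }));
>       return true;
>     }
>   }
>
>   // then: doc.add(new Field(WORD, new WordWithIdStream(word, id)));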
>
>>
>> Don't get me wrong, it will likely be slower than a hash map, but the hash
>> map won't scale and the Lucene term dictionary is delta encoded, so it will
>> compress a fair amount.  Also, as you grow, you will need to use an
>> FSDirectory.
>
> I still haven't seen the size diff for what I was doing previously. But
> after I removed the ID field I get 1/3 savings (220MB) for a 5 million word
> dictionary as compared to a HashMap.
>
> With 5 mil words and 10 mil lookups:
> HashMap is 4x faster in add and 6x faster in lookup.
> The in-memory Lucene dict gives around 100K lookups per second, which is
> about 1MB/s for 10-byte tokens (100K lookups/s x 10 bytes), a fair way off
> the 50MB/s disk speed limit. Then again, it just needs to match the speed
> at which the Lucene Analyzer processes tokens.
>
>
>
>
>
>
>> -Grant
>>
>> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>>
>> > Here is my attempt at making a dictionary lookup using Lucene. I need
>> > some pointers on optimising. Currently it takes 30 secs for a million
>> > lookups using a dictionary of 500K words, about 30x that of a HashMap.
>> > But the space used is almost the same, as far as I can see from the
>> > memory sizes (in the process manager).
>> >
>> >
>> > private static final String ID = "id";
>> >  private static final String WORD = "word";
>> >  private IndexWriter iwriter;
>> >  private IndexSearcher isearcher;
>> >  private RAMDirectory idx = new RAMDirectory();
>> >  private Analyzer analyzer = new WhitespaceAnalyzer();
>> >
>> >  public void init() throws Exception {
>> >    this.iwriter =
>> >        new IndexWriter(idx, analyzer, true,
>> >            IndexWriter.MaxFieldLength.LIMITED);
>> >
>> >  }
>> >
>> >  public void destroy() throws Exception {
>> >    iwriter.close();
>> >    isearcher.close();
>> >  }
>> >
>> >  public void ready() throws Exception {
>> >    iwriter.optimize();
>> >    iwriter.close();
>> >
>> >    this.isearcher = new IndexSearcher(idx, true);
>> >  }
>> >
>> >  public void addToDictionary(String word, Integer id)
>> >      throws IOException {
>> >    Document doc = new Document();
>> >    doc.add(new Field(WORD, word, Field.Store.NO,
>> >        Field.Index.NOT_ANALYZED));
>> >    doc.add(new Field(ID, id.toString(), Store.YES,
>> >        Field.Index.NOT_ANALYZED));
>> >    // ?? Is there a way other than storing the id as a string?
>> >    iwriter.addDocument(doc);
>> >  }
>> >
>> >  public Integer get(String word) throws IOException, ParseException {
>> >    BooleanQuery query = new BooleanQuery();
>> >    query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>> >    TopDocs top = isearcher.search(query, null, 1);
>> >    ScoreDoc[] hits = top.scoreDocs;
>> >    if (hits.length == 0) return null;
>> >    return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
>> >  }
>> >
>> > On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll
>> > <gsing...@apache.org> wrote:
>> >
>> >> A Lucene index, w/ no storage, positions, etc. (optionally) turned off,
>> >> will be very efficient.  Plus, there is virtually no code to write.  I've
>> >> seen bare-bones indexes be as little as 20% of the original w/ very fast
>> >> lookup.  Furthermore, there are many options available for controlling
>> >> how much is loaded into memory, etc.  Finally, it will handle all the
>> >> languages you throw at it.
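>> >>
>> >> A sketch of such a stripped-down field, assuming the Lucene 2.9 API:
>> >>
>> >>   Field f = new Field("word", "", Field.Store.NO,
>> >>       Field.Index.NOT_ANALYZED_NO_NORMS);  // no storage, no norms
>> >>   f.setOmitTermFreqAndPositions(true);     // no freqs/positions either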
>> >>
>> >> -Grant
>> >>
>> >> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>> >>
>> >>> Currently Java strings use double the space of the characters in them
>> >>> because it's all UTF-16. A 190MB dictionary file therefore uses around
>> >>> 600MB when loaded into a HashMap<String, Integer>.  Is there some
>> >>> optimization we could do in terms of storing them, while ensuring that
>> >>> Chinese, Devanagari and other characters don't get messed up in the
>> >>> process?
>> >>>
>> >>> Some options Benson suggested were: storing just the byte[] form and
>> >>> adding the option of supplying the hash function in
>> >>> OpenObjectIntHashmap, or even using a UTF-8 string.
>> >>>
>> >>> Or we could leave this alone. I currently estimate the memory
>> >>> requirement for strings using the formula
>> >>> 8 * ((int) (num_chars * 2 + 45) / 8)
>> >>> (e.g., a 10-char string comes to 8 * ((10 * 2 + 45) / 8) = 64 bytes)
>> >>> when generating the dictionary split for the vectorizer.
>> >>>
>> >>> Robin
>> >>
>> >>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
