Can you define an approximate score that will give you a small candidate set that you can then score in detail? Likewise, can you restate your scoring algorithm in terms of stack frame pairs? N-grams are often used as a very good surrogate for edit-distance-style scores such as the one you are trying to build.
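For example, something along these lines might work as a first pass (a rough, untested sketch -- the "framePairs" field name, the "|" separator, the candidate count, and queryFramePairs/searcher are just placeholders for whatever you already have): index each pair of consecutive frames as one untokenized term, OR those pairs together at query time, and run your expensive pairwise scoring only on the small top-N result.

// Indexing: one term per pair of consecutive frames.
List<StacktraceFrame> frames = stacktrace.getFrames();
for (int i = 0; i < frames.size() - 1; i++) {
  String pair = frames.get(i).getClassName() + "." + frames.get(i).getMethod()
      + "|" + frames.get(i + 1).getClassName() + "." + frames.get(i + 1).getMethod();
  doc.add(new Field("framePairs", pair, Store.NO, Index.NOT_ANALYZED));
}

// Querying: documents sharing many frame pairs with the query trace rank highest.
BooleanQuery candidates = new BooleanQuery();
for (String pair : queryFramePairs) {
  candidates.add(new TermQuery(new Term("framePairs", pair)), BooleanClause.Occur.SHOULD);
}
TopDocs top = searcher.search(candidates, 200); // score only these in detail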
On Sep 9, 2010, at 3:36 AM, Johannes Lerch <lerch.johan...@googlemail.com> wrote:
As my tests show, about 1/4 of the documents are relevant for scoring per query, so for my example with 100000 stacktraces in the index I need to score 25000 documents. I have a native implementation of the scoring algorithm which scores all 100000; it needs about 20ms. The Lucene implementation needs >100ms for the same query, which really hurts. Without retrieving fields it needs about 6ms -- that is also roughly my target.

I tried without LAZY_LOAD, but there is no real difference. How can I sort by docIds first? FieldCache.DEFAULT.getStrings is not an option because of the memory problem.

This is how I store the frames:
for (StacktraceFrame frame : stacktrace.getFrames()) {
  doc.add(new Field(FIELD_FRAMES,
      frame.getClassName() + "." + frame.getMethod(),
      Store.YES, Index.NOT_ANALYZED));
}
2010/9/9 Michael McCandless <luc...@mikemccandless.com>
What a neat search engine! (Searching stack traces).
Unfortunately, loading stored fields is slowish -- it entails 2 disk seeks under the hood. Really you should retrieve at most a page worth of docs in the serial path of a query. How many are you retrieving per query?

That said, you shouldn't use LAZY_LOAD if you know you will need the value. Also, it's possible that sorting the docIDs (ascending) first may get you better performance, since your load is then a single scan of the 2 files in the index.
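Something like this, I mean (just a sketch -- collectDocIds() stands in for however you gather the matching docIDs from your Scorer / termDocs() walk):

int[] docIds = collectDocIds();          // hypothetical helper: your own collection logic
java.util.Arrays.sort(docIds);           // ascending docID order
for (int docId : docIds) {
  Document d = reader.document(docId);   // now a forward-only sweep over the .fdt/.fdx files
  // ... run your pairwise scoring against the query stacktrace ...
}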
You may want to use FieldCache.DEFAULT.getStrings instead -- this gives you a very fast String[], but it may suck up tons of memory depending on how many unique frames there are (how do you index each frame?).
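For reference, the FieldCache call itself is a one-liner, but it un-inverts the whole field into RAM and really assumes a single value per document (which may not match how you index your frames):

// One cached String per document; after the first call, lookups are plain array reads.
String[] frameByDoc = FieldCache.DEFAULT.getStrings(reader, Indexer.FIELD_FRAMES);
String frame = frameByDoc[docId];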
Mike
On Thu, Sep 9, 2010 at 4:01 AM, Johannes Lerch <lerch.johan...@googlemail.com> wrote:
Hi,
I am working on a search for stacktraces. To do this I implemented my own Query, Weight and Scorer. I save the exception, method and frames as fields in the index and am able to pick the relevant documents by matching those fields against my query stacktrace (using IndexReader.termDocs()). I implemented my own scoring, which is calculated pairwise between stacktraces (the one from the query and each of the relevant documents). For this scoring I calculate a similarity between both traces by checking which frames exist in both and also checking their ordering. It works similarly to diff on text/source code.
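Roughly, it looks like this sketch (heavily simplified -- frame identity here is className + "." + method, and my real ordering check is more involved):

// Simplified pairwise similarity: overlap of frames present in both traces,
// with a bonus when the shared frames occur in the same relative order.
double similarity(List<String> queryFrames, List<String> docFrames) {
  Set<String> shared = new HashSet<String>(queryFrames);
  shared.retainAll(docFrames);
  if (shared.isEmpty()) return 0.0;

  double overlap = (double) shared.size()
      / Math.max(queryFrames.size(), docFrames.size());

  List<String> a = new ArrayList<String>(queryFrames);
  a.retainAll(shared);
  List<String> b = new ArrayList<String>(docFrames);
  b.retainAll(shared);
  double ordering = a.equals(b) ? 1.0 : 0.5;   // shared frames in the same order?

  return overlap * ordering;
}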
My problem is that I need all frames contained in both stacktraces, so I have to retrieve all frame fields of the stored stacktraces. For now I do this with:
Document document = reader.document(doc, new FieldSelector() {
  @Override
  public FieldSelectorResult accept(String fieldName) {
    if (Indexer.FIELD_FRAMES.equals(fieldName))
      return FieldSelectorResult.LAZY_LOAD;
    else
      return FieldSelectorResult.NO_LOAD;
  }
});
Fieldable[] fieldables = document.getFieldables(Indexer.FIELD_FRAMES);
But this call really hurts performance, to a degree that is not acceptable for me (>10 times slower with 100000 stacktraces in the index). So my question is: are there other ways to get stored fields, or do you have ideas for workarounds? Would it be better to store all stacktraces in a database and retrieve them from there? If so, how do I get the docId of the stacktraces I wrote to the index?
Regards,
Johannes