Can you define an approximate score that will give you a small candidate set that you can score in detail?

Likewise, can you restate your scoring algorithm using stack frame pairs? N-grams are often used as a very good surrogate for edit-distance scores such as the one you are trying to build.
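
Something along these lines is what I have in mind (rough, untested sketch; FIELD_FRAME_PAIRS and queryFramePairs are made-up names for an extra pair field and the query-side pairs):

// Index time: one term per adjacent pair of frames.
List<StacktraceFrame> frames = stacktrace.getFrames();
for (int i = 0; i + 1 < frames.size(); i++) {
    String pair = frames.get(i).getClassName() + "." + frames.get(i).getMethod()
        + "|" + frames.get(i + 1).getClassName() + "." + frames.get(i + 1).getMethod();
    doc.add(new Field(FIELD_FRAME_PAIRS, pair, Store.NO, Index.NOT_ANALYZED));
}

// Query time: OR together the query trace's pairs, keep only the top N as candidates,
// then run the detailed pairwise scorer on just those.
BooleanQuery candidates = new BooleanQuery();
for (String pair : queryFramePairs) {
    candidates.add(new TermQuery(new Term(FIELD_FRAME_PAIRS, pair)), BooleanClause.Occur.SHOULD);
}
TopDocs top = searcher.search(candidates, 200);
for (ScoreDoc sd : top.scoreDocs) {
    // detailed scoring only for sd.doc
}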

Sent from my iPhone

On Sep 9, 2010, at 3:36 AM, Johannes Lerch <lerch.johan...@googlemail.com> wrote:

As my tests show, about 1/4 of the documents are relevant for scoring per query. So for my example with 100,000 stacktraces in the index I need to score 25,000 documents. I have a native implementation of the scoring algorithm which scores all 100,000; it needs about 20ms. The Lucene implementation needs >100ms for the same query, which really sucks. Without retrieving fields it needs about 6ms - that's also what my target should be.

I tried without LAZY_LOAD, but there is no real difference. How can I sort by docIds first?

FieldCache.DEFAULT.getStrings is not a possibility because of the memory problem.
This is how I store frames:

for (StacktraceFrame frame : stacktrace.getFrames()) {
    doc.add(new Field(FIELD_FRAMES,
        frame.getClassName() + "." + frame.getMethod(),
        Store.YES, Index.NOT_ANALYZED));
}



2010/9/9 Michael McCandless <luc...@mikemccandless.com>

What a neat search engine!  (Searching stack traces).

Unfortunately, loading stored fields is slowish -- it entails 2 disk
seeks under the hood.  Really you should retrieve at most a page's worth
of docs, in the serial path of a query.  How many are you retrieving
per query?

That said, you shouldn't use LAZY_LOAD if you know you will need the
value.  Also, it's possible that sorting the docIDs (ascending) first
may get you better performance since your load is then a single scan
of the 2 files in the index.
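
For example (rough, untested sketch; collectMatchingDocIds() and frameSelector just stand in for whatever you already use to gather candidates and to select the frames field):

int[] docIds = collectMatchingDocIds();   // candidate docIDs, in any order
java.util.Arrays.sort(docIds);            // ascending, so loading is one forward scan
for (int docId : docIds) {
    Document d = reader.document(docId, frameSelector);
    // ... pairwise scoring on d ...
}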

You may want to use FieldCache.DEFAULT.getStrings instead -- this
gives you a very fast String[], but, may suck up tons of memory
depending on how many unique frames there are (how do you index each
frame?).
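
Roughly like this (untested; note this assumes at most one token per document in that field, so you'd have to index the frames joined into a single string per doc for it to apply):

String[] framesByDoc = FieldCache.DEFAULT.getStrings(reader, Indexer.FIELD_FRAMES);
String frames = framesByDoc[docId];   // one value per document, held entirely in RAM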

Mike

On Thu, Sep 9, 2010 at 4:01 AM, Johannes Lerch
<lerch.johan...@googlemail.com> wrote:
Hi,

I am working on a search for stacktraces. To do this I implemented my own
Query, Weight and Scorer. I store the exception, the method and the frames as
fields in the index and can pick the relevant documents by matching those
fields against my query stacktrace (using IndexReader.termDocs()). I implemented
my own scoring which is calculated pairwise for stacktraces (the one of the
query against each of the relevant documents). For this scoring I calculate a
similarity between the two traces by checking which frames exist in both and
also checking their ordering. This works similarly to diff on text/source code.
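
The similarity is roughly along the lines of the following (simplified sketch only, not the exact code):

static double similarity(List<String> a, List<String> b) {
    // Longest common subsequence of frames, normalized by the longer trace;
    // this rewards shared frames that also appear in the same order.
    int[][] lcs = new int[a.size() + 1][b.size() + 1];
    for (int i = 1; i <= a.size(); i++) {
        for (int j = 1; j <= b.size(); j++) {
            if (a.get(i - 1).equals(b.get(j - 1)))
                lcs[i][j] = lcs[i - 1][j - 1] + 1;
            else
                lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
        }
    }
    return (double) lcs[a.size()][b.size()] / Math.max(a.size(), b.size());
}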
My problem is that I need all frames contained in both stacktraces, so I have
to retrieve all frame fields of the stored stacktraces. For now I do this with:
Document document = reader.document(doc, new FieldSelector() {
    @Override
    public FieldSelectorResult accept(String fieldName) {
        if (Indexer.FIELD_FRAMES.equals(fieldName))
            return FieldSelectorResult.LAZY_LOAD;
        else
            return FieldSelectorResult.NO_LOAD;
    }
});
Fieldable[] fieldables = document.getFieldables(Indexer.FIELD_FRAMES);

But this call really decreases performance to something which is not acceptable
for me (>10 times slower with 100,000 stacktraces in the index). So my question
is: are there other ways to get stored fields, or do you have ideas for
workarounds? Would it be better to store all stacktraces in a database and
retrieve them from there? If so, how do I get the docId of the stacktraces I
wrote to the index?

Regards,
Johannes

