Can you define an approximate score that will give you a small candidate set that you can then score in detail? Likewise, can you restate your scoring algorithm in terms of stack frame pairs? N-grams are often used as a very good surrogate for edit-distance-style scores such as the one you are trying to build.
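For example, something along these lines might work as a first pass (a rough, untested sketch -- the "framePairs" field name, the "|" separator, the candidate count, and queryFramePairs/searcher are just placeholders for whatever you already have): index each pair of consecutive frames as one untokenized term, OR those pairs together at query time, and run your expensive pairwise scoring only on the small top-N result.

// Indexing: one term per pair of consecutive frames.
List<StacktraceFrame> frames = stacktrace.getFrames();
for (int i = 0; i < frames.size() - 1; i++) {
  String pair = frames.get(i).getClassName() + "." + frames.get(i).getMethod()
      + "|" + frames.get(i + 1).getClassName() + "." + frames.get(i + 1).getMethod();
  doc.add(new Field("framePairs", pair, Store.NO, Index.NOT_ANALYZED));
}

// Querying: documents sharing many frame pairs with the query trace rank highest.
BooleanQuery candidates = new BooleanQuery();
for (String pair : queryFramePairs) {
  candidates.add(new TermQuery(new Term("framePairs", pair)), BooleanClause.Occur.SHOULD);
}
TopDocs top = searcher.search(candidates, 200); // score only these in detail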
On Sep 9, 2010, at 3:36 AM, Johannes Lerch <lerch.johan...@googlemail.com> wrote:
As my tests show, about 1/4 of the documents are relevant for scoring per query, so for my example with 100000 stacktraces in the index I need to score 25000 documents. I have a native implementation of the scoring algorithm which scores all 100000; it needs about 20ms. The Lucene implementation needs >100ms for the same query, which really hurts. Without retrieving fields it needs about 6ms -- that is also roughly my target.

I tried without LAZY_LOAD, but there is no real difference. How can I sort by docIds first? FieldCache.DEFAULT.getStrings is not an option because of the memory problem.

This is how I store the frames:
for (StacktraceFrame frame : stacktrace.getFrames()) {
  doc.add(new Field(FIELD_FRAMES,
      frame.getClassName() + "." + frame.getMethod(),
      Store.YES, Index.NOT_ANALYZED));
}
2010/9/9 Michael McCandless <luc...@mikemccandless.com>
What a neat search engine! (Searching stack traces).
Unfortunately, loading stored fields is slowish -- it entails 2 disk seeks under the hood. Really you should retrieve at most a page worth of docs in the serial path of a query. How many are you retrieving per query?

That said, you shouldn't use LAZY_LOAD if you know you will need the value. Also, it's possible that sorting the docIDs (ascending) first may get you better performance, since your load is then a single scan of the 2 files in the index.
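Something like this, I mean (just a sketch -- collectDocIds() stands in for however you gather the matching docIDs from your Scorer / termDocs() walk):

int[] docIds = collectDocIds();          // hypothetical helper: your own collection logic
java.util.Arrays.sort(docIds);           // ascending docID order
for (int docId : docIds) {
  Document d = reader.document(docId);   // now a forward-only sweep over the .fdt/.fdx files
  // ... run your pairwise scoring against the query stacktrace ...
}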
You may want to use FieldCache.DEFAULT.getStrings instead -- this gives you a very fast String[], but it may suck up tons of memory depending on how many unique frames there are (how do you index each frame?).
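For reference, the FieldCache call itself is a one-liner, but it un-inverts the whole field into RAM and really assumes a single value per document (which may not match how you index your frames):

// One cached String per document; after the first call, lookups are plain array reads.
String[] frameByDoc = FieldCache.DEFAULT.getStrings(reader, Indexer.FIELD_FRAMES);
String frame = frameByDoc[docId];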
Mike
On Thu, Sep 9, 2010 at 4:01 AM, Johannes Lerch <lerch.johan...@googlemail.com> wrote:
Hi,
I am working on a search for stacktraces. To do this I implemented my own Query, Weight and Scorer. I save the exception, method and frames as fields in the index and am able to pick the relevant documents by matching those fields against my query stacktrace (using IndexReader.termDocs()). I implemented my own scoring, which is calculated pairwise between stacktraces (the one from the query and each of the relevant documents). For this scoring I calculate a similarity between both traces by checking which frames exist in both and also checking their ordering. It works similarly to diff on text/source code.
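Roughly, it looks like this sketch (heavily simplified -- frame identity here is className + "." + method, and my real ordering check is more involved):

// Simplified pairwise similarity: overlap of frames present in both traces,
// with a bonus when the shared frames occur in the same relative order.
double similarity(List<String> queryFrames, List<String> docFrames) {
  Set<String> shared = new HashSet<String>(queryFrames);
  shared.retainAll(docFrames);
  if (shared.isEmpty()) return 0.0;

  double overlap = (double) shared.size()
      / Math.max(queryFrames.size(), docFrames.size());

  List<String> a = new ArrayList<String>(queryFrames);
  a.retainAll(shared);
  List<String> b = new ArrayList<String>(docFrames);
  b.retainAll(shared);
  double ordering = a.equals(b) ? 1.0 : 0.5;   // shared frames in the same order?

  return overlap * ordering;
}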
My problem is that I need all frames contained in both stacktraces, so I have to retrieve all frame fields of the stored stacktraces. For now I do this with:
Document document = reader.document(doc, new FieldSelector() {
  @Override
  public FieldSelectorResult accept(String fieldName) {
    if (Indexer.FIELD_FRAMES.equals(fieldName))
      return FieldSelectorResult.LAZY_LOAD;
    else
      return FieldSelectorResult.NO_LOAD;
  }
});
Fieldable[] fieldables = document.getFieldables(Indexer.FIELD_FRAMES);
But this call really hurts performance, to a degree that is not acceptable for me (>10 times slower with 100000 stacktraces in the index). So my question is: are there other ways to get stored fields, or do you have ideas for workarounds? Would it be better to store all stacktraces in a database and retrieve them from there? If so, how do I get the docId of the stacktraces I wrote to the index?
Regards,
Johannes