Everybody realizes that the Explain feedback mechanism could use some more structure; see LUCENE-3013, SOLR-3124, etc.
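Just to make the starting point concrete: as far as I can tell, each node in the current Explanation tree only gives you a score value, a free-text description, and its child nodes, so anything smarter means scraping those description strings. I'm going from memory on the exact API, so treat this as a sketch rather than gospel, but walking the explain for one hit looks roughly like this:

    import java.io.IOException;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ExplainDump {

      // Print one document's explain tree. Per node, all we get today is a
      // score contribution, a human-readable description, and child nodes.
      static void dump(Explanation expl, String indent) {
        System.out.println(indent + expl.getValue() + " : " + expl.getDescription());
        Explanation[] details = expl.getDetails();
        if (details != null) {
          for (Explanation child : details) {
            dump(child, indent + "  ");
          }
        }
      }

      // Typical entry point for a single hit from the results.
      static void explainHit(IndexSearcher searcher, Query query, int docId) throws IOException {
        dump(searcher.explain(query, docId), "");
      }
    }

Every use case below basically starts from that wall of strings and floats.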
Clearly we need more specific setters and getters, to avoid trying to parse (guess) what each explain string means. So let's say we were to "bite the bullet" and revisit the interfaces, and then revisit ALL the classes in the code base. Possibly daunting at first, but for the sake of argument, let's say we did. As you think about use cases, and assuming you had the new methods already, some of the more interesting use cases still won't be met. It'd be a shame to revisit all those classes and still fall short, so I'd like to expand the scope a little bit to think about the types of data that would really be helpful.

From an implementation standpoint these already sound a bit "heavy weight" to me, so that's partly why I'm emailing the list. Would it be OK if explain took longer, OR is there some type of "lazy load / late binding" decoration we could add that would facilitate pulling additional info as needed? Or maybe some switch that says "light vs. heavy" so you can decide.

The list: Objects it'd be nice to have convenient access to, examples of why, and some of the potential drawbacks.

* At each node in the explain tree, access to the corresponding node in the query parse tree.
  Scenario: You'd really like to see which part of the query was "at fault". This gets harder and harder to do as your query gets large. For example, imagine a complex query editor application; wouldn't it be nice to run that query and trace back through it, to see each node? I can imagine a couple of workarounds now, but it would be nice if there were an "accounting" for each explain node.
  Potential problems: Is there a structured format in Solr that we could serialize into the output? Lucene has great query classes, and you can build them yourself, but historically Solr has lacked that.

* At each node in the explain tree, access to the field and document that matched, including some statistics.
  Scenario: Related to the next two items, broken out here for clarity. Currently you do have access to the results, which you can go find by document ID. But the document ID field name is schema dependent, and you'd have to look that up. And if a field is multivalued, which of the instances in that document's results does this "explain" correspond to? You can try to do pattern matching, but then you're trying to match up normalized tokens against the original text, a bit messy.

* At each node in the explain tree, access to the corresponding tokens that matched. Ideally this would be down to the highlight level, and would also provide access to the field definition and stored value that the token belonged to. If there are IDF calculations involved, it'd be nice to have those stats as well (currently in the text explain).
  Scenario: In debugging a query you see a token that you don't recognize. Maybe it's been highly normalized, or perhaps it's an encoded value. It'd be nice to see where this token came from, the original value (if stored), and the field definition. If there's a payload or POS (part-of-speech) tag, it'd be helpful to see that. If it's a text field, bring up a highlighted portion of that field. I'm not sure, but I believe a single explain line could correspond to more than one highlight, if the word appeared more than once in that field.

* At each node in the explain tree, an easy way to associate it with the corresponding node in another document's explain tree that came back in the same search results.
  Scenario: You're trying to figure out why one document came back in front of (or behind) another document, and you're comparing the explain trees. Many parts of the trees might be identical. Wouldn't it be nice to walk the tree and directly compare, with confidence, those same nodes in the other explain trees? Which ones are really different?
  Problem: The explain trees could be different for each document depending on what matched. Maybe it's a type of key system that is flexible enough to simply label nodes that ARE in multiple trees with the same ID, and lets you query which docs do and do not have that explain node. (A rough sketch of one possible keying scheme is in the P.S. below.)

* Overall context for the Explain tree: things like the query handler, system ID, document corpus, etc.
  Scenario: I want to compare the same document retrieved by the same query, but maybe the two tests were 6 months apart. Clearly the IDF scores will have changed. Or possibly this came through a different handler.
  Scenario: I want to compare the same document retrieved from my staging system against an explain tree from Production. The queries are likely similar, but maybe there are some differences.
  Scenario: I want to store explain samples of known queries against semi-predictable corpora, across various systems and at various points in time. I'd like to associate them back to the query, the handler, the system, statistics, etc.
  Problem: The "context" of a search is almost unbounded. In the ludicrous extreme I could zip up the entire index and configuration for each example. But there's some subset of context that I might want. At a minimum, "place and time" (maybe a system ID?), any parameters that were passed in, active search filters, default values that were used, etc. Clearly some context info can be polled for (current time, query handler config, etc.), so maybe the challenge is just to verify that enough info exists conveniently to do that, and maybe provide some sample code. Collection statistics seem like they'd be compact enough to include "for free", and would apply to all (most?) of the returned results. And maybe detailed document info is polled for if needed, maybe retrieving all fields for a fat archive, but at least having a predictable call that knows the correct ID to use.

Another aspect of this is the hierarchical nature of the explain tree. For example, the entire tree corresponds to one document. And from some intermediate node down, all of those nodes might correspond to a particular field. So there's no sense in repeating that info over and over.

I think it would be hard to layer this on top in bits and pieces with additional calls. To me it seems like the best time to gather this info is while the query is being executed.

Although this seems like a lot, imagine that you DON'T have this info. Although you could potentially decorate explanations a bit more if you had more getters, you're still left looking at a very narrow context in the middle of a very large tree. Without the other objects, it's hard to have any context or wisdom about why. Without this type of info, it'd be difficult to build much more than prettier formatting; it'd be very hard to associate it with anything "actionable".

--
Mark Bennett / LucidWorks: Search & Big Data / [email protected]
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
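P.S. For the "same node in another document's explain tree" item above, here's a very rough sketch of the kind of keying I mean, just to make it concrete. Nothing like ExplainKeyUtil exists today; it builds keys by scraping and number-stripping the description strings, which is exactly the guessing I'd rather the new API made unnecessary:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.search.Explanation;

    public class ExplainKeyUtil {

      // Build a map of "structural key" -> node for one document's explain tree.
      // The key is the chain of descriptions from the root, with numbers blanked
      // out so the same clause gets the same key across documents.
      public static Map<String, Explanation> keyedNodes(Explanation root) {
        Map<String, Explanation> keyed = new LinkedHashMap<String, Explanation>();
        collect(root, "", keyed);
        return keyed;
      }

      private static void collect(Explanation node, String parentKey, Map<String, Explanation> keyed) {
        String label = node.getDescription().replaceAll("[0-9.]+", "#");
        String key = parentKey + "/" + label;
        keyed.put(key, node);
        Explanation[] details = node.getDetails();
        if (details != null) {
          for (Explanation child : details) {
            collect(child, key, keyed);
          }
        }
      }

      // Compare two docs: which keys appear in both trees, and where do the values differ?
      public static void diff(Explanation docA, Explanation docB) {
        Map<String, Explanation> a = keyedNodes(docA);
        Map<String, Explanation> b = keyedNodes(docB);
        for (Map.Entry<String, Explanation> e : a.entrySet()) {
          Explanation other = b.get(e.getKey());
          if (other == null) {
            System.out.println("only in doc A: " + e.getKey());
          } else if (e.getValue().getValue() != other.getValue()) {
            System.out.println("differs: " + e.getKey()
                + " (" + e.getValue().getValue() + " vs " + other.getValue() + ")");
          }
        }
        for (String key : b.keySet()) {
          if (!a.containsKey(key)) {
            System.out.println("only in doc B: " + key);
          }
        }
      }
    }

Obviously this falls apart as soon as two sibling clauses normalize to the same label (say, two terms on the same field), which to me is the argument for keying off the real query parse tree node instead of a scraped string.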
