Everybody realizes that the Explain feedback mechanism could use some more structure; see LUCENE-3013, SOLR-3124, etc.
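Just to make the starting point concrete: as far as I can tell, each node in the current Explanation tree only gives you a score value, a free-text description, and its child nodes, so anything smarter means scraping those description strings. I'm going from memory on the exact API, so treat this as a sketch rather than gospel, but walking the explain for one hit looks roughly like this:

    import java.io.IOException;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ExplainDump {

      // Print one document's explain tree. Per node, all we get today is a
      // score contribution, a human-readable description, and child nodes.
      static void dump(Explanation expl, String indent) {
        System.out.println(indent + expl.getValue() + " : " + expl.getDescription());
        Explanation[] details = expl.getDetails();
        if (details != null) {
          for (Explanation child : details) {
            dump(child, indent + "  ");
          }
        }
      }

      // Typical entry point for a single hit from the results.
      static void explainHit(IndexSearcher searcher, Query query, int docId) throws IOException {
        dump(searcher.explain(query, docId), "");
      }
    }

Every use case below basically starts from that wall of strings and floats.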
Clearly we need more specific setters and getters, to avoid trying to parse (guess) what each explain string means. So let's say we were to "bite the bullet" and revisit the interfaces, and then revisit ALL the classes in the code base. Possibly daunting at first, but for the sake of argument, let's say we did. As you think about use cases, and assuming you had the new methods already, some of the more interesting use cases still won't be met. It'd be a shame to revisit all those classes and still fall short, so I'd like to expand the scope a little bit to think about the types of data that would really be helpful.

From an implementation standpoint these already sound a bit "heavy weight" to me, so that's partly why I'm emailing the list. Would it be OK if explain took longer, OR is there some type of "lazy load / late binding" decoration we could add that would facilitate pulling additional info as needed? Or maybe some switch that says "light vs. heavy" so you can decide.

The list: Objects it'd be nice to have convenient access to, examples of why, and some of the potential drawbacks.

* At each node in the explain tree, access to the corresponding node in the query parse tree.
  Scenario: You'd really like to see which part of the query was "at fault". This gets harder and harder to do as your query gets large. For example, imagine a complex query editor application; wouldn't it be nice to run that query and trace back through it, to see each node? I can imagine a couple of workarounds now, but it would be nice if there were an "accounting" for each explain node.
  Potential problems: Is there a structured format in Solr that we could serialize into the output? Lucene has great query classes, and you can build them yourself, but historically Solr has lacked that.

* At each node in the explain tree, access to the field and document that matched, including some statistics.
  Scenario: Related to the next two items, broken out here for clarity. Currently you do have access to the results, which you can go find by document ID. But the document ID field name is schema dependent, and you'd have to look that up. And if a field is multivalued, which of the instances in that document's results does this "explain" correspond to? You can try to do pattern matching, but then you're trying to match up normalized tokens against the original text, a bit messy.

* At each node in the explain tree, access to the corresponding tokens that matched. Ideally this would be down to the highlight level, and would also provide access to the field definition and stored value that the token belonged to. If there are IDF calculations involved, it'd be nice to have those stats as well (currently in the text explain).
  Scenario: In debugging a query you see a token that you don't recognize. Maybe it's been highly normalized, or perhaps it's an encoded value. It'd be nice to see where this token came from, the original value (if stored), and the field definition. If there's a payload or POS (part-of-speech) tag, it'd be helpful to see that. If it's a text field, bring up a highlighted portion of that field. I'm not sure, but I believe a single explain line could correspond to more than one highlight, if the word appeared more than once in that field.

* At each node in the explain tree, an easy way to associate it with the corresponding node in another document's explain tree that came back in the same search results.
  Scenario: You're trying to figure out why one document came back in front of (or behind) another document, and you're comparing the explain trees. Many parts of the trees might be identical. Wouldn't it be nice to walk the tree and directly compare, with confidence, those same nodes in the other explain trees? Which ones are really different?
  Problem: The explain trees could be different for each document depending on what matched. Maybe it's a type of key system that is flexible enough to simply label nodes that ARE in multiple trees with the same ID, and lets you query which docs do and do not have that explain node. (A rough sketch of one possible keying scheme is in the P.S. below.)

* Overall context for the Explain tree: things like the query handler, system ID, document corpus, etc.
  Scenario: I want to compare the same document retrieved by the same query, but maybe the two tests were 6 months apart. Clearly the IDF scores will have changed. Or possibly this came through a different handler.
  Scenario: I want to compare the same document retrieved from my staging system against an explain tree from Production. The queries are likely similar, but maybe there are some differences.
  Scenario: I want to store explain samples of known queries against semi-predictable corpora, across various systems and at various points in time. I'd like to associate them back to the query, the handler, the system, statistics, etc.
  Problem: The "context" of a search is almost unbounded. In the ludicrous extreme I could zip up the entire index and configuration for each example. But there's some subset of context that I might want. At a minimum, "place and time" (maybe a system ID?), any parameters that were passed in, active search filters, default values that were used, etc. Clearly some context info can be polled for (current time, query handler config, etc.), so maybe the challenge is just to verify that enough info exists conveniently to do that, and maybe provide some sample code. Collection statistics seem like they'd be compact enough to include "for free", and would apply to all (most?) of the returned results. And maybe detailed document info is polled for if needed, maybe retrieving all fields for a fat archive, but at least having a predictable call that knows the correct ID to use.

Another aspect of this is the hierarchical nature of the explain tree. For example, the entire tree corresponds to one document. And from some intermediate node down, all of those nodes might correspond to a particular field. So there's no sense in repeating that info over and over.

I think it would be hard to layer this on top in bits and pieces with additional calls. To me it seems like the best time to gather this info is while the query is being executed.

Although this seems like a lot, imagine that you DON'T have this info. Although you could potentially decorate explanations a bit more if you had more getters, you're still left looking at a very narrow context in the middle of a very large tree. Without the other objects, it's hard to have any context or wisdom about why. Without this type of info, it'd be difficult to build much more than prettier formatting; it'd be very hard to associate it with anything "actionable".

--
Mark Bennett / LucidWorks: Search & Big Data / [email protected]
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
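P.S. For the "same node in another document's explain tree" item above, here's a very rough sketch of the kind of keying I mean, just to make it concrete. Nothing like ExplainKeyUtil exists today; it builds keys by scraping and number-stripping the description strings, which is exactly the guessing I'd rather the new API made unnecessary:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.search.Explanation;

    public class ExplainKeyUtil {

      // Build a map of "structural key" -> node for one document's explain tree.
      // The key is the chain of descriptions from the root, with numbers blanked
      // out so the same clause gets the same key across documents.
      public static Map<String, Explanation> keyedNodes(Explanation root) {
        Map<String, Explanation> keyed = new LinkedHashMap<String, Explanation>();
        collect(root, "", keyed);
        return keyed;
      }

      private static void collect(Explanation node, String parentKey, Map<String, Explanation> keyed) {
        String label = node.getDescription().replaceAll("[0-9.]+", "#");
        String key = parentKey + "/" + label;
        keyed.put(key, node);
        Explanation[] details = node.getDetails();
        if (details != null) {
          for (Explanation child : details) {
            collect(child, key, keyed);
          }
        }
      }

      // Compare two docs: which keys appear in both trees, and where do the values differ?
      public static void diff(Explanation docA, Explanation docB) {
        Map<String, Explanation> a = keyedNodes(docA);
        Map<String, Explanation> b = keyedNodes(docB);
        for (Map.Entry<String, Explanation> e : a.entrySet()) {
          Explanation other = b.get(e.getKey());
          if (other == null) {
            System.out.println("only in doc A: " + e.getKey());
          } else if (e.getValue().getValue() != other.getValue()) {
            System.out.println("differs: " + e.getKey()
                + " (" + e.getValue().getValue() + " vs " + other.getValue() + ")");
          }
        }
        for (String key : b.keySet()) {
          if (!a.containsKey(key)) {
            System.out.println("only in doc B: " + key);
          }
        }
      }
    }

Obviously this falls apart as soon as two sibling clauses normalize to the same label (say, two terms on the same field), which to me is the argument for keying off the real query parse tree node instead of a scraped string.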
