Forgot my one other "big ticket item": allow for Explanations of why a document DIDN'T match. I realize this sounds a bit odd, but it would be very helpful. And the reason is hierarchical in nature: if a leaf didn't match a particular search term, and its parent is an AND operator, then the parent would inherit that reason.
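Here's a very rough sketch of what I mean. None of these classes exist in Lucene or Solr; the names are made up purely to illustrate how a non-match reason could bubble up through an AND node:

import java.util.ArrayList;
import java.util.List;

// Hypothetical classes, purely to illustrate the idea -- not part of Lucene or Solr.
class NonMatchExplanation {
    final String queryNode;   // which query node this explains, e.g. "title:foo"
    final String reason;      // why it didn't match, e.g. "term not present in field"
    final List<NonMatchExplanation> children = new ArrayList<>();

    NonMatchExplanation(String queryNode, String reason) {
        this.queryNode = queryNode;
        this.reason = reason;
    }

    // An AND (conjunction) node fails as soon as one required child fails,
    // so it can simply inherit that child's reason.
    static NonMatchExplanation forFailedConjunction(String queryNode,
                                                    NonMatchExplanation failedChild) {
        NonMatchExplanation e = new NonMatchExplanation(
            queryNode, "required clause failed: " + failedChild.reason);
        e.children.add(failedChild);
        return e;
    }
}

Even with short-circuiting, a conjunction can stop at the first failed child and still carry one useful reason up the tree.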
Also, if the query evaluation engine allows for short-circuiting, then you might not have all the reasons it didn't match, but at least one reason. It'd be interesting if we could allow a switch on the query evaluator to not do short-circuiting, so it would visit each node and provide feedback, but I'm not sure if that's feasible. And finally on this front, if we also had some way to associate nodes in all the explain trees in the result with each other, then if a node is missing for one document, we could say why. Very powerful as you compare different documents. (A rough sketch of one way to key nodes like that is at the bottom of this message.)

On Mar 14, 2013, at 2:33 PM, Mark Bennett <[email protected]> wrote:

> Everybody realizes that the Explain feedback mechanism could use some more
> structure, LUCENE-3013, SOLR-3124, etc.
>
> Clearly we need more specific setters and getters, to avoid trying to parse
> (guess) what each explain string means. So let's say we were to "bite the
> bullet" and revisit the interfaces, and then revisit ALL the classes in the
> code base. Possibly daunting at first, but for the sake of argument, let's
> say we did.
>
> As you think about use cases, and assuming you had the new methods already,
> some of the more interesting use cases still won't be met. It'd be a shame
> to revisit all those classes and still fall short, so I'd like to expand the
> scope a little bit to think about the types of data that would really be
> helpful.
>
> From an implementation standpoint these already sound a bit "heavyweight" to
> me, so that's partly why I'm emailing the list. Would it be OK if explain
> took longer, OR is there some type of "lazy load / late binding" decoration
> we could add that would facilitate pulling additional info as needed? Or
> maybe some switch that says "light vs. heavy" so you can decide.
>
> The list:
> Objects it'd be nice to have convenient access to, examples of why, and
> some of the potential drawbacks.
>
> * At each node in the explain tree, access to the corresponding node in the
> query parse tree.
> Scenario: You'd really like to see which part of the query was "at fault".
> This gets harder and harder to do as your query gets large. For example,
> imagine a complex query editor application; wouldn't it be nice to run that
> query and trace back through, to see each node?
> I can imagine a couple of workarounds now, but it would be nice if there were
> an "accounting" for each explain node.
> Potential problems:
> Is there a structured format in Solr that we could serialize into the output?
> Lucene has great query classes, and you can build them yourself, but
> historically Solr has lacked that.
>
> * At each node in the explain tree, access to the field and document that
> matched, including some statistics.
> Scenario: related to the next two items, broken out here for clarity.
> Currently you do have access to the results, which you can go find by
> document ID. But the document ID name is schema dependent, and you'd have to
> look that up. And if a field is multivalued, which of the instances in that
> document's results does this "explain" correspond to? You can try to do
> pattern matching, but then you're trying to match up normalized tokens
> against the original text, a bit messy.
>
> * At each node in the explain tree, access to the corresponding tokens that
> matched. Ideally this would be down to the highlight level, and also provide
> access to the field definition and stored value that the token belonged to.
> If there are IDF calculations involved, it'd be nice to have those stats as
> well (currently in the text explain).
> Scenario: In debugging a query you see a token that you don't recognize.
> Maybe it's been highly normalized, or perhaps it's an encoded value. It'd be
> nice to see where this token came from, the original value (if stored), and
> the field definition. If there's a payload or POS (part-of-speech) tag, it'd
> be helpful to see that. If it's a text field, bring up a highlighted portion
> of that field. I'm not sure, but I believe a single explain line could
> correspond to more than one highlight, if the word appeared more than once in
> that field.
>
> * At each node in the explain tree, an easy way to associate it with the
> corresponding node in another document's explain tree that came back in the
> same search results.
> Scenario: You're trying to figure out why one document came back in front of
> (or behind) another document, and you're comparing the explain trees. Many
> parts of the tree might be identical. Wouldn't it be nice to walk the tree
> and directly compare, with confidence, those same nodes in the other explain
> trees? Which ones are really different?
> Problem: The explain trees could be different for each document depending on
> what matched. Maybe it's a type of key system that is flexible enough to
> simply label nodes that ARE in multiple trees with the same ID, and lets you
> query which docs do and do not have that explain node.
>
> * Overall context for the Explain tree: things like query handler, system
> ID, document corpus, etc.
> Scenario: I want to compare the same document that's retrieved by the same
> query, but maybe the two tests were 6 months apart. Clearly the IDF scores
> will have changed. Or possibly this came through a different handler.
> Scenario: I want to compare the same document retrieved from my staging
> system against an explain tree from Production. The queries are likely
> similar, but maybe there are some differences.
> Scenario: I want to store explain samples of known queries against
> semi-predictable corpora across various systems and at various points in
> time. I'd like to associate them back to the query, the handler, the system,
> statistics, etc.
> Problem: the "context" of a search is almost unbounded. In the ludicrous
> extreme I could zip up the entire index and configuration for each example.
> But there's some subset of context that I might want. At a minimum, "place
> and time" (maybe a system ID?), any parameters that were passed in, active
> search filters, default values that were used, etc.
> Clearly some context info can be polled for (current time, query handler
> config, etc.), so maybe the challenge is just to verify that enough info
> exists conveniently to do that, and maybe provide some sample code.
> Collection statistics seem like they'd be compact enough to include "for
> free", and would apply to all (most?) of the returned results. And maybe
> detailed document info is polled for if needed, maybe retrieving all fields
> for a fat archive, but at least having a predictable call that knows the
> correct ID to use.
>
> Another aspect of this is the hierarchical nature of the Explain tree. For
> example, the entire tree corresponds to one document. And from some
> intermediate node down, all of those nodes might correspond to a particular
> field. So there's no sense in repeating that info over and over.
>
> I think it would be hard to layer this on top in bits and pieces with
> additional calls.
> To me it seems like the best time to gather this info is
> while the query is being executed.
>
> Although this seems like a lot, imagine that you DON'T have this info.
> Although you could potentially decorate explanations a bit more if you had
> more getters, you're still left looking at a very narrow context in the
> middle of a very large tree. Without the other objects, it's hard to have
> any context or wisdom about why.
>
> Without this type of info, it'd be difficult to build much more than prettier
> formatting; it'd be very hard to associate that with anything "actionable".
>
> --
> Mark Bennett / LucidWorks: Search & Big Data / [email protected]
> Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513
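Back to the node-association idea above (and the matching bullet in the quoted list): here's a very rough sketch of what I mean by keying explain nodes so they can be matched up across documents. Again, nothing here exists in Lucene or Solr today; the classes and the key format are hypothetical:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: key each explain node by its position in the *query*
// structure (not by what matched), so two documents' explain trees can be
// joined on the same key, and a missing key tells you what didn't match.
class KeyedExplainNode {
    final String key;      // e.g. "bool[0]/must[1]/term(title:foo)"
    final boolean matched;
    final String detail;   // score contribution, or a non-match reason

    KeyedExplainNode(String key, boolean matched, String detail) {
        this.key = key;
        this.matched = matched;
        this.detail = detail;
    }
}

class ExplainComparer {
    // Given two flattened explain trees (key -> node), return the nodes that
    // matched for docA but are missing or non-matching for docB.
    static Map<String, KeyedExplainNode> matchedOnlyInA(
            Map<String, KeyedExplainNode> docA,
            Map<String, KeyedExplainNode> docB) {
        Map<String, KeyedExplainNode> result = new HashMap<>();
        for (Map.Entry<String, KeyedExplainNode> entry : docA.entrySet()) {
            if (!entry.getValue().matched) {
                continue;
            }
            KeyedExplainNode other = docB.get(entry.getKey());
            if (other == null || !other.matched) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}

The important property is that the key is derived from the query structure rather than from what matched, so a key that's absent or non-matching for one document is itself a partial answer to "why didn't this document score there".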

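And one more sketch, for the "overall context" bullet in the quoted message: the kind of small, self-contained record that could be captured alongside each explain tree so that two captures taken months apart, or on different systems, can be compared fairly. Field names are illustrative only; this is not an existing Solr structure:

import java.time.Instant;
import java.util.Map;

// Hypothetical "context" record captured alongside the explain output.
class ExplainContext {
    final String systemId;            // "place": e.g. "staging" vs. "prod"
    final String requestHandler;      // which handler served the query
    final Instant capturedAt;         // "time"
    final Map<String, String> params; // request parameters actually in effect
    final long numDocs;               // cheap collection-level statistics that
    final long maxDoc;                // affect IDF and therefore scores

    ExplainContext(String systemId, String requestHandler, Instant capturedAt,
                   Map<String, String> params, long numDocs, long maxDoc) {
        this.systemId = systemId;
        this.requestHandler = requestHandler;
        this.capturedAt = capturedAt;
        this.params = params;
        this.numDocs = numDocs;
        this.maxDoc = maxDoc;
    }
}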