[ https://issues.apache.org/jira/browse/SOLR-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466664#comment-16466664 ]
Hoss Man commented on SOLR-9480:
--------------------------------

I've been playing with this "SKG" code off and on for a bit, and talking with Trey about it offline, and I think I've come up with a nice strategy for integrating it directly into JSON Faceting as a handful of small improvements. That lets it leverage most of the existing JSON Facet code (including distributed refinement) without needing to be "tacked on" to the outside as much as the code in the older patch/github-repo is.

The attached patch includes the 3 most "important" of these improvements needed for this to work, along with lots of new tests...

# Refactoring the {{SlotAcc}} so that every time it's asked to {{collect(...)}} a slot, it has the ability to ask for a {{Query}} that identifies this slot _independent of the current context_
** This allows the new SKG code (described below) to have enough information about the current bucket that it can compute the full "foreground" that it wants, regardless of the bucket type or how the current bucket may be nested under other facets ... so SKG graphs can be built by nesting any types of facets (including range & query facets) regardless of how the top level {{q + fq}} params relate to the {{foreground}} query.
** This happens via an {{IntFunction<Query>}} callback function, so there shouldn't be any overhead to existing FacetProcessor/SlotAcc usages that don't care about this extra info about the bucket: there's no extra {{TermQuery}} (or {{RangeQuery}} etc...) overhead when the accumulators only care about the final filtered set of documents for the current bucket/slot.
# Add an {{skg(...)}} AggValueSource function that can be nested under any facet
** this function takes in the "foreground" and "background" queries to use, which, just like with any existing (aggregate) function, can be {{$variables}} pointing to existing request params
** this means that, unlike the original SKG code linked from this issue, you could compute the SKG relatedness info only at certain points in the facet hierarchy, or use different foreground/background queries in different places
** this function actually produces JSON "objects" as the function result, containing the foreground/background popularities as well as the "relatedness" score, which is what's used if you sort on this function
** I originally experimented with implementing "SKG" as a new type of _facet_ that could be nested under any (other) facet, but implementing it as a function means that we can leverage the existing code for sorting (parent) facet buckets by the (child) function's results, which is very powerful for SKG results (it's not currently possible to "sort" on the results of a sub-facet, and doing so would be a lot of work given how sub-facet refinement is currently handled ... I looked into it briefly)
** but sorting on the {{skg()}} function is optional, and not strictly necessary when the clients care more about performance than accuracy. As with the existing SKG code Trey contributed, the (default) sort on facet count could still be used, which means the existing JSON faceting code would only compute the (semi-expensive) {{skg()}} function on the final buckets to be returned, and the client could then post-process to re-sort them by the {{skg()}} values.
# Add support for an "explicit query domain" via syntax like {{domain : \{query:'foo:bar'\}}} (or any other JSON query syntax supported by the {{filters}} option) that lets you arbitrarily pick any set of queries you want to use as a "domain" for a facet, regardless of its parent facets/bucket.
** this provides an optional way to improve the "top n" accuracy of sub-facets in a deep SKG request, by letting you ignore the "ancestor facet bucket filtering" typically done in faceting, and instead request that *all* buckets under some arbitrary query (like the original background query) be considered.
** SKG users that care more about speed & approximations can ignore this feature, and just sort the regular facet terms by the {{skg()}} function to get a good approximation of the top terms ... or, as I mentioned before, trust the (default) sort on facet counts (with or without using the {{$background_q}} as an explicit domain) to approximate the top N terms

An example of what all these features together can look like right now...

{noformat}
rows=0&
q=type:QUESTION&
fore=body:%22harry+potter%22&
back=*:*&
json.facet='{
  tags : {
    type : terms,
    field : tags,
    limit : 5,
    sort : { skg: desc },
    facet : {
      skg : "skg($fore,$back)",
      body : {
        type : terms,
        field : body,
        limit : 5,
        domain : { query:{param:back} },
        sort : { skg: desc },
        facet : {
          skg : "skg($fore,$back)"
        }
      }
    }
  }
}'
{noformat}

There are still lots of things not included in the patch that could be added later to make all of this better and/or easier to use, and in most cases they would be general improvements to JSON Faceting...

* As noted in some {{TODO}} comments, I would love to enhance the syntax of the {{skg()}} function in a couple of ways...
** making the queries optional, and inheriting them from "ancestor" function instances higher up the tree...
{noformat}
{
  tags : {
    type : terms,
    field : tags,
    facet : {
      skg : "skg($fore,$back)",
      body : {
        type : terms,
        field : body,
        facet : {
          skg : "skg()" // inherits the $fore/$back queries from the 'skg' function of the parent facet
        }
      }
    }
  }
}
{noformat}
** I'd also like to improve the way JSON Facet functions are parsed, along the lines of what's described in SOLR-11709, in order to support more "optional" args that could be used by {{skg()}} to override some of its default behavior...
*** this would be implemented under the covers by passing the extra map keys as the "localParams" for the ValueSourceParser
*** Example: telling {{skg()}} that its effective "sort" value should be based on the "foreground_pop" instead of the (default) "relatedness"...
{noformat}
tags : {
  type : terms,
  field : tags,
  sort : "skg desc",
  facet : {
    skg : {
      type : func,
      func : "skg($fore,$back)",
      sort_value : foreground_pop
    }
  }
}
{noformat}
*** this could also be used to implement a {{min_pop}} type value, that could be used to configure the {{skg()}} function to return a relatedness of {{-Infinity}} for any bucket that didn't have foreground/background popularity ratios at least as high as some user specified value.
* Similar to how the {{rerank}} request param allows people to collect & score documents using a "cheap" query, and then re-score the top N using a more expensive query, I think it would be handy if JSON Facets supported a {{resort}} option that could be used on any {{FacetRequestSorted}} instance right alongside the {{sort}} param, using the same JSON syntax, so that clients could have Solr internally sort all the facet buckets by something simple (like count) and then "Re-Sort" the top {{N=limit}} (or maybe {{N=limit+overrequest}}?)
using a more expensive function like {{skg()}}

...however, I think most of this would be best left to other (future) Jiras, and they are only marked {{TODO}} in the current patch (if mentioned at all)

----

My current focus is on resolving the outstanding {{nocommits}}, which tend to fall into these main categories (in order of importance) ...

* resolving randomized test failures
** I used {{TestCloudJSONFacetJoinDomain}} as inspiration for a new {{TestCloudJSONFacetSKG}} that similarly tries to generate random indexes & requests and then "prove" that the results of those requests are accurate via verification queries
** I initially thought using {{refine:true}} + {{mincount:0}} + {{processEmpty:true}} would allow me to "prove" that the SKG results were accurate by executing the equivalent foreground/background queries for each bucket, but even with those options I'm seeing some popularity ratios that are missing the denominator (size) from some shards when the numerator (count) is 0 ... making me think there is either some flaw in my reasoning about the provability, or some bug where the existing refinement logic isn't picking up the function contributions of some shards when the doc count is 0
** even if this test approach proves flawed, the functionality itself can still be useful since it's largely about computing statistical approximations, but I want to be 100% sure I understand *why* the test is failing before writing it off
* refactoring some similar code
** the SKG distributed merging data structure is currently completely independent from the single-shard "SlotVal" objects ... this should be refactored to share code
* can the distributed results be more efficient?
** right now the redundant fore & back "size" values (which are the same for every slot/bucket) are returned for every bucket ... I'd like to try and figure out if I can put that data in the facet "context" to reduce the shard response size.
* figuring out what/how/where to put info in the facetDebug output
** it seems like it could be handy for people to be able to access the raw fore & back / count & size values for each bucket when debugging facets; I just have to figure out how to do that
* javadocs
* naming
** "Semantic Knowledge Graph" seems like a good name for the _concept_ of how these features can be used/combined, but the current _function_ {{skg(...)}} seems like it should probably have a name more specific to the underlying relatedness formula ... but I still don't really understand where exactly that formula comes from, so I'm not really clear yet on what a better name might be.

----

Any feedback/comments/concerns about this approach would be appreciated

> Graph Traversal for Significantly Related Terms (Semantic Knowledge Graph)
> --------------------------------------------------------------------------
>
>                 Key: SOLR-9480
>                 URL: https://issues.apache.org/jira/browse/SOLR-9480
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Trey Grainger
>            Priority: Major
>         Attachments: SOLR-9480.patch, SOLR-9480.patch
>
>
> This issue is to track the contribution of the Semantic Knowledge Graph Solr
> Plugin (request handler), which exposes a graph-like interface for
> discovering and traversing significant relationships between entities within
> an inverted index.
> This data model has been described in the following research paper: [The
> Semantic Knowledge Graph: A compact, auto-generated model for real-time
> traversal and ranking of any relationship within a
> domain|https://arxiv.org/abs/1609.00464], as well as in presentations I gave
> in October 2015 at [Lucene/Solr
> Revolution|http://www.slideshare.net/treygrainger/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine]
> and November 2015 at the [Bay Area Search
> Meetup|http://www.treygrainger.com/posts/presentations/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation/].
> The source code for this project is currently available at
> [https://github.com/careerbuilder/semantic-knowledge-graph], and the folks at
> CareerBuilder (where this was built) have given me the go-ahead to now
> contribute this back to the Apache Solr Project, as well.
> Check out the Github repository, research paper, or presentations for a more
> detailed description of this contribution. Initial patch coming soon.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
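
For anyone who wants to try the "sort by facet count in Solr, then re-sort on the client" option mentioned in the comment, here is a minimal Python sketch of the client-side step. It is only an illustration under assumptions: it presumes each bucket carries the function result under the facet key {{skg}} as an object with a {{relatedness}} entry (as the comment describes), and the bucket values below are made up for the example rather than taken from a real index. The helper name {{resort_by_relatedness}} is hypothetical.

```python
from typing import Any, Dict, List


def resort_by_relatedness(
    buckets: List[Dict[str, Any]], func_key: str = "skg"
) -> List[Dict[str, Any]]:
    """Return a new list of buckets ordered by descending relatedness.

    Buckets that lack the function result (or its 'relatedness' entry)
    sort last; ties are broken by the plain facet count.
    """

    def sort_key(bucket: Dict[str, Any]):
        stats = bucket.get(func_key) or {}
        return (stats.get("relatedness", float("-inf")), bucket.get("count", 0))

    return sorted(buckets, key=sort_key, reverse=True)


# Made-up buckets, shaped like a JSON Facet terms response that was
# sorted by count on the Solr side (assumed layout, not real output).
buckets = [
    {"val": "movies", "count": 300, "skg": {"relatedness": 0.10}},
    {"val": "books", "count": 120, "skg": {"relatedness": 0.42}},
    {"val": "magic", "count": 45, "skg": {"relatedness": 0.87}},
    {"val": "misc", "count": 10},  # no function result for this bucket
]

resorted = resort_by_relatedness(buckets)
print([b["val"] for b in resorted])
```

The trade-off is the one described in the comment: Solr only computes the function for the buckets it returns, so the re-sorted list is an approximation of the true top terms by relatedness.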