[jira] [Commented] (SOLR-9480) Graph Traversal for Significantly Related Terms (Semantic Knowledge Graph)

Hoss Man (JIRA) Thu, 17 May 2018 17:43:49 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479952#comment-16479952
 ]


Hoss Man commented on SOLR-9480:
--------------------------------

Updated patch with all nocommits resolved and new ref-guide content on the 
relatedness() aggregate function and using it to build SKGs.

I think this is pretty much good to go.
----
{quote}can you give a clue what are {{$fore,$back}} ?
{quote}
I'm not sure if I understand your question... are you asking about the syntax, 
or about the general concepts of foreground/background query as used in the 
relatedness function scores?

Syntactically they are regular query param {{$variable}} references passed as 
function arguments ... the sample request in the comment you replied to defined 
them as {{fore=body:%22harry+potter%22&back=\*:*}} ...but they can also just be 
passed in as string literals.
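
To make the two forms concrete, here is a rough sketch that builds the same request both ways: once with {{$fore}}/{{$back}} parameter references, once with the queries inlined as single-quoted string literals. The helper name {{build_params}} and the facet field {{body}} are placeholders I made up for illustration (the original sample request is not reproduced here), and the literal form assumes the queries contain no single quotes that would need escaping.

```python
import json
from urllib.parse import urlencode

FORE = 'body:"harry potter"'   # foreground query from the sample request
BACK = "*:*"                   # background query from the sample request


def build_params(by_reference, field="body"):
    """Build query params for a terms facet scored by relatedness().

    `field` is a hypothetical facet field used only for illustration.
    """
    if by_reference:
        # The function arguments are $variable references to request params.
        func = "relatedness($fore,$back)"
        extra = {"fore": FORE, "back": BACK}
    else:
        # The function arguments are the query strings themselves.
        # Assumes neither query contains a single quote.
        func = "relatedness('{}','{}')".format(FORE, BACK)
        extra = {}
    facet = {"terms": {"type": "terms", "field": field, "facet": {"r1": func}}}
    return {"q": "*:*", "rows": "0", "json.facet": json.dumps(facet), **extra}


by_ref = build_params(by_reference=True)
literal = build_params(by_reference=False)

# URL-encoded request bodies, suitable for something like `curl -d`.
print(urlencode(by_ref))
print(urlencode(literal))
```

The reference form keeps the facet definition reusable when the same foreground/background queries are needed by several nested {{relatedness()}} calls; the literal form is self-contained but repeats the query text in each call.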

In general, the {{relatedness()}} function takes 2 parameters that define a 
"foreground query" and a "background query" which are then used to compute a 
heuristic score indicating what sort of statistical correlation there is 
between the query for each facet bucket and the foreground set, relative to the 
background set.

There's a more self-contained example in the ref-guide edits included in the 
latest patch...
{noformat}
.Sample Documents
[source,bash,subs="verbatim,callouts"]
----
curl -sS -X POST 'http://localhost:8983/solr/gettingstarted/update?commit=true' -d '[
{"id":"01",age:15,"state":"AZ","hobbies":["soccer","painting","cycling"]},
{"id":"02",age:22,"state":"AZ","hobbies":["swimming","darts","cycling"]},
{"id":"03",age:27,"state":"AZ","hobbies":["swimming","frisbee","painting"]},
{"id":"04",age:33,"state":"AZ","hobbies":["darts"]},
{"id":"05",age:42,"state":"AZ","hobbies":["swimming","golf","painting"]},
{"id":"06",age:54,"state":"AZ","hobbies":["swimming","golf"]},
{"id":"07",age:67,"state":"AZ","hobbies":["golf","painting"]},
{"id":"08",age:71,"state":"AZ","hobbies":["painting"]},
{"id":"09",age:14,"state":"CO","hobbies":["soccer","frisbee","skiing","swimming","skating"]},
{"id":"10",age:23,"state":"CO","hobbies":["skiing","darts","cycling","swimming"]},
{"id":"11",age:26,"state":"CO","hobbies":["skiing","golf"]},
{"id":"12",age:35,"state":"CO","hobbies":["golf","frisbee","painting","skiing"]},
{"id":"13",age:47,"state":"CO","hobbies":["skiing","darts","painting","skating"]},
{"id":"14",age:51,"state":"CO","hobbies":["skiing","golf"]},
{"id":"15",age:64,"state":"CO","hobbies":["skating","cycling"]},
{"id":"16",age:73,"state":"CO","hobbies":["painting"]},
]'
----

.Example Query
[source,bash,subs="verbatim,callouts"]
----
curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:*
&back=*:*                                  # <1>
&fore=age:[35 TO *]                        # <2>
&json.facet={
  hobby : {
    type : terms,
    field : hobbies,
    limit : 5,
    sort : { r1: desc },                   # <3>
    facet : {
      r1 : "relatedness($fore,$back)",     # <4>
      location : {
        type : terms,
        field : state,
        limit : 2,
        sort : { r2: desc },               # <3>
        facet : {
          r2 : "relatedness($fore,$back)"  # <4>
        }
      }
    }
  }
}'
----
<1> Use the entire collection as our "Background Set"
<2> Use a query for "age >= 35" to define our (initial) "Foreground Set"
<3> For both the top level `hobbies` facet & the sub-facet on `state` we will 
be sorting on the `relatedness(...)` values
<4> In both calls to the `relatedness(...)` function, we use 
<<local-parameters-in-queries.adoc#parameter-dereferencing,Parameter 
Variables>> to refer to the previously defined `fore` and `back` queries. 

.The Facet Response
[source,javascript,subs="verbatim,callouts"]
----
"facets":{
  "count":16,
  "hobby":{
    "buckets":[{
        "val":"golf",
        "count":6,                                // <1>
        "r1":{
          "relatedness":0.01225,
          "foreground_popularity":0.3125,         // <2>
          "background_popularity":0.375},         // <3>
        "location":{
          "buckets":[{
              "val":"az",
              "count":3,
              "r2":{
                "relatedness":0.00496,            // <4>
                "foreground_popularity":0.1875,   // <6>
                "background_popularity":0.5}},    // <7>
            {
              "val":"co",
              "count":3,
              "r2":{
                "relatedness":-0.00496,           // <5>
                "foreground_popularity":0.125,
                "background_popularity":0.5}}]}},
      {
        "val":"painting",
        "count":8,                                // <1>
        "r1":{
          "relatedness":0.01097,
          "foreground_popularity":0.375,
          "background_popularity":0.5},
        "location":{
          "buckets":[{
            ...
----
<1> Even though `hobbies:golf` has a lower total facet `count` than 
`hobbies:painting`, it has a higher `relatedness` score, indicating that 
relative to the Background Set (the entire collection) Golf has a stronger 
correlation to our Foreground Set (people age 35+) than Painting. 
<2> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` is 
31.25% of the total number of documents in the Background Set
<3> 37.5% of the documents in the Background Set match `hobbies:golf`
<4> The state of Arizona (AZ) has a _positive_ relatedness correlation with the 
_nested_ Foreground Set (people ages 35+ who play Golf) compared to the 
Background Set -- i.e.: "People in Arizona are statistically more likely to be 
'35+ year old Golfers' than the country as a whole."
<5> The state of Colorado (CO) has a _negative_ correlation with the nested 
Foreground Set -- i.e.: "People in Colorado are statistically less likely to be 
'35+ year old Golfers' than the country as a whole."
<6> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` _and_ 
`state:AZ` is 18.75% of the total number of documents in the Background Set
<7> 50% of the documents in the Background Set match `state:AZ`

NOTE: While it's very common to define the Background Set as `\*:*`, or some 
other super-set of the Foreground Query, it is not strictly required.  The 
`relatedness(...)` function can be used to compare the statistical relatedness 
of sets of documents to orthogonal foreground/background queries.

{noformat}
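
As a sanity check on the numbers in that response, here is a small sketch that recomputes the {{foreground_popularity}} and {{background_popularity}} values straight from the 16 sample documents, following the wording of callouts 2, 3, 6 and 7. It is plain Python with no Solr involved, the helper name {{popularity}} is made up for illustration, and it deliberately does not try to reproduce the {{relatedness}} score itself.

```python
# Sample documents from the example above (only the fields used here).
DOCS = [
    {"id": "01", "age": 15, "state": "AZ", "hobbies": ["soccer", "painting", "cycling"]},
    {"id": "02", "age": 22, "state": "AZ", "hobbies": ["swimming", "darts", "cycling"]},
    {"id": "03", "age": 27, "state": "AZ", "hobbies": ["swimming", "frisbee", "painting"]},
    {"id": "04", "age": 33, "state": "AZ", "hobbies": ["darts"]},
    {"id": "05", "age": 42, "state": "AZ", "hobbies": ["swimming", "golf", "painting"]},
    {"id": "06", "age": 54, "state": "AZ", "hobbies": ["swimming", "golf"]},
    {"id": "07", "age": 67, "state": "AZ", "hobbies": ["golf", "painting"]},
    {"id": "08", "age": 71, "state": "AZ", "hobbies": ["painting"]},
    {"id": "09", "age": 14, "state": "CO", "hobbies": ["soccer", "frisbee", "skiing", "swimming", "skating"]},
    {"id": "10", "age": 23, "state": "CO", "hobbies": ["skiing", "darts", "cycling", "swimming"]},
    {"id": "11", "age": 26, "state": "CO", "hobbies": ["skiing", "golf"]},
    {"id": "12", "age": 35, "state": "CO", "hobbies": ["golf", "frisbee", "painting", "skiing"]},
    {"id": "13", "age": 47, "state": "CO", "hobbies": ["skiing", "darts", "painting", "skating"]},
    {"id": "14", "age": 51, "state": "CO", "hobbies": ["skiing", "golf"]},
    {"id": "15", "age": 64, "state": "CO", "hobbies": ["skating", "cycling"]},
    {"id": "16", "age": 73, "state": "CO", "hobbies": ["painting"]},
]


def popularity(docs, fore, back, bucket):
    """Return (foreground_popularity, background_popularity) for one bucket.

    Both values are fractions of the Background Set size, as in the callouts.
    """
    back_size = sum(1 for d in docs if back(d))
    fg = sum(1 for d in docs if fore(d) and bucket(d))
    bg = sum(1 for d in docs if back(d) and bucket(d))
    return fg / back_size, bg / back_size


def everything(d):          # back=*:*
    return True


def age_35_plus(d):         # fore=age:[35 TO *]
    return d["age"] >= 35


def golf(d):                # the hobbies:golf bucket
    return "golf" in d["hobbies"]


def golfers_35_plus(d):     # nested Foreground Set inside the golf bucket
    return age_35_plus(d) and golf(d)


top = popularity(DOCS, age_35_plus, everything, golf)
az = popularity(DOCS, golfers_35_plus, everything, lambda d: d["state"] == "AZ")
co = popularity(DOCS, golfers_35_plus, everything, lambda d: d["state"] == "CO")
print(top, az, co)
```

This prints {{(0.3125, 0.375) (0.1875, 0.5) (0.125, 0.5)}}, matching the {{r1}} values for golf and the nested {{r2}} values for AZ and CO in the response above.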

> Graph Traversal for Significantly Related Terms (Semantic Knowledge Graph)
> --------------------------------------------------------------------------
>
>                 Key: SOLR-9480
>                 URL: https://issues.apache.org/jira/browse/SOLR-9480
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Trey Grainger
>            Priority: Major
>         Attachments: SOLR-9480.patch, SOLR-9480.patch, SOLR-9480.patch, 
> SOLR-9480.patch, SOLR-9480.patch, SOLR-9480.patch
>
>
> This issue is to track the contribution of the Semantic Knowledge Graph Solr 
> Plugin (request handler), which exposes a graph-like interface for 
> discovering and traversing significant relationships between entities within 
> an inverted index.
> This data model has been described in the following research paper: [The 
> Semantic Knowledge Graph: A compact, auto-generated model for real-time 
> traversal and ranking of any relationship within a 
> domain|https://arxiv.org/abs/1609.00464], as well as in presentations I gave 
> in October 2015 at [Lucene/Solr 
> Revolution|http://www.slideshare.net/treygrainger/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine]
>  and November 2015 at the [Bay Area Search 
> Meetup|http://www.treygrainger.com/posts/presentations/searching-on-intent-knowledge-graphs-personalization-and-contextual-disambiguation/].
> The source code for this project is currently available at 
> [https://github.com/careerbuilder/semantic-knowledge-graph], and the folks at 
> CareerBuilder (where this was built) have given me the go-ahead to now 
> contribute this back to the Apache Solr Project, as well.
> Check out the Github repository, research paper, or presentations for a more 
> detailed description of this contribution. Initial patch coming soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
