[jira] [Commented] (SOLR-9418) Statistical Phrase Identifier

David Smiley (Jira) Tue, 27 Aug 2019 06:00:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916697#comment-16916697
 ]


David Smiley commented on SOLR-9418:
------------------------------------

I just want to point out that there's no trace of this in the Solr Reference 
Guide, and as-such is basically a hidden feature.

> Statistical Phrase Identifier
> -----------------------------
>
>                 Key: SOLR-9418
>                 URL: https://issues.apache.org/jira/browse/SOLR-9418
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Akash Mehta
>            Assignee: Hoss Man
>            Priority: Major
>             Fix For: 7.5, 8.0
>
>         Attachments: SOLR-9418.patch, SOLR-9418.patch, SOLR-9418.patch, 
> SOLR-9418.zip
>
>
> h2. *Summary:*
> The Statistical Phrase Identifier is a Solr contribution that takes in a 
> string of text and then leverages a language model (an Apache Lucene/Solr 
> inverted index) to predict how the inputted text should be divided into 
> phrases. The intended purpose of this tool is to parse short-text queries 
> into phrases prior to executing a keyword search (as opposed parsing out each 
> keyword as a single term).
> It is being generously donated to the Solr project by CareerBuilder, with the 
> original source code and a quickly demo-able version located here:  
> [https://github.com/careerbuilder/statistical-phrase-identifier|https://github.com/careerbuilder/statistical-phrase-identifier,]
> h2. *Purpose:*
> Assume you're building a job search engine, and one of your users searches 
> for the following:
>  _machine learning research and development Portland, OR software engineer 
> AND hadoop, java_
> Most search engines will natively parse this query into the following boolean 
> representation:
>  _(machine AND learning AND research AND development AND Portland) OR 
> (software AND engineer AND hadoop AND java)_
> While this query may still yield relevant results, it is clear that the 
> intent of the user wasn't understood very well at all. By leveraging the 
> Statistical Phrase Identifier on this string prior to query parsing, you can 
> instead expect the following parsing:
> _{machine learning} \{and} \{research and development} \{Portland, OR} 
> \{software engineer} \{AND} \{hadoop,} \{java}_
> It is then possile to modify all the multi-word phrases prior to executing 
> the search:
>  _"machine learning" and "research and development" "Portland, OR" "software 
> engineer" AND hadoop, java_
> Of course, you could do your own query parsing to specifically handle the 
> boolean syntax, but the following would eventually be interpreted correctly 
> by Apache Solr and most other search engines:
>  _"machine learning" AND "research and development" AND "Portland, OR" AND 
> "software engineer" AND hadoop AND java_ 
> h2. *History:*
> This project was originally implemented by the search team at CareerBuilder 
> in the summer of 2015 for use as part of their semantic search system. In the 
> summer of 2016, Akash Mehta, implemented a much simpler version as a proof of 
> concept based upon publicly available information about the CareerBuilder 
> implementation (the first attached patch).  In July of 2018, CareerBuilder 
> open sourced their original version 
> ([https://github.com/careerbuilder/statistical-phrase-identifier),|https://github.com/careerbuilder/statistical-phrase-identifier,]
>  and agreed to also donate the code to the Apache Software foundation as a 
> Solr contribution. An Solr patch with the CareerBuilder version was added to 
> this issue on September 5th, 2018, and community feedback and contributions 
> are encouraged.
> This issue was originally titled the "Probabilistic Query Parser", but the 
> name has now been updated to "Statistical Phrase Identifier" to avoid 
> ambiguity with Solr's query parsers (per some of the feedback on this issue), 
> as the implementation is actually just a mechanism for identifying phrases 
> statistically from a string and is NOT a Solr query parser. 
> h2. *Example usage:*
> h3. (See contrib readme or configuration files in the patch for full 
> configuration details)
> h3. *{{Request:}}*
> {code:java}
> http://localhost:8983/solr/spi/parse?q=darth vader obi wan kenobi anakin 
> skywalker toad x men magneto professor xavier{code}
> h3. *{{Response:}}* 
> {code:java}
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":25},
>     "top_parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
> {toad} {x men} {magneto} {professor xavier}",
>     "top_parsed_phrases":[
>       "darth vader",
>       "obi wan kenobi",
>       "anakin skywalker",
>       "toad",
>       "x-men",
>       "magneto",
>       "professor xavier"],
>       "potential_parsings":[{
>       "parsed_phrases":["darth vader",
>       "obi wan kenobi",
>       "anakin skywalker",
>       "toad",
>       "x-men",
>       "magneto",
>       "professor xavier"],
>       "parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker} 
> {toad} {x-men} {magneto} {professor xavier}",
>     "score":0.0}]}{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-9418) Statistical Phrase Identifier

Reply via email to