[
https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916697#comment-16916697
]
David Smiley commented on SOLR-9418:
------------------------------------
I just want to point out that there's no trace of this in the Solr Reference
Guide, and as such it is basically a hidden feature.
> Statistical Phrase Identifier
> -----------------------------
>
> Key: SOLR-9418
> URL: https://issues.apache.org/jira/browse/SOLR-9418
> Project: Solr
> Issue Type: New Feature
> Reporter: Akash Mehta
> Assignee: Hoss Man
> Priority: Major
> Fix For: 7.5, 8.0
>
> Attachments: SOLR-9418.patch, SOLR-9418.patch, SOLR-9418.patch,
> SOLR-9418.zip
>
>
> h2. *Summary:*
> The Statistical Phrase Identifier is a Solr contribution that takes in a
> string of text and then leverages a language model (an Apache Lucene/Solr
> inverted index) to predict how the inputted text should be divided into
> phrases. The intended purpose of this tool is to parse short-text queries
> into phrases prior to executing a keyword search (as opposed to parsing out
> each keyword as a single term).
> It is being generously donated to the Solr project by CareerBuilder, with the
> original source code and a quickly demo-able version located here:
> [https://github.com/careerbuilder/statistical-phrase-identifier|https://github.com/careerbuilder/statistical-phrase-identifier]
> h2. *Purpose:*
> Assume you're building a job search engine, and one of your users searches
> for the following:
> _machine learning research and development Portland, OR software engineer
> AND hadoop, java_
> Most search engines will natively parse this query into the following boolean
> representation:
> _(machine AND learning AND research AND development AND Portland) OR
> (software AND engineer AND hadoop AND java)_
> While this query may still yield relevant results, it is clear that the
> intent of the user wasn't understood very well at all. By leveraging the
> Statistical Phrase Identifier on this string prior to query parsing, you can
> instead expect the following parsing:
> _{machine learning} \{and} \{research and development} \{Portland, OR}
> \{software engineer} \{AND} \{hadoop,} \{java}_
> It is then possible to wrap all the multi-word phrases in quotes prior to
> executing the search:
> _"machine learning" and "research and development" "Portland, OR" "software
> engineer" AND hadoop, java_
> Of course, you could do your own query parsing to specifically handle the
> boolean syntax, but the following would eventually be interpreted correctly
> by Apache Solr and most other search engines:
> _"machine learning" AND "research and development" AND "Portland, OR" AND
> "software engineer" AND hadoop AND java_
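The rewrite described above can be sketched in a few lines of Python. This is only an illustration, not part of the patch: the function names `phrases_from_braces` and `to_boolean_query` are hypothetical, and the sketch assumes that free-standing boolean words and trailing commas in the identifier's output should be dropped, which is what the final query shown above implies.

```python
import re

# Assumption: stand-alone boolean words from the user's text are treated as noise.
BOOLEAN_WORDS = {"and", "or", "not"}


def phrases_from_braces(parsed_query):
    """Return the text inside each {...} group of a brace-delimited parse."""
    return re.findall(r"\{([^{}]*)\}", parsed_query)


def to_boolean_query(phrases, operator="AND"):
    """Quote multi-word phrases and join all remaining terms with `operator`."""
    terms = []
    for phrase in phrases:
        # Drop surrounding whitespace and a trailing comma such as "hadoop,".
        phrase = phrase.strip().rstrip(",")
        if not phrase or phrase.lower() in BOOLEAN_WORDS:
            continue
        if " " in phrase:
            # Escape embedded double quotes before wrapping the phrase.
            escaped = phrase.replace('"', '\\"')
            terms.append('"' + escaped + '"')
        else:
            terms.append(phrase)
    return (" " + operator + " ").join(terms)


parsed = ("{machine learning} {and} {research and development} {Portland, OR} "
          "{software engineer} {AND} {hadoop,} {java}")
print(to_boolean_query(phrases_from_braces(parsed)))
```

Single terms are passed through unescaped here; a real integration would also need to escape any query-syntax characters they contain.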
> h2. *History:*
> This project was originally implemented by the search team at CareerBuilder
> in the summer of 2015 for use as part of their semantic search system. In the
> summer of 2016, Akash Mehta implemented a much simpler version as a proof of
> concept based upon publicly available information about the CareerBuilder
> implementation (the first attached patch). In July of 2018, CareerBuilder
> open sourced their original version
> ([https://github.com/careerbuilder/statistical-phrase-identifier|https://github.com/careerbuilder/statistical-phrase-identifier]),
> and agreed to also donate the code to the Apache Software Foundation as a
> Solr contribution. A Solr patch with the CareerBuilder version was added to
> this issue on September 5th, 2018, and community feedback and contributions
> are encouraged.
> This issue was originally titled the "Probabilistic Query Parser", but the
> name has now been updated to "Statistical Phrase Identifier" to avoid
> ambiguity with Solr's query parsers (per some of the feedback on this issue),
> as the implementation is actually just a mechanism for identifying phrases
> statistically from a string and is NOT a Solr query parser.
> h2. *Example usage:*
> h3. (See contrib readme or configuration files in the patch for full
> configuration details)
> h3. *{{Request:}}*
> {code:java}
> http://localhost:8983/solr/spi/parse?q=darth vader obi wan kenobi anakin
> skywalker toad x men magneto professor xavier{code}
> h3. *{{Response:}}*
> {code:java}
> {
> "responseHeader":{
> "status":0,
> "QTime":25},
> "top_parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker}
> {toad} {x men} {magneto} {professor xavier}",
> "top_parsed_phrases":[
> "darth vader",
> "obi wan kenobi",
> "anakin skywalker",
> "toad",
> "x-men",
> "magneto",
> "professor xavier"],
> "potential_parsings":[{
> "parsed_phrases":["darth vader",
> "obi wan kenobi",
> "anakin skywalker",
> "toad",
> "x-men",
> "magneto",
> "professor xavier"],
> "parsed_query":"{darth vader} {obi wan kenobi} {anakin skywalker}
> {toad} {x-men} {magneto} {professor xavier}",
> "score":0.0}]}{code}
>
>
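For anyone trying the request quoted above from a script, here is a minimal Python sketch. It assumes the handler is reachable at the `/solr/spi/parse` path from the example and that the response carries a `top_parsed_phrases` array as shown; the helper names are hypothetical. The one practical point it illustrates is that the `q` value must be URL-encoded, since the example URL contains raw spaces.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL copied from the example request in the issue description.
DEFAULT_BASE = "http://localhost:8983/solr/spi/parse"


def build_parse_url(text, base=DEFAULT_BASE):
    """Build the request URL, encoding spaces and punctuation in `text`."""
    return base + "?" + urlencode({"q": text})


def top_phrases(response_body):
    """Pull the top_parsed_phrases list out of a JSON response body."""
    return json.loads(response_body).get("top_parsed_phrases", [])


def fetch_top_phrases(text, base=DEFAULT_BASE):
    """Call the handler and return its top parse (needs a running Solr)."""
    with urlopen(build_parse_url(text, base)) as resp:
        return top_phrases(resp.read().decode("utf-8"))


print(build_parse_url("darth vader obi wan kenobi"))
```

`fetch_top_phrases` is defined but not called, so the sketch runs without a Solr instance.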