[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-07 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505143#comment-16505143
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit d7abebd7af83ea08d7bdea94077212ffd7c5efe1 in lucene-solr's branch 
refs/heads/master from [~ctargett]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d7abebd ]

SOLR-12376: Add some links to other Ref Guide pages; minor format & typo cleanup


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-07 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505147#comment-16505147
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit 7cff08c7a844f26bd292f8408b1aa3c6a8bec86f in lucene-solr's branch 
refs/heads/branch_7_4 from [~ctargett]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7cff08c ]

SOLR-12376: Add some links to other Ref Guide pages; minor format & typo cleanup


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-07 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505146#comment-16505146
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit c01287d7b34293d9ae7b0abcd1bf66334f9d5138 in lucene-solr's branch 
refs/heads/branch_7x from [~ctargett]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c01287d ]

SOLR-12376: Add some links to other Ref Guide pages; minor format & typo cleanup


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503744#comment-16503744
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit 33b1c1d1416ed3b8dbce4066ad4b982a15e1b0d0 in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=33b1c1d ]

SOLR-12376: AwaitsFix testStopWords pending LUCENE-8344

(cherry picked from commit 7c6d743)


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503740#comment-16503740
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit 7c6d74376a784224963b57cb8380a07279fd7608 in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7c6d743 ]

SOLR-12376: AwaitsFix testStopWords pending LUCENE-8344


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502252#comment-16502252
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit 39bec86593ae72ca7799c2e331ace7679e628dda in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=39bec86 ]

SOLR-12376: New TaggerRequestHandler (SolrTextTagger).

(cherry picked from commit cf63392)


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-06-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502219#comment-16502219
 ] 

ASF subversion and git services commented on SOLR-12376:


Commit cf63392183ffc96428fc4c52f546fec2cdf766d5 in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf63392 ]

SOLR-12376: New TaggerRequestHandler (SolrTextTagger).


> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-05-28 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492897#comment-16492897
 ] 

David Smiley commented on SOLR-12376:
-

Updated patch to use the new ConcatenateGraphFilterFactory (which is a WIP; not 
committed yet LUCENE-8332). CGFF supports synonyms and other filters producing 
stacked tokens at indexing time. This is _very_ useful for the tagger!
* I added a test for this -- testWDF to test that WordDelimiterGraphFilter 
works with catenation options.
* partial tagging (via shingling) is no longer easily supported so I commented 
this out. It has to do with difficulties in configuring the separator char 
(CGFF doesn't have this configurable). This feature is probably dubious any way.

Added docs, which was an amalgamation of the SolrTextTagger's existing README 
and QUICK_START files hand-edited/massaged some. I verified the tutorial 
instructions. I added a bin/post version of sending the CSV.  That was a bit of 
a pain to figure out.

At this point it's ready but pending LUCENE-8332.  

> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-05-21 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482693#comment-16482693
 ] 

David Smiley commented on SOLR-12376:
-

Updated patch that passes precommit; there were some little things addressed 
with this.

TODO docs.

> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch, SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12376) New TaggerRequestHandler (aka SolrTextTagger)

2018-05-21 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482529#comment-16482529
 ] 

David Smiley commented on SOLR-12376:
-

Patch:
* Copied into new package org.apache.solr.handler.tagger
* The source headers are retained from OpenSextant.  NOTICE.txt updated with 
legal mumbo-jumbo.  BTW IntelliJ annoyingly replaced the headers with the ASF 
one when I copied the files between projects (!) so I manually updated each 
one.  It didn't seem to honor the copyright feature settings to not update 
existing copyrights, at least not in this scenario.  Ugh.
* Removed the htmlOffsetAdjust option with supporting class & test.  I altered 
TaggerRequestHandler accordingly but made it possible via sub-class extension 
so that it could be added externally (though the change for this is a little 
clumsy).  I don't want to add additional dependencies (Jericho HTML Parser, 
ASLv2 licensed), _at least not at this time_.  And in retrospect I've wondered 
if the underlying feature here could be accomplished in a better way.
** Note that the xmlOffsetAdjust expressly depends on Woodstox, which is 
already included with Solr.
* Removed @author tags
* Copied the test config into test collection1 as solrconfig-tagger.xml and 
schema-tagger.xml
** Replaced the OpenSextant fully qualified package name of the handler with 
"solr.TaggerRequestHandler".
*** modified SolrResourceLoader.packages to include "handler.tagger." due to 
the sub-package
** Replaced the OpenSextant package name of the ConcatenateFilter to 
"solr.ConcatenateFilter" which now works.  (we depend on LUCENE-8323)
** Merged the TaggingAttribute test config into this config since it was easy 
to do and avoids bloating with yet another config
* Removed legacy support of configuration which allowed top level settings in 
the request handler as implied invariants.

TODO docs

> New TaggerRequestHandler (aka SolrTextTagger)
> -
>
> Key: SOLR-12376
> URL: https://issues.apache.org/jira/browse/SOLR-12376
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 7.4
>
> Attachments: SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text past to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query-understanding, and it's 
> been used on full-text, and it's been used on dictionaries that number tens 
> of millions in size. Since it's small and has been used a bunch (including 
> helping win an ERD competition and in [Apache 
> Stanbol|https://stanbol.apache.org/]), several people have asked me when or 
> why isn't this in Solr yet. So here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org