[ 
https://issues.apache.org/jira/browse/SOLR-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482529#comment-16482529
 ] 

David Smiley commented on SOLR-12376:
-------------------------------------

Patch:
* Copied into new package org.apache.solr.handler.tagger
* The source headers are retained from OpenSextant.  NOTICE.txt updated with 
legal mumbo-jumbo.  BTW IntelliJ annoyingly replaced the headers with the ASF 
one when I copied the files between projects (!) so I manually updated each 
one.  It didn't seem to honor the copyright feature settings to not update 
existing copyrights, at least not in this scenario.  Ugh.
* Removed the htmlOffsetAdjust option with supporting class & test.  I altered 
TaggerRequestHandler accordingly but made it possible via sub-class extension 
so that it could be added externally (though the change for this is a little 
clumsy).  I don't want to add additional dependencies (Jericho HTML Parser, 
ASLv2 licensed), _at least not at this time_.  And in retrospect I've wondered 
if the underlying feature here could be accomplished in a better way.
** Note that the xmlOffsetAdjust expressly depends on Woodstox, which is 
already included with Solr.
* Removed @author tags
* Copied the test config into test collection1 as solrconfig-tagger.xml and 
schema-tagger.xml
** Replaced the OpenSextant fully qualified package name of the handler with 
"solr.TaggerRequestHandler".
*** Modified SolrResourceLoader.packages to include "handler.tagger." due to 
the sub-package
** Replaced the OpenSextant package name of the ConcatenateFilter with 
"solr.ConcatenateFilter", which now works.  (We depend on LUCENE-8323.)
** Merged the TaggingAttribute test config into this config since it was easy 
to do and avoids bloating with yet another config
* Removed legacy configuration support that allowed top-level settings in 
the request handler to act as implied invariants.

TODO docs

> New TaggerRequestHandler (aka SolrTextTagger)
> ---------------------------------------------
>
>                 Key: SOLR-12376
>                 URL: https://issues.apache.org/jira/browse/SOLR-12376
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: SOLR-12376.patch
>
>
> This issue introduces a new RequestHandler: {{TaggerRequestHandler}}, AKA the 
> SolrTextTagger from the OpenSextant project 
> [https://github.com/OpenSextant/SolrTextTagger]. It's used for named entity 
> recognition (NER) of text passed to it. It doesn't do any NLP (outside of 
> Lucene text analysis) so it's said to be a "naive tagger", but it's 
> definitely useful as-is and a more complete NER or ERD (entity recognition 
> and disambiguation) system can be built with this as a key component. The 
> SolrTextTagger has been used on queries for query understanding, on 
> full text, and with dictionaries of tens of millions of entries. Since 
> it's small and has been used a bunch (including helping win an ERD 
> competition and in [Apache Stanbol|https://stanbol.apache.org/]), several 
> people have asked me when it will be in Solr, or why it isn't yet. So 
> here it is.
> To use it, first you need a collection of documents that have a name-like 
> field (short text) indexed with the ConcatenateFilter (LUCENE-8323) at the 
> end. We call this the dictionary. Once that's in place, you simply post text 
> to a {{TaggerRequestHandler}} and it returns the offset pairs into that text 
> for matches in the dictionary along with the uniqueKey of the matching 
> documents. It can also return other document data as desired. That's the gist; 
> I'll add more details on use to the Solr Reference Guide.
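
To make the "offset pairs plus uniqueKey" idea in the description concrete, here is a toy sketch in Python. It is not the SolrTextTagger implementation and does not call Solr: the function names, the regex word tokenizer with lowercasing (standing in for Lucene text analysis), and the longest-match, non-overlapping policy are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_dictionary(docs):
    """Index each name by its lowercased word tokens (hypothetical helper).

    docs maps a document id (the "uniqueKey") to its name-like field.
    """
    index = defaultdict(list)
    for doc_id, name in docs.items():
        tokens = tuple(t.lower() for t in re.findall(r"\w+", name))
        if tokens:
            index[tokens].append(doc_id)
    return index


def tag(text, index):
    """Return (start_offset, end_offset, ids) for dictionary names found in text.

    Assumed policy: at each position take the longest matching name and
    continue after it, so reported tags never overlap.
    """
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    max_len = max((len(key) for key in index), default=0)
    tags = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                tags.append((tokens[i][1], tokens[i + n - 1][2],
                             sorted(index[key])))
                i += n
                break
        else:
            i += 1  # no name starts at this token
    return tags


# Made-up dictionary and input text, for illustration only.
docs = {"1": "New York", "2": "New York City", "3": "Boston"}
index = build_dictionary(docs)
print(tag("I flew from Boston to New York City.", index))
# [(12, 18, ['3']), (22, 35, ['2'])]
```

Each tuple is a character offset pair into the posted text followed by the ids of the dictionary documents whose name matched there. "New York" is not reported separately in this sketch because the longer "New York City" wins at the same position.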



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

