Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "TikaAndNER" page has been changed by ThammeGowda:
https://wiki.apache.org/tika/TikaAndNER

Comment:
Named entity recognition 

New page:
=== Named Entity Recognition (NER) with Tika ===

Named Entity Recognition is supported in ''tika-parsers v1.12'' 
([[https://issues.apache.org/jira/browse/TIKA-1787|TIKA-1787]]). This page 
describes the steps required to configure and activate the 
[[https://github.com/apache/tika/pull/61/files#diff-3a416957f7629e40c8fe5e8fd6c577a3R57|NamedEntityParser]].

<<TableOfContents()>>

==== Activate Named Entity Parser ====
    Before configuring any NER implementation, the parser responsible for 
the recognition task, org.apache.tika.parser.ner.NamedEntityParser, needs to be 
enabled for the required MIME types. This can be done with a Tika config XML 
file, as follows:
{{{#!xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.ner.NamedEntityParser">
            <mime>text/plain</mime>
            <mime>text/html</mime>
            <mime>application/xhtml+xml</mime>
        </parser>
    </parsers>
</properties>
}}}
Depending on your environment, this configuration has to be supplied at a later 
stage (for example, through the --config option of the Tika App).
Note: The NamedEntityParser does not restrict MIME types itself; it uses Tika's 
auto-detect parser to read text content from non-text streams.
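
One way to keep this configuration at hand for the later steps is to write it to a file and pass that file to Tika when needed. The sketch below is a minimal shell helper; the function name write_tika_config and the default file name tika-config.xml are assumptions made for illustration and are not part of Tika.

```shell
# Hypothetical helper (not part of Tika): writes the NamedEntityParser
# configuration shown above to the file given as $1
# (defaults to ./tika-config.xml) and prints the file's path.
write_tika_config() {
    config_file="${1:-tika-config.xml}"
    cat > "$config_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.ner.NamedEntityParser">
            <mime>text/plain</mime>
            <mime>text/html</mime>
            <mime>application/xhtml+xml</mime>
        </parser>
    </parsers>
</properties>
EOF
    # Print the path so that callers can capture it
    printf '%s\n' "$config_file"
}
```

For example, write_tika_config "$HOME/tika/tika-config.xml" creates the file and prints its path, which can then be handed to the Tika App through the --config option.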


==== Using Apache OpenNLP NER ====
    The NE parser is configured to use an implementation based on 
[[https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.recognition.api|Apache OpenNLP]]. 
However, the NER models need to be added to Tika's classpath to make this work.

The following table lists the entity types, the classpath-relative path where 
each model file must be placed, and the URL to download it from.

||'''Entity Type'''||'''Path for model'''||'''URL to get'''||
|| PERSON || org/apache/tika/parser/ner/opennlp/ner-person.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin ||
|| LOCATION || org/apache/tika/parser/ner/opennlp/ner-location.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin ||
|| ORGANIZATION || org/apache/tika/parser/ner/opennlp/ner-organization.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin ||
|| DATE || org/apache/tika/parser/ner/opennlp/ner-date.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin ||
|| MONEY || org/apache/tika/parser/ner/opennlp/ner-money.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin ||
|| PERCENT || org/apache/tika/parser/ner/opennlp/ner-percentage.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin ||

Notes:
    1. You can use any combination of the models. If you are interested only in 
LOCATION names, download the LOCATION model and skip the others.
    2. NER models for other languages are also available at 
http://opennlp.sourceforge.net/models-1.5/ . To use a different language, 
substitute those URLs in the script below.
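
The naming scheme used in the table and notes above can be sketched as a few small shell functions. This is only an illustration: the function names model_url, model_path and print_downloads are hypothetical, and the assumption that models for other languages follow the same <lang>-ner-<type>.bin naming should be checked against the listing at the URL above.

```shell
# Base URL of the OpenNLP 1.5 models (taken from the table above)
URL_PREFIX="http://opennlp.sourceforge.net/models-1.5"
# Classpath-relative directory for the models (taken from the table above)
MODEL_DIR="org/apache/tika/parser/ner/opennlp"

# model_url LANG ENTITY: prints the download URL of a model, e.g. for
# "en" and "person". Assumes the <lang>-ner-<entity>.bin naming.
model_url() {
    printf '%s/%s-ner-%s.bin\n' "$URL_PREFIX" "$1" "$2"
}

# model_path ROOT ENTITY: prints where the model file should be placed
# under ROOT, a directory that is later added to the classpath.
model_path() {
    printf '%s/%s/ner-%s.bin\n' "$1" "$MODEL_DIR" "$2"
}

# print_downloads ROOT LANG ENTITY...: prints (does not run) one wget
# command per requested entity type.
print_downloads() {
    root="$1"; lang="$2"; shift 2
    for entity in "$@"; do
        printf 'wget "%s" -O "%s"\n' \
            "$(model_url "$lang" "$entity")" "$(model_path "$root" "$entity")"
    done
}
```

For instance, print_downloads "$HOME/tika/tika-ner-resources" en location prints a single wget command for the LOCATION model only. Printing the commands rather than running them makes it possible to review them first.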

===== Tika App + OpenNLP NER in action =====

{{{#!bash
# Create a directory for keeping all the models.
# Choose any convenient path, but make sure it is an absolute path.
export NER_RES="$HOME/tika/tika-ner-resources"
mkdir -p "$NER_RES"
cd "$NER_RES"

PATH_PREFIX="$NER_RES/org/apache/tika/parser/ner/opennlp"
URL_PREFIX="http://opennlp.sourceforge.net/models-1.5"

mkdir -p "$PATH_PREFIX"

# Using three entity types from the above table for demonstration
wget "$URL_PREFIX/en-ner-person.bin" -O "$PATH_PREFIX/ner-person.bin"
wget "$URL_PREFIX/en-ner-location.bin" -O "$PATH_PREFIX/ner-location.bin"
wget "$URL_PREFIX/en-ner-organization.bin" -O "$PATH_PREFIX/ner-organization.bin"

# Replace the placeholder with the path to your Tika App build
export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.12-SNAPSHOT.jar

# tika-config.xml is the config file shown earlier; it is looked up
# relative to the current directory, so adjust the path if needed
java -classpath "$NER_RES:$TIKA_APP" org.apache.tika.cli.TikaCLI \
    --config=tika-config.xml -m http://people.apache.org/committer-index.html

# Check the output: are there metadata keys starting with "NER_"?

}}}
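
To answer the question at the end of the script, the metadata output can be filtered for keys that start with "NER_". The sketch below assumes that the -m option prints one "key: value" pair per line; the function names ner_metadata and has_ner_metadata are hypothetical, and the sample input in the usage note is made up for illustration and is not real Tika output.

```shell
# Hypothetical helper: reads metadata lines on stdin and prints only
# those whose key starts with "NER_".
ner_metadata() {
    grep '^NER_' || true   # "|| true": finding no entities is not an error
}

# Hypothetical helper: succeeds if at least one key starting with
# "NER_" is present on stdin, and fails otherwise.
has_ner_metadata() {
    grep -q '^NER_'
}
```

As a quick check, printf 'Content-Type: text/html\nNER_PERSON: Jane Doe\n' | ner_metadata prints only the NER_PERSON line. In practice, pipe the output of the java command from the script above into ner_metadata instead of the made-up sample.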


==== Using Stanford CoreNLP NER ====
// TODO: Coming soon
==== Using Regular Expressions ====
// TODO: Coming soon
==== Creating a custom NER ====
// TODO: Coming soon
==== Chaining all the above at once ====
// TODO: Coming soon
