Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaAndNER" page has been changed by ThammeGowda:
https://wiki.apache.org/tika/TikaAndNER

Comment:
Named entity recognition

New page:
=== Named Entity Recognition (NER) with Tika ===

Named Entity Recognition is supported in ''tika-parsers v1.12'' ([[https://issues.apache.org/jira/browse/TIKA-1787|TIKA-1787]]).
This page describes the steps required to configure and activate the [[https://github.com/apache/tika/pull/61/files#diff-3a416957f7629e40c8fe5e8fd6c577a3R57|NamedEntityParser]].

<<TableOfContents()>>

==== Activate Named Entity Parser ====
Before configuring any NER implementation, org.apache.tika.parser.ner.NamedEntityParser, the parser responsible for the name recognition task, needs to be enabled for the required mime types. This can be done with a Tika Config XML file, as follows:

{{{#!xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ner.NamedEntityParser">
      <mime>text/plain</mime>
      <mime>text/html</mime>
      <mime>application/xhtml+xml</mime>
    </parser>
  </parsers>
</properties>
}}}

Depending on your environment, this configuration has to be supplied in the later steps.

Note: The NamedEntityParser does not restrict mime types; it uses Tika's auto-detect parser to read text content from non-text streams.

==== Using Apache OpenNLP NER ====
The NE Parser is configured to use an implementation based on [[https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.recognition.api|Apache OpenNLP]]. However, the NER models need to be added to Tika's classpath to make this work. The following table shows the entity types and the classpath locations at which to place the model files.

||'''Entity Type'''||'''Path for model'''||'''Download URL'''||
|| PERSON || org/apache/tika/parser/ner/opennlp/ner-person.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin ||
|| LOCATION || org/apache/tika/parser/ner/opennlp/ner-location.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin ||
|| ORGANIZATION || org/apache/tika/parser/ner/opennlp/ner-organization.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin ||
|| DATE || org/apache/tika/parser/ner/opennlp/ner-date.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin ||
|| MONEY || org/apache/tika/parser/ner/opennlp/ner-money.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin ||
|| PERCENT || org/apache/tika/parser/ner/opennlp/ner-percentage.bin || http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin ||

Notes:
 1. You can use any combination of the models. If you are interested only in LOCATION names, download the LOCATION model and skip the others.
 2. NER models for other languages are also available at http://opennlp.sourceforge.net/models-1.5/ . If you choose a different language, use those URLs in the script below.

===== Tika App + OpenNLP NER in action =====
{{{#!bash
# Create a directory for keeping all the models.
# Choose any convenient path, but make sure to use an absolute path.
export NER_RES=$HOME/tika/tika-ner-resources
mkdir -p "$NER_RES"
cd "$NER_RES"

PATH_PREFIX="$NER_RES/org/apache/tika/parser/ner/opennlp"
URL_PREFIX="http://opennlp.sourceforge.net/models-1.5"
mkdir -p "$PATH_PREFIX"

# Using three entity types from the above table for demonstration
wget "$URL_PREFIX/en-ner-person.bin" -O "$PATH_PREFIX/ner-person.bin"
wget "$URL_PREFIX/en-ner-location.bin" -O "$PATH_PREFIX/ner-location.bin"
wget "$URL_PREFIX/en-ner-organization.bin" -O "$PATH_PREFIX/ner-organization.bin"

export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.12-SNAPSHOT.jar

# tika-config.xml is the Tika Config XML shown earlier; adjust the path to where you saved it
java -classpath "$NER_RES:$TIKA_APP" org.apache.tika.cli.TikaCLI --config=tika-config.xml -m http://people.apache.org/committer-index.html

# Check the output: are there metadata keys starting with "NER_"?
}}}

==== Using Stanford CoreNLP NER ====
// TODO: Coming soon

==== Using Regular Expressions ====
// TODO: Coming soon

==== Creating a custom NER ====
// TODO: Coming soon

==== Chaining all the above at once ====
// TODO: Coming soon
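==== Checking the OpenNLP model layout ====
The OpenNLP section above depends on each model file sitting at the exact classpath location listed in its table, so a quick check before launching Tika can save a confusing run with no NER_ keys. The following is a minimal sketch of such a check, not part of Tika: the helper name check_ner_models is hypothetical, and it assumes the resource root has the same layout as the NER_RES directory created in the script above.

```shell
# Sketch only: check_ner_models is a hypothetical helper, not part of Tika.
# It lists which OpenNLP model files exist under a resource root such as $NER_RES.
check_ner_models() {
  ner_dir="$1/org/apache/tika/parser/ner/opennlp"
  ner_found=0
  # File names follow the "Path for model" column of the OpenNLP table.
  for ner_entity in person location organization date money percentage; do
    if [ -s "$ner_dir/ner-$ner_entity.bin" ]; then
      echo "found   ner-$ner_entity.bin"
      ner_found=$((ner_found + 1))
    else
      echo "missing ner-$ner_entity.bin"
    fi
  done
  # Any combination of models is allowed, so fail only when none is present.
  [ "$ner_found" -gt 0 ]
}
```

Calling check_ner_models "$NER_RES" after the downloads prints one line per entity type and returns a non-zero status only when no model file is found. A "missing" line is not an error by itself, since any subset of the models may be used.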
