Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaAndNER" page has been changed by ThammeGowda:
https://wiki.apache.org/tika/TikaAndNER?action=diff&rev1=2&rev2=3

Comment:
Added CoreNLP and Regex documentation

  <<TableOfContents()>>

  ==== Activate Named Entity Parser ====
- Before moving ahead to configure NER implementations, org.apache.tika.parser.ner.NamedEntityParser, the parser responsible for handling the name recognition task needs to be enabled for to handle required mime types. This can be done with Tika Config XML file, as follows:
+ Before moving ahead to configure NER implementations, ''org.apache.tika.parser.ner.NamedEntityParser'', the parser responsible for handling the name recognition task, needs to be enabled. This can be done with the Tika Config XML file, as follows:

  {{{#!xml
  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
@@ -65, +65 @@

  java -classpath $NER_RES:$TIKA_APP org.apache.tika.cli.TikaCLI --config=tika-config.xml -m http://people.apache.org/committer-index.html
- # Are there metadata with keys starting with "NER_" ?
+ # Are there any metadata keys starting with "NER_" ?
  }}}

  ==== Using Stanford CoreNLP NER ====
- // TODO: Coming soon
+ The 'org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser' class provides runtime bindings to [[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford CoreNLP CRF classifiers]] for named entity recognition.
+
+ The following steps are necessary to use this NER implementation:
+  * Add the CoreNLP library and its dependencies to the classpath
+  * Add the models to the classpath
+  * Set the NER implementation to CoreNLP
+
+ ===== Tika + CoreNLP in action =====
+ {{{#!
+
+ cd $HOME/src
+ git clone https://github.com/thammegowda/tika-ner-corenlp.git
+ cd tika-ner-corenlp
+ mvn clean compile package assembly:single -PtikaAddon
+
+ # This should produce target/tika-ner-corenlp-addon-*-jar-with-dependencies.jar
+ export CORE_NLP_JAR=`find $PWD/target/tika-ner-corenlp-addon-*jar-with-dependencies.jar`
+
+ export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.12-SNAPSHOT.jar
+
+ java -Dner.impl.class=org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser \
+      -classpath $TIKA_APP:$CORE_NLP_JAR org.apache.tika.cli.TikaCLI \
+      --config=tika-config.xml -m http://www.hawking.org.uk
+
+ # Observe metadata keys starting with NER_
+
+ # To use the 3-class NER model (the default is the 7-class model)
+
+ java -Dner.corenlp.model=edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz \
+      -Dner.impl.class=org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser \
+      -classpath $TIKA_APP:$CORE_NLP_JAR org.apache.tika.cli.TikaCLI \
+      --config=tika-config.xml -m http://www.hawking.org.uk
+ }}}
+
+
+ The CoreNLP CRF classifier recognised the following entities within the text content of the http://www.hawking.org.uk page:
+ {{{
+ NER_DATE: 2009
+ NER_DATE: 1963
+ NER_DATE: 1663
+ NER_DATE: 1982
+ NER_DATE: 1979
+ NER_LOCATION: Gonville
+ NER_LOCATION: Einstein
+ NER_LOCATION: London
+ NER_LOCATION: Cambridge
+ NER_LOCATION: Santa Cruz
+ NER_ORGANIZATION: Leiden University
+ NER_ORGANIZATION: NASA
+ NER_ORGANIZATION: CBE
+ NER_ORGANIZATION: Brief History of Time
+ NER_ORGANIZATION: University of California
+ NER_ORGANIZATION: Cambridge Lectures Publications Books Images Films
+ NER_ORGANIZATION: Caius College
+ NER_ORGANIZATION: Royal Society
+ NER_ORGANIZATION: About Stephen The Computer Stephen
+ NER_ORGANIZATION: US National Academy of Science
+ NER_ORGANIZATION: Department of Applied Mathematics
+ NER_ORGANIZATION: ESA
+ NER_ORGANIZATION: The Universe
+ NER_ORGANIZATION: Sally Tsui Wong-Avery Director of Research
+ NER_ORGANIZATION: the University of Cambridge
+ NER_ORGANIZATION: Theoretical Physics
+ NER_ORGANIZATION: Baby Universe
+ NER_PERSON: Einstein
+ NER_PERSON: P. Oesch
+ NER_PERSON: R. Bouwens
+ NER_PERSON: George
+ NER_PERSON: Stephen Hawking
+ NER_PERSON: Isaac Newton
+ NER_PERSON: D. Magee
+ NER_PERSON: Annie
+ NER_PERSON: G. Illingworth
+ NER_PERSON: Stephen
+ NER_PERSON: Dennis Stanton Avery
+ }}}
+
+
+
  ==== Using Regular Expressions ====
- // TODO: Coming Soon
+ The '''org.apache.tika.parser.ner.regex.RegexNERecogniser''' implementation is based on regular expressions. The following steps are required to use this implementation:
+
+  * Configure regular expressions in 'org/apache/tika/parser/ner/regex/ner-regex.txt'
+  * Set ``ner.impl.class`` to the Regex implementation
+
+ ===== Tika + RegexNER in action =====
+ {{{
+ # Create a regex file and add it to the classpath
+ export NER_RES=$HOME/tika/tika-ner-resources
+ mkdir -p $NER_RES
+ cd $NER_RES
+ mkdir -p org/apache/tika/parser/ner/regex/
+
+ # A quoted heredoc stops the shell from interpreting the backtick, $ and ! inside the patterns
+ cat > org/apache/tika/parser/ner/regex/ner-regex.txt <<'EOF'
+ PHONE_NUMBER=((\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})
+ EMAIL=([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?))
+ EOF
+
+
+ java -Dner.impl.class=org.apache.tika.parser.ner.regex.RegexNERecogniser \
+      -classpath $NER_RES:$TIKA_APP org.apache.tika.cli.TikaCLI \
+      --config=tika-config.xml -m http://www.cs.usc.edu/faculty_staff/faculty
+
+ # Observe values of keys NER_PHONE_NUMBER and NER_EMAIL
+
+ }}}
+
  ==== Creating a custom NER ====
- // TODO: Coming soon
+
+  * Create a class that implements ''org.apache.tika.parser.ner.NERecogniser''
+  * Set the class name as the value of the system property ''ner.impl.class'', as with Regex or CoreNLP
+
  ==== Chaining all the above at once ====
- // TODO: Coming soon
+ Multiple implementations can be chained by setting the system property ''ner.impl.class'' to a comma-separated list of class names.
+ Example: ''-Dner.impl.class=org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser,org.apache.tika.parser.ner.regex.RegexNERecogniser''
+
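The chaining step above can be sketched in shell as follows. This is only a minimal illustration: the `join_by_comma` helper and the `NER_IMPLS` variable are hypothetical names introduced here and are not part of Tika. The two class names are the ones from the example in the diff. The script prints the resulting command instead of running it, because the classpath depends on the local setup.

```shell
#!/bin/sh
# join_by_comma is a hypothetical helper (not part of Tika): it joins its
# arguments with commas so the result can be passed to -Dner.impl.class.
join_by_comma() {
    result=""
    for name in "$@"; do
        if [ -z "$result" ]; then
            result="$name"
        else
            result="$result,$name"
        fi
    done
    printf '%s\n' "$result"
}

# NER_IMPLS is an illustrative variable name; the class names come from the example above.
NER_IMPLS=$(join_by_comma \
    org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser \
    org.apache.tika.parser.ner.regex.RegexNERecogniser)

# Print the command rather than executing it: NER_RES and TIKA_APP must
# already point at the resources directory and the tika-app jar.
echo "java -Dner.impl.class=$NER_IMPLS -classpath \$NER_RES:\$TIKA_APP org.apache.tika.cli.TikaCLI --config=tika-config.xml -m <url>"
```

Keeping the list in one variable makes it easy to add or drop an implementation without retyping the whole property value.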

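For the regex file created in the RegexNER section, the relationship between file entries and metadata keys can be sketched as below. It assumes, based on the `PHONE_NUMBER=` and `EMAIL=` entries and the `NER_PHONE_NUMBER` / `NER_EMAIL` keys mentioned in the diff, that each line has the form `TYPE=regex` and yields the key `NER_TYPE`. The `list_ner_keys` helper is a hypothetical name, and skipping blank lines and `#` comment lines is an assumption rather than documented behaviour.

```shell
#!/bin/sh
# list_ner_keys is a hypothetical helper (not part of Tika). For every
# TYPE=regex line in the given file it prints NER_TYPE, the metadata key
# one would expect to see in the TikaCLI -m output.
list_ner_keys() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*)
                # Assumed convention: ignore blank lines and comment lines.
                continue ;;
            *=*)
                # Split at the FIRST '=' only, since the regex itself may contain '='.
                printf 'NER_%s\n' "${line%%=*}" ;;
            *)
                printf 'malformed line: %s\n' "$line" >&2
                return 1 ;;
        esac
    done < "$1"
}

# Usage sketch with a throwaway file; the quoted heredoc keeps the patterns literal.
sample=$(mktemp)
cat > "$sample" <<'EOF'
PHONE_NUMBER=((\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})
EMAIL=([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+)
EOF
list_ner_keys "$sample"
rm -f "$sample"
```

Running a check like this before starting Tika catches a line without `=` early, which is cheaper than discovering a missing `NER_` key in the output. The EMAIL pattern in the sample is shortened for readability and is not the one from the diff.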