The "TikaAndNER" page has been changed by ThammeGowda:
https://wiki.apache.org/tika/TikaAndNER?action=diff&rev1=2&rev2=3

Comment:
Added CoreNLP and Regex documentation 

  <<TableOfContents()>>
  
  ==== Activate Named Entity Parser ====
-     Before moving ahead to configure NER implementations, 
org.apache.tika.parser.ner.NamedEntityParser, the parser responsible for 
handling the name recognition task needs to be enabled for to handle required 
mime types. This can be done with Tika Config XML file, as follows:
+     Before configuring NER implementations, ''org.apache.tika.parser.ner.NamedEntityParser'', the parser responsible for the name recognition task, needs to be enabled. This can be done with a Tika Config XML file, as follows:
  {{{#!xml
  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
@@ -65, +65 @@

  
  java -classpath $NER_RES:$TIKA_APP org.apache.tika.cli.TikaCLI 
--config=tika-config.xml -m http://people.apache.org/committer-index.html
  
- # Are there metadata with keys starting with "NER_" ?
+ # Are there any metadata keys starting with "NER_" ?
  
  }}}
  
  
  ==== Using Stanford CoreNLP NER ====
- // TODO: Coming soon
+    The 'org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser' class 
provides runtime bindings to 
[[http://nlp.stanford.edu/software/CRF-NER.shtml|Stanford CoreNLP CRF 
classifiers]] for named entity recognition.
+ 
+ The following steps are necessary to use this NER implementation:
+  * Add the CoreNLP library and its dependencies to the classpath
+  * Add the models to the classpath
+  * Set the NER implementation to CoreNLP
+ 
+ ===== Tika + CoreNLP in action =====
+ {{{#!
+ 
+ cd $HOME/src
+ git clone https://github.com/thammegowda/tika-ner-corenlp.git
+ cd tika-ner-corenlp
+ mvn clean compile package assembly:single -PtikaAddon
+ 
+ #this should produce target/tika-ner-corenlp-addon-*-jar-with-dependencies.jar
+ export CORE_NLP_JAR=`find $PWD/target/tika-ner-corenlp-addon-*jar-with-dependencies.jar`
+ 
+ export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.12-SNAPSHOT.jar
+ 
+ java -Dner.impl.class=org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser \
+       -classpath $TIKA_APP:$CORE_NLP_JAR org.apache.tika.cli.TikaCLI \
+       --config=tika-config.xml -m http://www.hawking.org.uk
+ 
+ # Observe metadata keys starting with NER_ 
+ 
+ # To use the 3-class NER model (the default is the 7-class model)
+ 
+  java -Dner.corenlp.model=edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz \
+        -Dner.impl.class=org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser \
+        -classpath $TIKA_APP:$CORE_NLP_JAR org.apache.tika.cli.TikaCLI \
+        --config=tika-config.xml -m http://www.hawking.org.uk
+ }}}
+ 
+ 
+ The CoreNLP CRF classifier recognised the following entities within the text content of the http://www.hawking.org.uk page:
+ {{{
+ NER_DATE: 2009
+ NER_DATE: 1963
+ NER_DATE: 1663
+ NER_DATE: 1982
+ NER_DATE: 1979
+ NER_LOCATION: Gonville
+ NER_LOCATION: Einstein
+ NER_LOCATION: London
+ NER_LOCATION: Cambridge
+ NER_LOCATION: Santa Cruz
+ NER_ORGANIZATION: Leiden University
+ NER_ORGANIZATION: NASA
+ NER_ORGANIZATION: CBE
+ NER_ORGANIZATION: Brief History of Time
+ NER_ORGANIZATION: University of California
+ NER_ORGANIZATION: Cambridge Lectures Publications Books Images Films
+ NER_ORGANIZATION: Caius College
+ NER_ORGANIZATION: Royal Society
+ NER_ORGANIZATION: About Stephen The Computer Stephen
+ NER_ORGANIZATION: US National Academy of Science
+ NER_ORGANIZATION: Department of Applied Mathematics
+ NER_ORGANIZATION: ESA
+ NER_ORGANIZATION: The Universe
+ NER_ORGANIZATION: Sally Tsui Wong-Avery Director of Research
+ NER_ORGANIZATION: the University of Cambridge
+ NER_ORGANIZATION: Theoretical Physics
+ NER_ORGANIZATION: Baby Universe
+ NER_PERSON: Einstein
+ NER_PERSON: P. Oesch
+ NER_PERSON: R. Bouwens
+ NER_PERSON: George
+ NER_PERSON: Stephen Hawking
+ NER_PERSON: Isaac Newton
+ NER_PERSON: D. Magee
+ NER_PERSON: Annie
+ NER_PERSON: G. Illingworth
+ NER_PERSON: Stephen
+ NER_PERSON: Dennis Stanton Avery
+ }}}
+ 
+ 
+ 
  ==== Using Regular Expressions ====
- // TODO: Coming Soon
+ The '''org.apache.tika.parser.ner.regex.RegexNERecogniser''' class is an NER implementation based on regular expressions. The following steps are required to use it:
+ 
+  * Configure regular expressions in 'org/apache/tika/parser/ner/regex/ner-regex.txt' and add that file to the classpath
+  * Set ``ner.impl.class`` to the Regex implementation
+ 
+ ===== Tika + RegexNER in action =====
+ {{{
+ # Create a regex file and add it to classpath
+ export NER_RES=$HOME/tika/tika-ner-resources
+ mkdir -p $NER_RES
+ cd $NER_RES
+ mkdir -p org/apache/tika/parser/ner/regex/
+ 
+ # A quoted heredoc stops the shell from expanding $, ` and \ inside the patterns
+ cat > org/apache/tika/parser/ner/regex/ner-regex.txt <<'EOF'
+ PHONE_NUMBER=((\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})
+ EMAIL=([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?))
+ EOF
+ 
+ 
+ java -Dner.impl.class=org.apache.tika.parser.ner.regex.RegexNERecogniser \
+     -classpath $NER_RES:$TIKA_APP org.apache.tika.cli.TikaCLI \
+     --config=tika-config.xml -m  http://www.cs.usc.edu/faculty_staff/faculty
+ 
+ # Observe values of keys NER_PHONE_NUMBER and NER_EMAIL
+ 
+ }}}
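Because a malformed line in the regex file is easy to miss, it can help to sanity-check the file before starting Tika. The following is a minimal sketch, not part of Tika: `check_ner_regex_file` is a hypothetical helper, and it assumes that each meaningful line has the form `ENTITY_TYPE=regex` (as in the example above) and that blank lines and lines starting with `#` may be ignored.

```shell
# Hypothetical helper: flag lines that are not of the form NAME=PATTERN.
# Assumption: blank lines and lines starting with "#" are ignorable.
check_ner_regex_file() {
    awk '
        BEGIN { bad = 0 }
        /^[[:space:]]*$/ || /^[[:space:]]*#/ { next }   # skip blanks and assumed comments
        !/^[A-Za-z_][A-Za-z0-9_]*=.+/ {                  # expect ENTITY_TYPE=regex
            printf "line %d is not NAME=PATTERN: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage: write a sample file with a quoted heredoc and check it.
sample=$(mktemp)
cat > "$sample" <<'EOF'
PHONE_NUMBER=((\+\d{1,2}\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})
EOF
check_ner_regex_file "$sample" && echo "regex file looks well-formed"
```

The check only looks at the line structure; it does not compile the patterns, so an invalid Java regex would still have to be caught by Tika itself.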
+ 
+ 
  ==== Creating a custom NER ====
- // TODO: Coming soon
+ 
+  * Create a class that implements ''org.apache.tika.parser.ner.NERecogniser''
+  * Set the class name as the value of the system property ''ner.impl.class'', as with Regex or CoreNLP
+ 
+ 
  ==== Chaining all the above at once ====
- // TODO: Coming soon
  
+  Multiple implementations can be chained by setting the system property ''ner.impl.class'' to a comma-separated list of class names.
+ Example: ''-Dner.impl.class=org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser,org.apache.tika.parser.ner.regex.RegexNERecogniser''
+ 
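As a rough sketch of how the chained value could be assembled in a script, the snippet below joins class names with commas and prints the resulting command line. `join_ner_impls` is a hypothetical helper, and the classpath and URL are simply reused from the earlier examples on this page; the command is printed rather than executed because the jar locations are site-specific.

```shell
# Hypothetical helper: join its arguments with commas.
# The body runs in a subshell, so the IFS change does not leak out.
join_ner_impls() (
    IFS=,
    printf '%s\n' "$*"
)

NER_IMPLS=$(join_ner_impls \
    org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser \
    org.apache.tika.parser.ner.regex.RegexNERecogniser)

# Print the command instead of running it; $NER_RES and $TIKA_APP are
# left unexpanded and are assumed to be exported as in the earlier steps.
echo "java -Dner.impl.class=$NER_IMPLS" \
     "-classpath \$NER_RES:\$TIKA_APP org.apache.tika.cli.TikaCLI" \
     "--config=tika-config.xml -m http://www.cs.usc.edu/faculty_staff/faculty"
```

Building the list in one place keeps the `-D` value free of stray spaces, which would otherwise split it into separate shell arguments.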
