Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "GeoTopicParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/GeoTopicParser?action=diff&rev1=2&rev2=3

Comment:
- final docs on TikaGeoTopicParser for v1

  
  GeoTopicParser uses [[http://lucene.apache.org/|Apache Lucene]] and 
[[http://opennlp.apache.org/|Apache OpenNLP]] to provide its capabilities.
  
- = Installing the Lucene Gazetteer =
+ == Installing the Lucene Gazetteer ==
  
  First you will need to download the 
[[http://github.com/chrismattmann/lucene-geo-gazetteer|Lucene Geo Gazetteer]] 
project and to install it. You can do so by:
  
@@ -32, +32 @@

                                         Gazetteer for
  }}}
  
- You will now need to build a Gazetteer using the Geonames.org dataset. 
Instructions are provided below:
+ You will now need to build a Gazetteer using the Geonames.org dataset. 
Instructions are provided below. Note that you will need at least 1.2 GB of 
disk space to build the Lucene index for the Gazetteer.
  
  {{{
  $ cd $HOME/src/lucene-geo-gazetteer
@@ -45, +45 @@

  
  {{{
  $ lucene-geo-gazetteer -s Pasadena Texas
+ [
+ {"Texas" : [
+ "Texas",
+ "-91.92139",
+ "18.05333"
+ ]},
+ {"Pasadena" : [
+ "Pasadena",
+ "-74.06446",
+ "4.6964"
+ ]}
+ ]
  }}}
  
- Note that we used the convenience script `lucene-geo-gazetteer` which assumes 
that you created an indexed named geoIndex in the 
$HOME/src/lucene-geo-gazetter/geoIndex directory. We could have also used the 
pure Java command line to search.
+ Note that we used the convenience script `lucene-geo-gazetteer`, which assumes 
that you created an index named geoIndex in the 
$HOME/src/lucene-geo-gazetteer/geoIndex directory. We could have also used the 
pure Java command line to search. The Gazetteer returns a JSON list of JSON 
objects, each mapping a key to a JSON list. The key is the location name 
given, and the list holds the closest match (by edit distance) in the 
Gazetteer for that name, followed by the coordinates of that location. In the 
sample above the coordinates appear in longitude, latitude order (-91.92139 is 
outside the valid range for a latitude).
  
+ == Installing and downloading an NER model ==
+ 
+ The next thing you'll need is a Named Entity Recognition (NER) model for 
places. The GeoTopicParser uses Apache OpenNLP, which as of version 1.5 
provides pre-trained models for location names in text data. You can 
download the 
[[http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin|en-ner-location.bin]]
 file, pre-trained by the OpenNLP folks. Note that OpenNLP's default name 
finder is not accurate, so building your own NER location model is highly 
recommended. To do so, please follow 
[[http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training|these
 instructions]]. 
+ 
+ The model needs to be placed on the classpath for your Tika installation in 
the following directory:
+ 
+ {{{org/apache/tika/parser/geo/topic}}}
+ 
+ The following instructions show how to download the model and place it on the 
right path:
+ 
+ {{{
+ $ mkdir $HOME/src/location-ner-model && cd $HOME/src/location-ner-model
+ $ curl -O http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
+ $ mkdir -p org/apache/tika/parser/geo/topic
+ $ mv en-ner-location.bin org/apache/tika/parser/geo/topic
+ }}}
+ 
+ == Test out the GeoTopicParser ==
+ 
+ Now you can run Tika and try out the GeoTopicParser. Because it is currently 
a Parser and not a Content-Handler (one will hopefully be developed later), the 
parser is mapped to the MIME type application/geotopic, which is a sub-class of 
text/plain. Trying the parser out therefore takes two steps.
+ 
+  1. Create a .geot file. You can use this sample 
[[http://github.com/chrismattmann/geotopicparser-utils/polar.geot|file]] from 
the [[http://github.com/chrismattmann/trec-dd-polar|NSF Polar data contributed 
to TREC]].
+  2. Tell Tika about the application/geotopic MIME type. You can download this 
[[http://github.com/chrismattmann/geotopicparser-utils/custom-mimetypes.xml|file]]
 and place it on the classpath in the `org/apache/tika/mime` directory, e.g., 
by doing:{{{
+ $ mkdir $HOME/src/geotopic-mime && cd $HOME/src/geotopic-mime
+ $ mkdir -p org/apache/tika/mime
+ $ curl -O http://github.com/chrismattmann/geotopicparser-utils/custom-mimetypes.xml
+ $ mv custom-mimetypes.xml org/apache/tika/mime
+ }}}
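Before running Tika, it can help to confirm that both downloads ended up where the classpath lookup expects them. The following is a minimal sketch, not part of Tika: the function name `check_geotopic_layout` is hypothetical, and it assumes the two classpath roots used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Tika): verify that the two
# classpath roots contain the files at the package paths described above.
#   $1 = NER model root, e.g. $HOME/src/location-ner-model
#   $2 = MIME type root, e.g. $HOME/src/geotopic-mime
check_geotopic_layout() {
    model_file="$1/org/apache/tika/parser/geo/topic/en-ner-location.bin"
    mime_file="$2/org/apache/tika/mime/custom-mimetypes.xml"
    status=0
    for f in "$model_file" "$mime_file"; do
        if [ -f "$f" ]; then
            echo "ok: $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    # Non-zero exit status if either file is absent.
    return "$status"
}
```

For example, `check_geotopic_layout "$HOME/src/location-ner-model" "$HOME/src/geotopic-mime"` prints one line per file and returns non-zero if either is missing.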
+ 
+ With those files in place, let's run the GeoTopicParser with Tika-App:
+ 
+ {{{
+ $ java -classpath tika-app-1.9-SNAPSHOT.jar:$HOME/src/location-ner-model:$HOME/src/geotopic-mime org.apache.tika.cli.TikaCLI -m polar.geot
+ }}}
+ 
+ This should output:
+ 
+ {{{
+ Content-Length: 881
+ Content-Type: application/geotopic
+ Geographic_LATITUDE: 27.33931
+ Geographic_LONGITUDE: -108.60288
+ Geographic_NAME: China
+ Optional_LATITUDE1: 39.76
+ Optional_LONGITUDE1: -98.5
+ Optional_NAME1: United States
+ X-Parsed-By: org.apache.tika.parser.DefaultParser
+ X-Parsed-By: org.apache.tika.parser.geo.topic.GeoParser
+ resourceName: polar.geot
+ }}}
+ 
+ The output consists of 3-tuples of `{Name, Latitude, Longitude}`. The *best* 
match for the location is the one that occurs most frequently in the text; it 
is provided as `Geographic_NAME`, along with its corresponding 
`Geographic_LATITUDE` and `Geographic_LONGITUDE`. Other places identified as 
entities by the NER model in the provided text are listed as 
`Optional_NAME*N*`, e.g., `Optional_NAME1` for the 1st alternative location 
identified, with its corresponding `Optional_LATITUDE1` and `Optional_LONGITUDE1`.
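To make the tuple structure concrete, here is a minimal sketch that regroups the metadata lines into one row per location. The function name `geotopic_tuples` is hypothetical, and the sketch assumes the `Key: value` line format shown in the sample output above. It reads metadata on standard input and prints `name<TAB>latitude<TAB>longitude`, with the best match first.

```shell
#!/bin/sh
# Hypothetical helper: regroup "Key: value" metadata lines (as printed by
# TikaCLI -m in the sample above) into one tab-separated row per location.
geotopic_tuples() {
    awk '
    {
        sep = index($0, ": ")
        if (sep == 0) next
        key = substr($0, 1, sep - 1)
        val = substr($0, sep + 2)
        if (key ~ /^Geographic_(NAME|LATITUDE|LONGITUDE)$/) {
            idx = 0                                  # best match is row 0
            field = substr(key, 12)                  # drop "Geographic_"
        } else if (key ~ /^Optional_(NAME|LATITUDE|LONGITUDE)[0-9]+$/) {
            match(key, /[0-9]+$/)
            idx = substr(key, RSTART) + 0            # trailing number N
            field = substr(key, 10, RSTART - 10)     # drop "Optional_" and N
        } else {
            next                                     # unrelated metadata key
        }
        data[idx, field] = val
        if (idx > max) max = idx
    }
    END {
        for (i = 0; i <= max; i++)
            if ((i, "NAME") in data)
                printf "%s\t%s\t%s\n", data[i, "NAME"], data[i, "LATITUDE"], data[i, "LONGITUDE"]
    }'
}
```

One way to use it is to pipe the Tika-App command shown earlier into the function, e.g. `java ... org.apache.tika.cli.TikaCLI -m polar.geot | geotopic_tuples`.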
+ 
+ == Will this work from Tika Server? ==
+ 
+ It sure will! When you start Tika Server, make sure that the NER model file 
and the custom MIME type file are on your classpath, and that 
lucene-geo-gazetteer is on the `$PATH` where Tika-Server is started. You can 
then post all the .geot files that you'd like, and Tika-Server will happily 
call the GeoTopicParser to provide you with location information.
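Since the parser depends on the gazetteer script being reachable from the environment that launches the server, a quick preflight check can catch a missing `$PATH` entry early. This is a sketch only; the function name `gazetteer_on_path` is hypothetical and nothing here is part of Tika-Server.

```shell
#!/bin/sh
# Hypothetical preflight check: confirm that the lucene-geo-gazetteer
# script resolves via $PATH in the shell that will start Tika-Server.
gazetteer_on_path() {
    if command -v lucene-geo-gazetteer >/dev/null 2>&1; then
        echo "found: $(command -v lucene-geo-gazetteer)"
    else
        echo "lucene-geo-gazetteer is not on PATH" >&2
        return 1
    fi
}
```

Running `gazetteer_on_path` in the same shell, just before starting the server, reports where the script was found or fails with a non-zero status.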
+ 
