Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "GeoTopicParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/GeoTopicParser?action=diff&rev1=2&rev2=3

Comment:
- final docs on TikaGeoTopicParser for v1

  GeoTopicParser uses [[http://lucene.apache.org/|Apache Lucene]] and [[http://opennlp.apache.org/|Apache OpenNLP]] to provide its capabilities.

- = Installing the Lucene Gazetteer =
+ == Installing the Lucene Gazetteer ==
  First you will need to download the [[http://github.com/chrismattmann/lucene-geo-gazetteer|Lucene Geo Gazetteer]] project and to install it. You can do so by:

@@ -32, +32 @@

  Gazetteer for
  }}}
- You will now need to build a Gazetteer using the Geonames.org dataset. Instructions are provided below:
+ You will now need to build a Gazetteer using the Geonames.org dataset. Instructions are provided below. Note that you will need at least 1.2 GB of disk space to build the Lucene index for the Gazetteer.
  {{{
  $ cd $HOME/src/lucene-geo-gazetteer
@@ -45, +45 @@

  {{{
  $ lucene-geo-gazetteer -s Pasadena Texas
+ [
+ {"Texas" : [
+ "Texas",
+ "-91.92139",
+ "18.05333"
+ ]},
+ {"Pasadena" : [
+ "Pasadena",
+ "-74.06446",
+ "4.6964"
+ ]}
+ ]
  }}}
- Note that we used the convenience script `lucene-geo-gazetteer` which assumes that you created an indexed named geoIndex in the $HOME/src/lucene-geo-gazetter/geoIndex directory. We could have also used the pure Java command line to search.
+ Note that we used the convenience script `lucene-geo-gazetteer`, which assumes that you created an index named geoIndex in the $HOME/src/lucene-geo-gazetteer/geoIndex directory. We could also have used the pure Java command line to search. The return from the Gazetteer is a JSON List of JSON Objects, each of which is a key->JSON List map. The key is the location name given, and the List holds the closest match (by Edit Distance) in the Gazetteer for that name, followed by the coordinates of that location. In the sample above the longitude comes first and the latitude second, since -91.92139 is outside the valid latitude range.
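
As a rough illustration of the return structure described above, the following Python sketch parses the sample response into a dictionary keyed by the name that was searched for. It is not part of the gazetteer project: the function name `parse_gazetteer_output` is hypothetical, and it assumes that each list holds a single match made of exactly three strings. It also assumes that the two numbers are longitude then latitude, which is inferred only from the sample (-91.92139 lies outside the valid latitude range of -90 to 90).

```python
import json

# Sample response copied from the `lucene-geo-gazetteer -s Pasadena Texas` run.
SAMPLE = """
[
  {"Texas": ["Texas", "-91.92139", "18.05333"]},
  {"Pasadena": ["Pasadena", "-74.06446", "4.6964"]}
]
"""


def parse_gazetteer_output(text):
    """Map each searched name to (matched_name, longitude, latitude).

    Assumption: every list has the shape [name, longitude, latitude],
    with the coordinates encoded as strings, as in the sample.
    """
    results = {}
    for entry in json.loads(text):          # the outer JSON List
        for query, match in entry.items():  # each object has one key
            name, lon, lat = match[:3]
            results[query] = (name, float(lon), float(lat))
    return results


if __name__ == "__main__":
    for query, (name, lon, lat) in parse_gazetteer_output(SAMPLE).items():
        print(f"{query}: matched {name} at lat={lat}, lon={lon}")
```

If the real tool returns more than one match per name, the slicing in `match[:3]` would silently drop the extras, so treat this only as a starting point.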
+ == Installing and downloading an NER model ==
+
+ The next thing you'll need is a Named Entity Recognition model for places. The GeoTopicParser uses Apache OpenNLP, and as of its 1.5 version OpenNLP provides pre-trained models for location names in text data. You can download the [[http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin|en-ner-location.bin]] file pre-trained by the OpenNLP folks. One thing to note is that OpenNLP's default name finder is not accurate, so building your own NER location model is highly recommended. In this case, please follow [[http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training|these instructions]].
+
+ The model needs to be placed on the classpath for your Tika installation in the following directory:
+
+ {{{org/apache/tika/parser/geo/topic}}}
+
+ The following instructions show how to download the model and place it on the right path:
+
+ {{{
+ $ mkdir $HOME/src/location-ner-model && cd $HOME/src/location-ner-model
+ $ curl -O http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
+ $ mkdir -p org/apache/tika/parser/geo/topic
+ $ mv en-ner-location.bin org/apache/tika/parser/geo/topic
+ }}}
+
+ == Test out the GeoTopicParser ==
+
+ Now you can run Tika and try out the GeoTopicParser. At the moment, since it is a Parser and not a Content-Handler (hopefully that will be developed later), the parser is mapped to the MIME type application/geotopic, which is a sub-class of text/plain. So, there are two steps to try the parser out now.
+
+  1. Create a .geot file; you can use this sample [[http://github.com/chrismattmann/geotopicparser-utils/polar.geot|file]] from the [[http://github.com/chrismattmann/trec-dd-polar|NSF Polar data contributed to TREC]].
+  2. Tell Tika about the application/geotopic MIME type. 
You can download this [[http://github.com/chrismattmann/geotopicparser-utils/custom-mimetypes.xml|file]] and place it on the classpath in the `org/apache/tika/mime` directory, e.g., by doing:{{{
+ $ mkdir $HOME/src/geotopic-mime && cd $HOME/src/geotopic-mime
+ $ mkdir -p org/apache/tika/mime
+ $ curl -O http://github.com/chrismattmann/geotopicparser-utils/custom-mimetypes.xml
+ $ mv custom-mimetypes.xml org/apache/tika/mime
+ }}}
+
+ With those files in place, let's run the GeoTopicParser using Tika-App:
+
+ {{{
+ $ java -classpath tika-app-1.9-SNAPSHOT.jar:$HOME/src/location-ner-model:$HOME/src/geotopic-mime org.apache.tika.cli.TikaCLI -m polar.geot
+ }}}
+
+ This should output:
+
+ {{{
+ Content-Length: 881
+ Content-Type: application/geotopic
+ Geographic_LATITUDE: 27.33931
+ Geographic_LONGITUDE: -108.60288
+ Geographic_NAME: China
+ Optional_LATITUDE1: 39.76
+ Optional_LONGITUDE1: -98.5
+ Optional_NAME1: United States
+ X-Parsed-By: org.apache.tika.parser.DefaultParser
+ X-Parsed-By: org.apache.tika.parser.geo.topic.GeoParser
+ resourceName: polar.geot
+ }}}
+
+ The output consists of 3-tuples of `{Name, Latitude, Longitude}`. The *best* match for the location is the one that occurs most frequently in the text, and that is provided as `Geographic_NAME`, along with its corresponding `Geographic_LATITUDE` and `Geographic_LONGITUDE`. Other places identified as entities by the NER model in the provided text are listed as `Optional_NAME*N*`, e.g., `Optional_NAME1` for the 1st alternative location identified, with its corresponding `Optional_LATITUDE1` and `Optional_LONGITUDE1`.
+
+ == Will this work from Tika Server? ==
+
+ It sure will! 
When you start Tika Server, make sure that the NER model file and the custom MIME type file are on your classpath, and that `lucene-geo-gazetteer` is on the `$PATH` of the environment where Tika-Server is started. You can then post as many .geot files as you'd like, and Tika-Server will happily call the GeoTopicParser to provide you with location information.
+
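
Whether the metadata comes from Tika-App or Tika-Server, the `{Name, Latitude, Longitude}` tuples have to be reassembled from separate keys. The Python sketch below shows one way to do that for the plain `key: value` listing printed by `TikaCLI -m`. It is an illustration only: `parse_locations` is a hypothetical helper, not a Tika API, and it assumes that every `Optional_NAMEn` key has matching `Optional_LATITUDEn` and `Optional_LONGITUDEn` keys, as in the sample output.

```python
import re

# Metadata listing copied from the sample TikaCLI run on polar.geot.
SAMPLE = """\
Content-Length: 881
Content-Type: application/geotopic
Geographic_LATITUDE: 27.33931
Geographic_LONGITUDE: -108.60288
Geographic_NAME: China
Optional_LATITUDE1: 39.76
Optional_LONGITUDE1: -98.5
Optional_NAME1: United States
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.geo.topic.GeoParser
resourceName: polar.geot
"""


def parse_locations(metadata_text):
    """Return (best, alternatives) as (name, latitude, longitude) tuples."""
    fields = {}
    for line in metadata_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            # Repeated keys such as X-Parsed-By overwrite each other here,
            # which is harmless because only location keys are read below.
            fields[key.strip()] = value.strip()

    best = None
    if "Geographic_NAME" in fields:
        best = (
            fields["Geographic_NAME"],
            float(fields["Geographic_LATITUDE"]),
            float(fields["Geographic_LONGITUDE"]),
        )

    numbered = []
    for key, name in fields.items():
        match = re.fullmatch(r"Optional_NAME(\d+)", key)
        if match:
            n = match.group(1)
            numbered.append((
                int(n),
                name,
                float(fields[f"Optional_LATITUDE{n}"]),
                float(fields[f"Optional_LONGITUDE{n}"]),
            ))
    numbered.sort()  # keep alternatives in their numbered order
    return best, [(name, lat, lon) for _, name, lat, lon in numbered]


if __name__ == "__main__":
    best, alternatives = parse_locations(SAMPLE)
    print("best:", best)
    print("alternatives:", alternatives)
```

The helper takes text rather than a file or URL so that the same function can be fed output captured from either the command line or an HTTP response.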
