Anya Yun Li created TIKA-1614:
---------------------------------
Summary: Geo Topic Parser
Key: TIKA-1614
URL: https://issues.apache.org/jira/browse/TIKA-1614
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Anya Yun Li
##Description
This program aims to provide support for identifying geonames in any
unstructured text data for the NSF polar research project.
https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1
This project is a content-based geotagging solution built from a variety of NLP
tools, and it could be used for any geotagging purpose.
##Workflow
1. Plain text input is passed to the geoparser
2. Location names are extracted from the text using OpenNLP NER
3. Each extracted name is assigned one of two roles:
* The most frequent location name is chosen as the best match for the
input text
* The other extracted locations are treated as alternatives (all ranked equally)
4. For each location extracted above, the parser searches the gazetteer for the
best GeoName object and returns the resolved objects with the fields (name in
gazetteer, longitude, latitude)
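Step 3 of the workflow can be sketched in plain Java as follows. This is only an illustration of the ranking idea, not the parser's actual code: the class `LocationRanker`, its `rank` method, and the `Ranked` holder are hypothetical names, and the tie-breaking rule (the first-seen name wins) and exact string matching are assumptions that the description above does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Tika or of the submitted parser.
public class LocationRanker {

    /** Holds the best match and the remaining, equally ranked alternatives. */
    public static final class Ranked {
        public final String best;
        public final List<String> alternatives;

        Ranked(String best, List<String> alternatives) {
            this.best = best;
            this.alternatives = alternatives;
        }
    }

    /** Picks the most frequent name as best match; all others become alternatives. */
    public static Ranked rank(List<String> extracted) {
        // Count occurrences while preserving first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : extracted) {
            counts.merge(name, 1, Integer::sum);
        }

        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Strict comparison: on a tie the first-seen name is kept (assumption).
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }

        // Every distinct name except the best match is an alternative.
        List<String> alternatives = new ArrayList<>(counts.keySet());
        alternatives.remove(best);
        return new Ranked(best, alternatives);
    }

    public static void main(String[] args) {
        // Names as an NER step might emit them, in document order.
        Ranked r = rank(Arrays.asList("Alaska", "Greenland", "Alaska", "Svalbard"));
        System.out.println(r.best + " " + r.alternatives);
        // prints "Alaska [Greenland, Svalbard]"
    }
}
```

An insertion-ordered map keeps the alternatives in the order they first appear in the text, which seems a reasonable default given that the alternatives are otherwise treated as equal.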
##How to Use
*Caution*: This program requires at least 1.2 GB of disk space to build
the Lucene index
```Java
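import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// geoparser is the Geo Topic Parser instance; the two paths are assumed to be
// Strings pointing at the gazetteer and the OpenNLP NER model.
void printGeoMetadata(Parser geoparser, InputStream stream,
        String gazetteerPath, String nerPath) throws Exception {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // Tell the parser where to find the gazetteer and the NER model
    GeoParserConfig config = new GeoParserConfig();
    config.setGazetterPath(gazetteerPath);
    config.setNERModelPath(nerPath);
    context.set(GeoParserConfig.class, config);

    geoparser.parse(stream, new BodyContentHandler(), metadata, context);

    // Print every metadata field the parser populated
    for (String name : metadata.names()) {
        String value = metadata.get(name);
        System.out.println(name + " " + value);
    }
}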
```
This parser adds the following geographic information to Tika's Metadata
object.
Fields for best matched location:
```
Geographic_NAME
Geographic_LONGTITUDE
Geographic_LATITUDE
```
Fields for alternatives:
```
Geographic_NAME1
Geographic_LONGTITUDE1
Geographic_LATITUDE1
Geographic_NAME2
Geographic_LONGTITUDE2
Geographic_LATITUDE2
...
```
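The numbering scheme above (no suffix for the best match, then 1, 2, ... for the alternatives) can be read back with a simple loop. The following is a minimal sketch under stated assumptions: a plain `Map<String, String>` stands in for Tika's Metadata object so the example runs without Tika, the class `GeoFieldReader` and its method are hypothetical, the alternatives are assumed to be numbered consecutively from 1, and the coordinate values are illustrative placeholders rather than real parser output. Field names are spelled exactly as listed above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; a Map stands in for Tika's Metadata object.
public class GeoFieldReader {

    /**
     * Returns {name, longitude, latitude} triples: the best match first
     * (no suffix), then the alternatives (suffix 1, 2, ...), stopping at
     * the first index that has no name field.
     */
    public static List<String[]> readLocations(Map<String, String> fields) {
        List<String[]> locations = new ArrayList<>();
        for (int i = 0; ; i++) {
            String suffix = (i == 0) ? "" : Integer.toString(i);
            String name = fields.get("Geographic_NAME" + suffix);
            if (name == null) {
                break; // no more numbered entries
            }
            locations.add(new String[] {
                name,
                fields.get("Geographic_LONGTITUDE" + suffix),
                fields.get("Geographic_LATITUDE" + suffix)
            });
        }
        return locations;
    }

    public static void main(String[] args) {
        // Illustrative placeholder values, not real gazetteer output.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Geographic_NAME", "Alaska");
        fields.put("Geographic_LONGTITUDE", "-150.0");
        fields.put("Geographic_LATITUDE", "64.0");
        fields.put("Geographic_NAME1", "Greenland");
        fields.put("Geographic_LONGTITUDE1", "-40.0");
        fields.put("Geographic_LATITUDE1", "72.0");

        for (String[] loc : readLocations(fields)) {
            System.out.println(loc[0] + " " + loc[1] + " " + loc[2]);
        }
    }
}
```

Against a real Metadata object, the same loop should work by replacing `fields.get(...)` with `metadata.get(...)`, which also returns null for a missing field.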
If you have any questions, contact me: [email protected]