Hello Mark,
this issue will go into opennlp-tools for the next release.
To make it show up in our automatically generated issue list, the fix
version has to be set to 1.6.0.
Can you please reopen the issue, set the "Fix Version" to 1.6.0 and
close it again?
Thanks,
Jörn
On 03/11/2014 11:53 AM, Mark Giaconia (JIRA) wrote:
[ https://issues.apache.org/jira/browse/OPENNLP-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Giaconia closed OPENNLP-643.
---------------------------------
Resolution: Fixed
Initial implementation committed; improvements will continue over time.
Provide default rule based (regex) name finders (phone num, url, email, coords)
-------------------------------------------------------------------------------
Key: OPENNLP-643
URL: https://issues.apache.org/jira/browse/OPENNLP-643
Project: OpenNLP
Issue Type: New Feature
Components: Name Finder
Affects Versions: 1.6.0
Reporter: Mark Giaconia
Assignee: Mark Giaconia
Priority: Minor
It would be nice if OpenNLP came with some basic rule-based name finders
(RegexNameFinders) for basic types. Initially I would like to create an engine
that runs phone number, email, URL, MGRS, and DD lat/lon finders.
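To make the idea concrete, here is a minimal sketch of a type-to-pattern map and a loop that runs every pattern over a piece of text. It uses only java.util.regex, not OpenNLP. The class name DefaultPatternSketch, the helper names and the three regexes are assumptions for illustration; the patterns are deliberately loose and are not the defaults committed for this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultPatternSketch {

  // Illustrative patterns only (hypothetical, not the OpenNLP defaults).
  static final Pattern EMAIL =
      Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
  static final Pattern URL = Pattern.compile("https?://\\S+");
  // Decimal-degree pair such as "38.8977, -77.0365"; value ranges are not validated.
  static final Pattern DD_LAT_LON =
      Pattern.compile("-?\\d{1,2}\\.\\d+,\\s*-?\\d{1,3}\\.\\d+");

  /** Builds the type -> patterns map; LinkedHashMap keeps insertion order. */
  public static Map<String, Pattern[]> defaults() {
    Map<String, Pattern[]> map = new LinkedHashMap<>();
    map.put("email", new Pattern[] {EMAIL});
    map.put("url", new Pattern[] {URL});
    map.put("DD coord", new Pattern[] {DD_LAT_LON});
    return map;
  }

  /** Runs every pattern of every type over the text and collects the matched strings. */
  public static Map<String, List<String>> findAll(Map<String, Pattern[]> typePatternMap,
      String text) {
    Map<String, List<String>> out = new LinkedHashMap<>();
    for (Map.Entry<String, Pattern[]> entry : typePatternMap.entrySet()) {
      List<String> hits = new ArrayList<>();
      for (Pattern pattern : entry.getValue()) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
          hits.add(matcher.group());
        }
      }
      out.put(entry.getKey(), hits);
    }
    return out;
  }

  public static void main(String[] args) {
    String text = "Mail jane.doe@example.org or see http://example.org/contact near 38.8977, -77.0365";
    System.out.println(findAll(defaults(), text));
  }
}
```

Real phone, MGRS and coordinate patterns would need to be considerably stricter; the point here is only the shape of the map and the per-type loop.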
Also, we need a framework for loading additional regexes other than the
defaults mentioned above.
Here is my initial thought... a class that has a set of default types and
patterns in a map that runs the RegexNameFinder, with optional constructors to
override the map, or read from a config file.
Let me know what you think...
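import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

/**
 * Constructs a set of RegexNameFinders from configuration or from a provided
 * Map.
 */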
public class RuleBasedEntityFinderEngine {

  private static final String PHONE_REGEX = "";
  private static final String EMAIL_REGEX = "";
  private static final String URL_REGEX = "";
  private static final String MGRS_REGEX = "";
  private static final String DDLATLON_REGEX = "";
  private static final String PHONE_REGEX_TYPE = "phone number";
  private static final String EMAIL_REGEX_TYPE = "email";
  private static final String URL_REGEX_TYPE = "url";
  private static final String MGRS_REGEX_TYPE = "MGRS coord";
  private static final String DDLATLON_REGEX_TYPE = "DD coord";
  private Map<String, Pattern[]> typePatternMap = new HashMap<>();
  Properties properties;

  /**
   * Loads a set of patterns via configuration. The file should have the entity
   * type with no spaces, followed by the regex. For types that have multiple
   * regexes, repeat the type on each line. For example:
   *
   * phone_num <phonenum regex1>
   * phone_num <phonenum regex2>
   * email <regex1>
   *
   * Entries are loaded in order from top to bottom of the file, so if order
   * matters, list the regexes accordingly.
   *
   * @param properties the InputStream of properties from which to load the
   *                   regexes
   * @param includeDefaults when true, adds the defaults to the map. If there
   *                        is a key collision in the map, the default will
   *                        override.
   * @throws IOException if the properties stream cannot be read
   */
  public RuleBasedEntityFinderEngine(InputStream properties, boolean includeDefaults)
      throws IOException {
    // TODO: includeDefaults is not used yet; loading from config is still unimplemented in init()
    this.properties = new Properties();
    this.properties.load(properties);
    init();
  }

  /**
   * @param typePatternMap a map of name types (e.g. phone number, email...) to
   *                       an array of regex Patterns. This map is the basis
   *                       for instantiating RegexNameFinders.
   * @param includeDefaults when true, adds the defaults to the map. If there
   *                        is a key collision in the map, the default will
   *                        override.
   */
  public RuleBasedEntityFinderEngine(Map<String, Pattern[]> typePatternMap,
      boolean includeDefaults) {
    this.typePatternMap = typePatternMap;
    if (includeDefaults) {
      init();
    }
  }

  /**
   * Loads the default regexes and types into the map.
   */
  private void init() {
    if (properties != null) {
      // TODO: get the regexes from config somewhere
    } else {
      // load the default regexes
      typePatternMap.put(PHONE_REGEX_TYPE, new Pattern[]{Pattern.compile(PHONE_REGEX)});
      typePatternMap.put(EMAIL_REGEX_TYPE, new Pattern[]{Pattern.compile(EMAIL_REGEX)});
      typePatternMap.put(URL_REGEX_TYPE, new Pattern[]{Pattern.compile(URL_REGEX)});
      typePatternMap.put(MGRS_REGEX_TYPE, new Pattern[]{Pattern.compile(MGRS_REGEX)});
      typePatternMap.put(DDLATLON_REGEX_TYPE, new Pattern[]{Pattern.compile(DDLATLON_REGEX)});
    }
  }

  public Map<String, Span[]> find(String[] tokens) {
    Map<String, Span[]> outSpans = new HashMap<>();
    if (typePatternMap != null) {
      for (Map.Entry<String, Pattern[]> finder : typePatternMap.entrySet()) {
        RegexNameFinder nf = new RegexNameFinder(finder.getValue(), finder.getKey());
        Span[] spans = nf.find(tokens);
        outSpans.put(finder.getKey(), spans);
      }
    }
    return outSpans;
  }

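  public Map<String, Pattern[]> getTypePatternMap() {
    // The map is populated by the constructors; returning it must not re-add
    // the defaults when includeDefaults was false.
    return typePatternMap;
  }
}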
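
The config format described in the constructor Javadoc above repeats a type on several lines and relies on file order. java.util.Properties keeps only the last value for a repeated key, does not preserve order, and treats backslashes as escape characters, so a plain line-based reader may be a better fit. The following is a minimal sketch under that assumption. The class name PatternConfigSketch, the load method and the '#' comment convention are hypothetical and not part of the proposal.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PatternConfigSketch {

  /** Parses "type regex" lines, keeping file order and grouping repeated types. */
  public static Map<String, Pattern[]> load(Reader in) throws IOException {
    Map<String, List<Pattern>> grouped = new LinkedHashMap<>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // assumption: blank lines and '#' comment lines are skipped
      }
      // The type contains no spaces; everything after the first whitespace run is the regex.
      String[] parts = line.split("\\s+", 2);
      if (parts.length < 2) {
        throw new IOException("Expected '<type> <regex>' but got: " + line);
      }
      grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
          .add(Pattern.compile(parts[1]));
    }
    Map<String, Pattern[]> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Pattern>> entry : grouped.entrySet()) {
      result.put(entry.getKey(), entry.getValue().toArray(new Pattern[0]));
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    String config = "phone_num \\d{3}-\\d{4}\n"
        + "phone_num \\(\\d{3}\\) \\d{3}-\\d{4}\n"
        + "email \\S+@\\S+\n";
    Map<String, Pattern[]> map = load(new StringReader(config));
    for (Map.Entry<String, Pattern[]> entry : map.entrySet()) {
      System.out.println(entry.getKey() + " -> " + entry.getValue().length + " pattern(s)");
    }
  }
}
```

The resulting map has the same Map<String, Pattern[]> shape that the second constructor accepts, so a loader like this could feed the engine without going through Properties at all.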
--
This message was sent by Atlassian JIRA
(v6.2#6252)