All,

Before I plug some tickets into Jira, I wanted to get some feedback from the team on some changes I would like to make to the EntityLinker GeoEntityLinkerImpl. Below are what I consider improvement tickets.

1. Only the first start and end are populated in the CountryContext object returned from CountryContext.find. It should return all instances of each country mention in a map, so that the proximity of other toponyms to the found country indicators can be included as a factor in the scoring.

   Currently the user only gets the first indexOf for each country mention. The country mentions are an attempt to better resolve ambiguous names (Paris, Texas rather than Paris, France). Because only the first hit is kept, I am not able to do a thorough proximity analysis to assist in scoring. Basically, I need every mention of every country indicator in the doc, which I will correlate with every named entity span to produce a score. I am also not passing the list of country codes into the database query as a WHERE predicate, which would improve performance tremendously (I will index the column).

2. Discovery of indicators for "country context" should be regex based, in order to provide a more robust ability to discover context.

   Currently I use String.indexOf(term) to discover the country hit list. Regex would allow users to configure interesting ways to indicate countries. Regex will also provide the array of start/end offsets I need for item 1, from Matcher.find.

3. Fuzzy string matching should be part of the scoring. This would allow MySQL fuzzy search to return more candidate toponyms.

   Currently, the search into the MySQL gazetteers uses "boolean mode", and each NER result is passed in as a literal string. If I implement a score based on fuzzy string matching (do we have one?), the user could turn on "natural language" mode in MySQL, and we could then generate a score and apply a threshold to allow for more recall on transliterated names, etc. I would also like to use proximity to the majority of points in the document as a disambiguation criterion.

4. Provide a "solution wrapper" for the geotagging capability.

   In order to make the geotagging a bit more "out of the box" functional, I was thinking of creating a class on which one calls find(MaxentModel, doc, sentencedetector, EntityLinkerProperties) to abstract the current impl. I know this is not standard practice, I just want to see what you all think. This would make it "easier" to get this thing running.

Thanks!
MG
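
P.S. To make items 1 and 2 a bit more concrete, here is a rough sketch of the kind of thing I have in mind. It is not the existing CountryContext code. The class name CountryMentionSketch, the method findAllMentions, and the idea of a map from country code to a user-configured regex are all assumptions made up for illustration. The only real API it relies on is java.util.regex, where each Matcher.find hit gives a start and end offset.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; none of these names exist in the current code base.
final class CountryMentionSketch {

    private CountryMentionSketch() {
    }

    /**
     * Finds every mention of every configured country indicator in the text.
     *
     * @param docText    the full document text
     * @param indicators assumed config: country code -> regex describing its indicators
     * @return country code -> list of {start, end} offsets (end is exclusive),
     *         in document order; codes with no hits are left out
     */
    static Map<String, List<int[]>> findAllMentions(String docText,
                                                    Map<String, String> indicators) {
        Map<String, List<int[]>> mentions = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : indicators.entrySet()) {
            Pattern pattern = Pattern.compile(entry.getValue(), Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(docText);
            List<int[]> spans = new ArrayList<>();
            // Unlike String.indexOf, looping on find() walks through all hits.
            while (matcher.find()) {
                spans.add(new int[] {matcher.start(), matcher.end()});
            }
            if (!spans.isEmpty()) {
                mentions.put(entry.getKey(), spans);
            }
        }
        return mentions;
    }
}
```

With something like "US" mapped to a pattern such as \bTexas\b|\bUnited States\b, the returned spans could then be compared against each named entity span for the proximity scoring, and the keys of the map could be what gets passed into the database query as the country code predicate.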
