Thanks. I am also working on a rapid model builder framework that I would like you to look at. I posted a description earlier but have had no feedback yet. I was thinking I could check it into the sandbox so everyone can run it, along with a file-based implementation that includes a file of ~200K sentences. The tool lets users specify a file of sentences from their data, a dictionary file of known named entities, and a blacklist file (for false positive reduction) in order to build a model for a specific entity type.

On Thu, Oct 10, 2013 at 12:00 PM, Jörn Kottmann <[email protected]> wrote:

> I will have a look at it tomorrow; we are planning on using the
> entitylinker in one of our systems.
>
> Jörn
>
> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> All,
>> Before I plug some tickets into Jira, I wanted to get some feedback from
>> the team on some changes I would like to make to the EntityLinker
>> GeoEntityLinkerImpl.
>> Below are what I consider improvement tickets.
>>
>> 1. Only the first start and end are populated in the CountryContext
>> object returned from CountryContext.find. It should return all instances
>> of each country mention in a map, so the proximity of other toponyms to
>> the found country indicators can be included as a factor in the scoring.
>>
>> Currently the user only gets the first indexOf for each country mention.
>> The country mentions are an attempt to better gauge ambiguous names
>> (Paris, Texas rather than Paris, France). Because of this, I am not able
>> to do a thorough proximity analysis to assist in scoring. Basically I
>> need every mention of every country indicator in the doc, which I will
>> correlate with every Named Entity span to produce a score. I am also not
>> passing the list of country codes into the database query as a where
>> predicate, which would improve performance tremendously (I will index
>> the column).
>>
>> 2. Discovery of indicators for "country context" should be regex based,
>> in order to provide a more robust ability to discover context.
>>
>> Currently I use String.indexOf(term) to discover the country hit list.
>> Regex would allow users to configure interesting ways to indicate
>> countries. Regex will also provide the array of start/end offsets I need
>> for issue 1 from its Matcher.find.
>>
>> 3. Fuzzy string matching should be part of the scoring. This would allow
>> MySQL fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazetteers uses "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a fuzzy
>> string matching based score (do we have one?), the user could turn on
>> "natural language" mode in MySQL, and then we can generate a score and
>> threshold to allow for more recall on transliterated names etc.
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criterion.
>>
>> 4. Provide a "solution wrapper" for the Geotagging capability.
>>
>> In order to make the GeoTagging a bit more "out of the box" functional,
>> I was thinking of creating a class on which one calls find(MaxentModel,
>> doc, sentencedetector, EntityLinkerProperties) to abstract the current
>> impl. I know this is not standard practice, I just want to see what you
>> all think. This would make it "easier" to get this thing running.
>>
>> thanks!
>> MG
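
Items 1 and 2 in the quoted message could be sketched roughly as follows. This is only an illustration under assumptions: the class CountryMentionFinder, its Offsets type, and the example indicator patterns are hypothetical and are not part of the existing CountryContext code. It shows Matcher.find collecting every start/end offset per country code instead of the single String.indexOf hit, plus a simple character-distance helper that a proximity score could build on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the existing CountryContext API.
final class CountryMentionFinder {

    /** Start (inclusive) and end (exclusive) character offsets of one mention. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns every match of every country indicator, keyed by country code.
     * The indicators map goes from a country code to a user-configured regex.
     * Countries with no match are left out of the result.
     */
    static Map<String, List<Offsets>> findAll(String doc, Map<String, String> indicators) {
        Map<String, List<Offsets>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : indicators.entrySet()) {
            Matcher m = Pattern.compile(e.getValue(), Pattern.CASE_INSENSITIVE).matcher(doc);
            List<Offsets> spans = new ArrayList<>();
            while (m.find()) {
                // Unlike indexOf, find() walks through all occurrences.
                spans.add(new Offsets(m.start(), m.end()));
            }
            if (!spans.isEmpty()) {
                hits.put(e.getKey(), spans);
            }
        }
        return hits;
    }

    /** Character distance from an entity start to the nearest mention, or -1 if there is none. */
    static int nearestDistance(int entityStart, List<Offsets> mentions) {
        int best = -1;
        for (Offsets o : mentions) {
            int d = Math.abs(o.start - entityStart);
            if (best < 0 || d < best) {
                best = d;
            }
        }
        return best;
    }
}
```

The key set of the returned map is the list of country codes found in the document, which is what would be passed to the gazetteer query as a where predicate. How the distances should be weighted into a final score is left open here.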
