Thanks. I am also working on a rapid model builder framework that I would
like you to look at. I posted a description earlier but have had no feedback
yet. I was thinking I could check it into the sandbox so everyone can run it,
along with a file-based implementation that includes a file of ~200K
sentences.
This tool would allow users to specify a file of sentences from their data,
a file (dictionary) of known named entities, and a blacklist file (for
false positive reduction) in order to build a model for a specific entity
type.


On Thu, Oct 10, 2013 at 12:00 PM, Jörn Kottmann <[email protected]> wrote:

> I will have a look at it tomorrow; we are planning on using the
> entitylinker in one of our systems.
>
> Jörn
>
>
> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> All,
>> Before I plug some tickets into Jira, I wanted to get some feedback from
>> the team on some changes I would like to make to the EntityLinker
>> GeoEntityLinkerImpl.
>> Below are what I consider improvement tickets.
>>
>> 1. Only the first start and end are populated in the CountryContext object
>> returned from CountryContext.find. It should return all instances of each
>> country mention in a map, so that the proximity of other toponyms to the
>> found country indicators can be included as a factor in the scoring.
>>
>> Currently the user only gets the first indexOf for each country mention.
>> The country mentions are an attempt to better gauge ambiguous names (Paris,
>> Texas rather than Paris, France). Because only the first hit is returned, I
>> cannot do a thorough proximity analysis to assist in scoring. Basically I
>> need every mention of every country indicator in the doc, which I will
>> correlate with every Named Entity span to produce a score. I am also not
>> passing the list of country codes into the database query as a where
>> predicate, which would improve performance tremendously (I will index the
>> column).
>>
>> 2. Discovery of indicators for "country context" should be regex based, in
>> order to provide a more robust ability to discover context
>>
>> Currently I use String.indexOf(term) to discover the country hit list.
>> Regex would allow users to configure interesting ways to indicate
>> countries. Regex will also provide the array of start/end offsets I need
>> for issue 1, via its Matcher.find.
>>
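To make items 1 and 2 concrete, here is a minimal sketch of what a regex-based
lookup could return. It is not the current CountryContext code: the class
name CountryMentionFinder, the method findAll, and the sample patterns are all
hypothetical, and only the java.util.regex calls are real API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: collects every mention of each
// country indicator instead of only the first indexOf hit.
class CountryMentionFinder {

    // Returns country code -> list of {start, end} offsets (end is exclusive).
    static Map<String, List<int[]>> findAll(String text, Map<String, Pattern> indicators) {
        Map<String, List<int[]>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : indicators.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            // Matcher.find advances through the text, so the loop sees every mention.
            while (matcher.find()) {
                hits.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                    .add(new int[] {matcher.start(), matcher.end()});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample indicator patterns, made up for illustration only.
        Map<String, Pattern> indicators = new LinkedHashMap<>();
        indicators.put("FR", Pattern.compile("\\b(France|French)\\b", Pattern.CASE_INSENSITIVE));
        indicators.put("US", Pattern.compile("\\b(Texas|United States|USA)\\b", Pattern.CASE_INSENSITIVE));

        String doc = "Paris, Texas is not in France. France is in Europe.";
        for (Map.Entry<String, List<int[]>> e : findAll(doc, indicators).entrySet()) {
            for (int[] span : e.getValue()) {
                System.out.println(e.getKey() + " [" + span[0] + ", " + span[1] + ")");
            }
        }
    }
}
```

The offsets in the map can then be compared against each Named Entity span
for the proximity factor, and the map keys could serve as the country code
list for the where predicate.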
>> 3. Fuzzy string matching should be part of the scoring; this would allow a
>> MySQL fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazetteers uses "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a score
>> based on fuzzy string matching (do we have one?), the user could turn on
>> "natural language" mode in MySQL, and we could then generate a score and
>> apply a threshold to allow for more recall on transliterated names, etc.
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criterion.
>>
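As one possible shape for the fuzzy score in item 3, here is a sketch that
uses normalized Levenshtein edit distance. That metric is my assumption, not
something the thread settles on, and the class FuzzyScore with its methods is
hypothetical rather than existing OpenNLP code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical scorer: normalized Levenshtein similarity in the range [0, 1].
class FuzzyScore {

    // Classic dynamic-programming edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical ignoring case; 0.0 means nothing in common.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / max;
    }

    // Keeps only the gazetteer candidates that score at or above the threshold.
    static List<String> filter(String mention, List<String> candidates, double threshold) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (similarity(mention, candidate) >= threshold) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("Kabul", "Kabol", "Kabwe", "Karachi");
        System.out.println(filter("Kabul", candidates, 0.8));
    }
}
```

The idea is that the looser "natural language" query returns a wide candidate
list, and a score like this one, with a user-set threshold, decides which
candidates survive.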
>> 4. provide a "solution wrapper" for the Geotagging capability
>>
>> In order to make the GeoTagging a bit more "out of the box" functional, I
>> was thinking of creating a class on which one calls find(MaxentModel, doc,
>> sentencedetector, EntityLinkerProperties) to abstract the current
>> implementation. I know this is not standard practice; I just want to see
>> what you all think. This would make it "easier" to get this thing running.
>>
>> thanks!
>> MG
>>
>>
>
