The database only takes up about 3GB of storage right now. Since I used pure
JDBC and JDBC-style stored proc calls, it can run with any JDBC driver, and
all the connection props are in the EntityLinkerProperties file, so it can
run on other database engines. Using the MySQL fuzzy string matching is
currently optional: all one has to do is change the stored proc to boolean
mode rather than natural language mode. If you really mean, do we have to
use MySQL FULL TEXT *INDEXING*, then no, but with around 10 million
toponyms it provides super fast lookups without consuming a lot of memory.
If I were running the OpenNLP GeoEntityLinker in, say, MapReduce, with
multiple tasks on each node, I would not want to pull 3GB into memory for
each task. The way it is now, one could distribute MySQL to each node via
something like Puppet, and it would serve requests from the tasks on that
node. Or, if they have a beefy server, they could make one large instance
of MySQL and have each node in the cluster connect to it.
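
To make the "pure JDBC plus stored proc" point concrete, here is a rough
sketch of what such a lookup can look like. It is not the actual
GeoEntityLinker code: the property keys (db.url, db.user, db.password,
db.searchProc) and the two-parameter procedure are assumptions made up for
illustration. The only engine-neutral piece it relies on is the standard
JDBC "{call ...}" escape syntax.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class GazetteerDbSketch {

    // Builds the portable JDBC escape string for a stored procedure call,
    // e.g. "{call search_gaz(?, ?)}". The procedure name is hypothetical.
    static String buildCall(String procName, int paramCount) {
        StringBuilder sb = new StringBuilder("{call ").append(procName).append('(');
        for (int i = 0; i < paramCount; i++) {
            sb.append(i == 0 ? "?" : ", ?");
        }
        return sb.append(")}").toString();
    }

    // Looks up candidate toponyms through the stored procedure named in the
    // properties. All property keys here are assumed, not the real ones.
    static List<String> search(Properties props, String name, int maxRows)
            throws SQLException {
        List<String> hits = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(
                 props.getProperty("db.url"),
                 props.getProperty("db.user"),
                 props.getProperty("db.password"));
             CallableStatement cs = con.prepareCall(
                 buildCall(props.getProperty("db.searchProc"), 2))) {
            cs.setString(1, name);   // literal NER result to search for
            cs.setInt(2, maxRows);   // cap on returned candidates
            try (ResultSet rs = cs.executeQuery()) {
                while (rs.next()) {
                    hits.add(rs.getString(1)); // first column assumed to be the toponym
                }
            }
        }
        return hits;
    }
}
```

Nothing in the Java side is MySQL-specific in a sketch like this; whether
the search runs in boolean or natural language mode is decided inside the
stored procedure.
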
All that being said, it is totally possible to run an in-memory version of
the gazetteer. Personally, I like the DB approach; it provides a lot of
flexibility and power.
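
For comparison, a minimal sketch of what an in-memory gazetteer could look
like. The class and field names are invented for illustration and nothing
like this exists in the current code. It only does exact matching on a
normalized name, so it gives up the fuzzy matching that the full text index
provides, and every task that builds one holds its own copy of the data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class InMemoryGazetteer {

    // One gazetteer row; a real entry would carry more fields.
    static final class Entry {
        final String name;
        final double lat;
        final double lon;

        Entry(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Normalized toponym -> all places sharing that name.
    private final Map<String, List<Entry>> byName = new HashMap<>();

    // Trim and lowercase so that lookups ignore case and padding.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    void add(String name, double lat, double lon) {
        byName.computeIfAbsent(normalize(name), k -> new ArrayList<>())
              .add(new Entry(name, lat, lon));
    }

    // Exact match on the normalized name; returns an empty list on a miss.
    List<Entry> lookup(String name) {
        return byName.getOrDefault(normalize(name), Collections.emptyList());
    }
}
```

Usage would be along the lines of calling add("Paris", 48.8566, 2.3522)
while loading the data, then lookup("paris") for each NER result.
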


On Tue, Oct 22, 2013 at 2:39 PM, Jörn Kottmann <[email protected]> wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 3. fuzzy string matching should be part of the scoring, this would allow
>> mysql fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazetteers is using "boolean mode" and
>> each NER result is passed in as a literal string. If I implement a fuzzy
>> string matching based score (do we have one?) the user could turn on
>> "natural language" mode in MySQL then we can generate a score and thresh
>> to allow for more recall on transliterated names etc....
>> I would also like to use proximity to the majority of points in the
>> document as a disambiguation criteria as well.
>>
>
> It would probably be nice if this would work with other databases too,
> e.g. Apache Derby, or some in-memory database, maybe even Lucene.
>
> Would it be possible to not use the MySQL fuzzy string matching feature
> for this?
>
> I would like to run your code, but it's difficult to scale the MySQL
> database in my scenario, but I have lots of RAM and believe the geonames
> dataset could fit into it to provide super fast lookups for me on my
> worker servers.
>
> Jörn
>
