I wrote a quick in-memory Lucene index using RAMDirectory. It looks like a
valid option for holding the gazetteers, and of course it provides good
text search. The idea is that at runtime the geoentitylinker would pull
three files off disk (the NGAGeonames file, the USGS file, and the
CountryContext indicator file) and index them in memory with Lucene.
Initially this will take a while. So, deployment-wise, you would have to
use your tool of choice (e.g. Puppet) to distribute the files to each
node, or mount a share on each node. My concern with this approach is that
each MR task runs in its own JVM, so each task on each node will hold its
own copy of the index in memory unless you do something interesting with
memory mapping. The EntityLinkerProperties file will support configuring
the file locations and whether to use the DB or the in-memory Lucene
index.
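
To make the config idea concrete, here is a rough sketch of how the switch
and the file locations might be read. The class name and all property keys
are hypothetical placeholders, not the actual EntityLinkerProperties keys,
and it only uses java.util.Properties, so nothing here touches Lucene or
the DB.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Sketch only: the class name and property keys are hypothetical. */
public class GazetteerConfigSketch {

    /** Which backing store the linker should use. */
    public enum Backend { DB, IN_MEMORY_LUCENE }

    public final Backend backend;
    public final String ngaGeonamesPath;    // null when the key is absent
    public final String usgsPath;           // null when the key is absent
    public final String countryContextPath; // null when the key is absent

    public GazetteerConfigSketch(Properties props) {
        // Fall back to the DB backend when the switch is missing.
        boolean inMemory = Boolean.parseBoolean(
                props.getProperty("gazetteer.useInMemoryLucene", "false"));
        this.backend = inMemory ? Backend.IN_MEMORY_LUCENE : Backend.DB;
        this.ngaGeonamesPath = props.getProperty("gazetteer.nga.path");
        this.usgsPath = props.getProperty("gazetteer.usgs.path");
        this.countryContextPath = props.getProperty("gazetteer.countrycontext.path");
    }

    /** Parses properties-file text; wraps IOException so callers stay simple. */
    public static GazetteerConfigSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new GazetteerConfigSketch(props);
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        String example = String.join("\n",
                "gazetteer.useInMemoryLucene=true",
                "gazetteer.nga.path=/data/gaz/nga.txt",
                "gazetteer.usgs.path=/data/gaz/usgs.txt",
                "gazetteer.countrycontext.path=/data/gaz/countrycontext.txt");
        GazetteerConfigSketch cfg = parse(example);
        System.out.println(cfg.backend + " " + cfg.ngaGeonamesPath);
    }
}
```

Defaulting to the DB when the switch is absent is just one possible
choice; it keeps existing setups working without a config change.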

I am also working on a Postgres version of the gazetteer structures and
stored procedures.

Thoughts?


On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <[email protected]> wrote:

> On 10/23/2013 01:14 PM, Mark G wrote:
>
>> All that being said, it is totally possible to run an in memory version of
>> the gazateer. Personally, I like the DB approach, it provides a lot of
>> flexibility and power.
>>
>
> Yes, and you can even use a DB to run in-memory which works with the
> current implementation,
> I think I will experiment with that.
>
> I don't really mind using 3 GB memory for it, since my Hadoop servers have
> more than enough anyway,
> and it makes the deployment easier (don't have to deal with installing
> MySQL
> databases and keeping them in sync).
>
> Jörn
>
