FYI, I implemented an in-memory Lucene index of the NGA Geonames. It used
almost 7 GB of RAM and took about 40 minutes to load.
I am still looking at other DBs/indexes. So one would need at least 10 GB of
RAM to hold both the USGS and NGA gazetteers.
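
On the per-JVM memory concern from my earlier message: one thing that might
help is memory mapping. A read-only mapped file lives in the OS page cache
rather than on the Java heap, so several task JVMs on the same node can share
the same physical pages. Below is a minimal sketch using only the JDK, not
what the geoentitylinker does today. The class name, method names, and the
tab-separated sample rows are made up for illustration and are not the real
NGA/USGS layout. Note that a single FileChannel.map call is limited to 2 GB,
so a file the size of the NGA dump would need several mappings. Lucene also
ships an MMapDirectory that may be worth a look, but I have not tried it here.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: names and file layout are assumptions, not project code.
final class MappedGazetteerSketch {

    /** Maps the whole file read-only. The pages live in the OS page cache, not the Java heap. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping cannot exceed Integer.MAX_VALUE bytes (about 2 GB).
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Linear scan for the first line that starts with the prefix; returns null if none. */
    static String firstLineStartingWith(MappedByteBuffer buf, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        int limit = buf.limit();
        int lineStart = 0;
        while (lineStart < limit) {
            // Find the end of the current line using absolute reads.
            int lineEnd = lineStart;
            while (lineEnd < limit && buf.get(lineEnd) != '\n') {
                lineEnd++;
            }
            if (lineEnd - lineStart >= p.length && startsWith(buf, lineStart, p)) {
                byte[] line = new byte[lineEnd - lineStart];
                for (int i = 0; i < line.length; i++) {
                    line[i] = buf.get(lineStart + i);
                }
                return new String(line, StandardCharsets.UTF_8);
            }
            lineStart = lineEnd + 1;
        }
        return null;
    }

    private static boolean startsWith(MappedByteBuffer buf, int offset, byte[] prefix) {
        for (int i = 0; i < prefix.length; i++) {
            if (buf.get(offset + i) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample rows (name, latitude, longitude) standing in for a gazetteer file.
        Path file = Files.createTempFile("gazetteer-sample", ".tsv");
        file.toFile().deleteOnExit();
        String rows = "Paris\t48.85\t2.35\nBerlin\t52.52\t13.40\n";
        Files.write(file, rows.getBytes(StandardCharsets.UTF_8));

        MappedByteBuffer buf = mapReadOnly(file);
        System.out.println(firstLineStartingWith(buf, "Berlin"));
    }
}
```

A linear scan like this is obviously no substitute for Lucene's text search;
the point is only that the mapped bytes are not counted against each task's
heap.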


On Fri, Oct 25, 2013 at 6:21 AM, Mark G <[email protected]> wrote:

> I wrote a quick Lucene RAMDirectory in-memory index. It looks like a valid
> option to hold the gazetteers, and it provides good text search, of course.
> The idea is that at runtime the geoentitylinker would pull three files off
> disk (the NGA Geonames file, the USGS file, and the CountryContext indicator
> file) and index them in memory with Lucene. Initially this will take a while.
> So, deployment-wise, you would have to use your tool of choice (e.g. Puppet)
> to distribute the files to each node, or mount a share on each node. My
> concern with this approach is that each MR task runs in its own JVM, so
> each task on each node will consume this much memory unless you do
> something interesting with memory mapping. The EntityLinkerProperties file
> will support configuring the file locations and whether to use the DB or the
> in-memory Lucene index...
>
> I am also working on a Postgres version of the gazetteer structures and
> stored procs.
>
> Thoughts?
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <[email protected]> wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in-memory version
>>> of the gazetteer. Personally, I like the DB approach; it provides a lot of
>>> flexibility and power.
>>>
>>
>> Yes, and you can even use a DB to run in-memory which works with the
>> current implementation,
>> I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway,
>> and it makes the deployment easier (don't have to deal with installing
>> MySQL databases and keeping them in sync).
>>
>> Jörn
>>
>
>
