FYI, I implemented an in-memory Lucene index of the NGA Geonames gazetteer. It used almost 7 GB of RAM and took about 40 minutes to load. I'm still looking at other DBs/indexes. So one would need at least 10 GB of RAM to hold both the USGS and NGA gazetteers.
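
A minimal sketch of the memory-mapping idea raised in the quoted message, using plain Java NIO. It has not been tried against the real gazetteer files. The names `MappedGazetteer`, `mapReadOnly`, and `countLines` are hypothetical and are not part of the geoentitylinker. Read-only mappings are backed by the OS page cache, so several task JVMs on one node that map the same file can share the physical pages instead of each holding its own heap copy. This only maps the raw file; it does not build a searchable index.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class MappedGazetteer {

    // One MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes,
    // so a multi-GB file has to be mapped as several chunks.
    static final long DEFAULT_CHUNK = 1L << 30; // 1 GiB

    // Maps the whole file read-only, chunkSize bytes at a time.
    // chunkSize must be positive and at most Integer.MAX_VALUE.
    static List<MappedByteBuffer> mapReadOnly(Path file, long chunkSize) throws IOException {
        List<MappedByteBuffer> chunks = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }
        // The mappings stay valid after the channel is closed.
        return chunks;
    }

    // Example scan: counts newline bytes across all chunks
    // without copying the file onto the Java heap.
    static long countLines(List<MappedByteBuffer> chunks) {
        long lines = 0;
        for (MappedByteBuffer buf : chunks) {
            for (int i = 0; i < buf.limit(); i++) {
                if (buf.get(i) == '\n') {
                    lines++;
                }
            }
        }
        return lines;
    }
}
```

Usage would be something like `MappedGazetteer.mapReadOnly(path, MappedGazetteer.DEFAULT_CHUNK)` once per task. For text search, Lucene also ships an `MMapDirectory`, which might serve a pre-built on-disk index in a similar way, but that would mean building the index ahead of time rather than at startup.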

On Fri, Oct 25, 2013 at 6:21 AM, Mark G <[email protected]> wrote:

> I wrote a quick Lucene RAMDirectory in-memory index. It looks like a valid
> option to hold the gazetteers, and it provides good text search of course.
> The idea is that at runtime the geoentitylinker would pull three files off
> disk (the NGAGeonames file, the USGS file, and the CountryContext indicator
> file) and Lucene-index them in memory. Initially this will take a while.
> So, deployment-wise, you would have to use your tool of choice (e.g. Puppet)
> to distribute the files to each node, or mount a share on each node. My
> concern with this approach is that each MR task runs in its own JVM, so
> each task on each node will consume this much memory unless you do
> something interesting with memory mapping. The EntityLinkerProperties file
> will support the config of the file locations and whether to use the DB or
> in-memory Lucene...
>
> I am also working on a Postgres version of the gazetteer structures and
> stored procs.
>
> Thoughts?
>
>
> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <[email protected]> wrote:
>
>> On 10/23/2013 01:14 PM, Mark G wrote:
>>
>>> All that being said, it is totally possible to run an in-memory version of
>>> the gazetteer. Personally, I like the DB approach; it provides a lot of
>>> flexibility and power.
>>
>> Yes, and you can even use a DB to run in-memory, which works with the
>> current implementation.
>> I think I will experiment with that.
>>
>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>> have more than enough anyway,
>> and it makes the deployment easier (don't have to deal with installing
>> MySQL databases and keeping them in sync).
>>
>> Jörn
