Just to elaborate: RAMDirectory keeps the index on the Java heap, so the
garbage collector has to manage all of it. This makes the Java GC work
very, very hard. A memory-mapped file is a write-through cache for file
contents, and the memory in that cache is outside of Java garbage
collection. A memory-mapped index will take a little less time to create
at these volumes. Loading a pre-built memory-mapped index will take
under 5 seconds.
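To make the off-heap point concrete, here is a minimal sketch using only the plain JDK (not Lucene's own code). It maps a small stand-in file read-only and reads it back through the mapping. The class name, method names, and sample file are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of Lucene or the geoentitylinker.
final class MappedFileSketch {

    // Maps the whole file read-only. The mapped bytes are backed by the
    // operating system's page cache rather than by the Java heap.
    static MappedByteBuffer mapReadOnly(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small temporary file that stands in for a gazetteer file.
    static Path writeSample(String text) {
        try {
            Path file = Files.createTempFile("gazetteer-sample", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Copies the remaining mapped bytes into a String (UTF-8).
    static String readAll(MappedByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path file = writeSample("Springfield\tUS\n");
        MappedByteBuffer buffer = mapReadOnly(file);
        // A mapped buffer is a direct buffer, i.e. its storage is off-heap.
        System.out.println("direct (off-heap): " + buffer.isDirect());
        System.out.print(readAll(buffer));
    }
}
```

Only the small byte array created in readAll lives on the heap here; the mapped region itself is paged in and out by the operating system.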
On 10/29/2013 03:43 PM, Mark G wrote:
Thanks, that was my next option with Lucene: build the indexes from the
gazetteer files, keep them up to date in one place, and make sure
something like Puppet distributes them to each node in a cluster on some
interval. Then each task (MapReduce or whatever) can use that file
resource. I'll let everyone know how it goes.
MG
On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog wrote:
This is what memory-mapped file indexes are for! RAMDirectory is for very
small projects.
On 10/29/2013 04:00 AM, Mark G wrote:
FYI, I implemented an in-memory Lucene index of the NGA Geonames. It used
almost 7 GB of RAM and took about 40 minutes to load.
Still looking at other DBs/indexes. So one would need at least 10 GB of
RAM to hold the USGS and NGA gazetteers.
On Fri, Oct 25, 2013 at 6:21 AM, Mark G wrote:
I wrote a quick Lucene RAMDirectory in-memory index. It looks like a valid
option to hold the gazetteers, and it provides good text search, of course.
The idea is that at runtime the geoentitylinker would pull three files off
disk (the NGAGeonames file, the USGS file, and the CountryContext indicator
file) and index them with Lucene in memory. Initially this will take a
while. So, deployment-wise, you would have to use your tool of choice
(e.g. Puppet) to distribute the files to each node, or mount a share on
each node. My concern with this approach is that each MR task runs in its
own JVM, so each task on each node will consume this much memory unless you
do something interesting with memory mapping. The EntityLinkerProperties
file will support the config of the file locations and whether to use DB or
in-memory Lucene...
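The per-JVM concern can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: the 7 GB figure is the NGA index size reported elsewhere in this thread, while the 8 tasks per node, the class name, and the method names are assumptions made up for illustration. It also assumes that a read-only memory-mapped file is held once in the OS page cache and shared by every process on the node that maps it.

```java
// Hypothetical illustration class; the numbers in main are assumptions.
final class GazetteerMemorySketch {

    // Heap needed on one node when every task JVM loads its own
    // in-memory (RAMDirectory-style) copy of the index.
    static long perNodeHeapGb(int tasksPerNode, long indexGb) {
        return (long) tasksPerNode * indexGb;
    }

    // Upper bound on page-cache use on one node when all task JVMs map the
    // same read-only file: at most one copy, regardless of the task count.
    static long perNodeMappedGb(int tasksPerNode, long indexGb) {
        return tasksPerNode > 0 ? indexGb : 0;
    }

    public static void main(String[] args) {
        int tasksPerNode = 8; // assumed number of concurrent MR tasks per node
        long indexGb = 7;     // NGA index size reported in this thread
        System.out.println("in-heap copies: "
                + perNodeHeapGb(tasksPerNode, indexGb) + " GB");   // 56 GB
        System.out.println("shared mapping: at most "
                + perNodeMappedGb(tasksPerNode, indexGb) + " GB"); // 7 GB
    }
}
```

The on-disk index size will not match the heap footprint exactly, so treat the second figure as an order-of-magnitude estimate.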
I am also working on a Postgres version of the gazetteer structures and
stored procs.
Thoughts?
On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann wrote:
On 10/23/2013 01:14 PM, Mark G wrote:
All that being said, it is totally possible to run an in-memory version of
the gazetteer. Personally, I like the DB approach; it provides a lot of
flexibility and power.
Yes, and you can even use a DB that runs in memory, which works with the
current implementation. I think I will experiment with that.
I don't really mind using 3 GB of memory for it, since my Hadoop servers
have more than enough anyway, and it makes the deployment easier (no need
to deal with installing MySQL databases and keeping them in sync).
Jörn