I finished the Lucene indexing of the gazetteers; I just need to tie the indexes into the gaz lookups, which is fairly simple. Do you all think I should drop the MySQL dependency entirely and just use Lucene? The Lucene index files are only about 2.5 GB in total, so it is very manageable to distribute them across a cluster. I could keep the MySQL classes as an option, but at this point the Lucene-based approach is really growing on me. If I don't hear from anyone, I am going to remove the MySQL implementation.

Thanks,
MG

On Wed, Oct 30, 2013 at 7:34 PM, Lance Norskog <[email protected]> wrote:

> Just to elaborate: the RAMDirectory storage is in Java GC. This makes
> Java GC work very, very hard. A memory-mapped file is a write-through
> cache for file contents. The memory in the cache is outside of Java
> garbage collection. A memory-mapped index will take a little less time
> to create at these volumes. Loading a pre-built memory-mapped index
> will take under 5 seconds.
>
> On 10/29/2013 03:43 PM, Mark G wrote:
>
>> Thanks, that was my next option with Lucene. Build the indexes from
>> the gaz files and keep them up to date in one place, and make sure
>> something like Puppet will distribute them to each node in a cluster
>> on some interval; then each task (MapReduce or whatever) can use that
>> file resource. I'll let everyone know how it goes.
>> MG
>>
>> On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog <[email protected]> wrote:
>>
>>> This is what memory-mapped file indexes are for! RAMDirectory is for
>>> very small projects.
>>>
>>> On 10/29/2013 04:00 AM, Mark G wrote:
>>>
>>>> FYI, I implemented an in-memory Lucene index of the NGA Geonames.
>>>> It used almost 7 GB of RAM and took about 40 minutes to load.
>>>> Still looking at other DBs/indexes. So one would need at least
>>>> 10 GB of RAM to hold the USGS and NGA gazetteers.
>>>>
>>>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <[email protected]> wrote:
>>>>
>>>>> I wrote a quick Lucene RAMDirectory in-memory index. It looks like
>>>>> a valid option to hold the gazetteers, and it provides good text
>>>>> search of course. The idea is that at runtime the geoentitylinker
>>>>> would pull three files off disk (the NGA Geonames file, the USGS
>>>>> file, and the CountryContext indicator file) and Lucene-index them
>>>>> in memory. Initially this will take a while.
>>>>> So, deployment-wise, you would have to use your tool of choice
>>>>> (e.g. Puppet) to distribute the files to each node, or mount a
>>>>> share on each node. My concern with this approach is that each MR
>>>>> task runs in its own JVM, so each task on each node will consume
>>>>> this much memory unless you do something interesting with memory
>>>>> mapping. The EntityLinkerProperties file will support the config
>>>>> of the file locations and whether to use the DB or in-memory
>>>>> Lucene...
>>>>>
>>>>> I am also working on a Postgres version of the gazetteer
>>>>> structures and stored procs.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <[email protected]> wrote:
>>>>>
>>>>>> On 10/23/2013 01:14 PM, Mark G wrote:
>>>>>>
>>>>>>> All that being said, it is totally possible to run an in-memory
>>>>>>> version of the gazetteer. Personally, I like the DB approach; it
>>>>>>> provides a lot of flexibility and power.
>>>>>>
>>>>>> Yes, and you can even use a DB to run in-memory, which works with
>>>>>> the current implementation.
>>>>>> I think I will experiment with that.
>>>>>>
>>>>>> I don't really mind using 3 GB of memory for it, since my Hadoop
>>>>>> servers have more than enough anyway, and it makes the deployment
>>>>>> easier (don't have to deal with installing MySQL databases and
>>>>>> keeping them in sync).
>>>>>>
>>>>>> Jörn
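
As an aside on the memory-mapping point raised in the thread, here is a minimal sketch of reading a file through a read-only memory map, using only the JDK (java.nio) and no Lucene. The class name MappedFileDemo, the method countLines, and the sample place names are made up for illustration and are not part of the geoentitylinker code. Lucene's MMapDirectory relies on the same OS-level file mapping, but this sketch does not show its API.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, for illustration only.
final class MappedFileDemo {

    // Maps the whole file read-only and counts its newline bytes.
    // The mapped pages are backed by the OS page cache, not the Java heap,
    // so they are not scanned or moved by the garbage collector.
    static long countLines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            // A single mapping is limited to Integer.MAX_VALUE bytes;
            // a multi-GB file would need several mappings (not shown here).
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample data standing in for a gazetteer file.
        Path tmp = Files.createTempFile("gaz-sample", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "Paris\nLondon\nBerlin\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(tmp));
    }
}
```

Because the mapped pages sit in the OS page cache rather than on the Java heap, several JVMs on one node that map the same file read-only can share those pages, which is relevant to the one-JVM-per-task concern mentioned above.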
