Re: request for Input or ideas.... EntityLinker tickets
I have never used UIMA, but I have heard good things. All the analytics processes I run are in Hadoop MapReduce, and there are cascading jobs that do many different things. However, UIMA sounds like a good fit for a solution wrapper, and I understand and agree with your concern about creating classes which combine components. I would like to try it in UIMA. Where in the UIMA project do I start?

On Tue, Oct 22, 2013 at 2:29 PM, Jörn Kottmann kottm...@gmail.com wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 4. provide a solution wrapper for the Geotagging capability
>>
>> In order to make the GeoTagging a bit more functional out of the box, I was thinking of creating a class on which one calls find(MaxentModel, doc, sentencedetector, EntityLinkerProperties) to abstract the current impl. I know this is not standard practice, I just want to see what you all think. This would make it easier to get this thing running.
>
> What do you think about using a solution like UIMA to do this? I am not sure how you intend to run your NLP pipelines, but in my experience that has worked out really well. UIMA can help to solve some production problems like scalability, error handling, etc.
>
> If you are interested in this you could write an Analysis Engine for the Entity Linker and add it to opennlp-uima.
>
> I still believe it is not a good idea to make classes which combine components to use them out of the box, because that never really suits all of our users, and it is easy to implement inside a user project.
>
> Anyway, we should add command line support and implement a class which demonstrates how the entity linker works, in a similar fashion to our other command line tools.
>
> Jörn
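For readers following the thread, here is a minimal sketch of what "combining the components inside a user project" could look like. It does not use the OpenNLP or UIMA APIs: the interfaces SentenceSplitter, LocationFinder and ToponymLinker, the find method, and the "gazetteer:" ids are all hypothetical stand-ins for the sentence detector, the name finder and the entity linker discussed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GeoTaggingWrapperSketch {

    /** Hypothetical stand-in for a sentence detector. */
    public interface SentenceSplitter { List<String> split(String doc); }

    /** Hypothetical stand-in for a name finder that returns location mentions. */
    public interface LocationFinder { List<String> find(String sentence); }

    /** Hypothetical stand-in for the entity linker's gazetteer lookup. */
    public interface ToponymLinker { List<String> link(String mention); }

    /** Runs the three steps in order: split, find mentions, link each mention. */
    public static List<String> find(String doc, SentenceSplitter splitter,
                                    LocationFinder finder, ToponymLinker linker) {
        List<String> linked = new ArrayList<>();
        for (String sentence : splitter.split(doc)) {
            for (String mention : finder.find(sentence)) {
                linked.addAll(linker.link(mention));
            }
        }
        return linked;
    }

    /** Wires the pipeline with toy implementations (all made up for illustration). */
    public static List<String> demo() {
        SentenceSplitter splitter = new SentenceSplitter() {
            public List<String> split(String doc) {
                // Split after each full stop that is followed by whitespace.
                return Arrays.asList(doc.split("(?<=\\.)\\s+"));
            }
        };
        LocationFinder finder = new LocationFinder() {
            public List<String> find(String sentence) {
                List<String> hits = new ArrayList<>();
                for (String place : new String[] {"Kabul", "Paris"}) {
                    if (sentence.contains(place)) hits.add(place);
                }
                return hits;
            }
        };
        ToponymLinker linker = new ToponymLinker() {
            public List<String> link(String mention) {
                return Arrays.asList("gazetteer:" + mention);
            }
        };
        return find("She flew from Paris. She landed in Kabul.", splitter, finder, linker);
    }
}
```

Because the steps are passed in as interfaces, each user project can swap in its own models or lookup backend, which is the flexibility concern raised in the reply above.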
Re: request for Input or ideas.... EntityLinker tickets
The database is only about 3GB of storage right now. Since I used pure JDBC and JDBC-style stored proc calls, it can run with any JDBC driver, and all the connection props are in the EntityLinkerProperties file, so it can run on other database engines.

Currently it is optional to use the MySQL fuzzy string matching; all one has to do is change the stored proc to boolean mode rather than natural language mode. If you really mean whether we have to use MySQL FULL TEXT *INDEXING*, then no, but with around 10 million toponyms it provides super fast lookups without consuming a lot of memory.

If I were running the OpenNLP GeoEntityLinker in, say, MapReduce, with multiple tasks on each node, I would not want to pull 3GB into memory for each task. The way it is now, one could distribute MySQL to each node via something like Puppet, and it would serve requests from the tasks on that node. Or, if they have a beefy server, they could make one large instance of MySQL and have each node in the cluster connect to it.

All that being said, it is totally possible to run an in-memory version of the gazetteer. Personally, I like the DB approach; it provides a lot of flexibility and power.

On Tue, Oct 22, 2013 at 2:39 PM, Jörn Kottmann kottm...@gmail.com wrote:

> On 10/05/2013 11:58 PM, Mark G wrote:
>
>> 3. fuzzy string matching should be part of the scoring, this would allow mysql fuzzy search to return more candidate toponyms.
>>
>> Currently, the search into the MySQL gazetteers uses boolean mode, and each NER result is passed in as a literal string. If I implement a fuzzy string matching based score (do we have one?), the user could turn on natural language mode in MySQL, and then we can generate a score and threshold to allow for more recall on transliterated names etc. I would also like to use proximity to the majority of points in the document as a disambiguation criterion.
>
> It would probably be nice if this would work with other databases too, e.g. Apache Derby, or some in-memory database, maybe even Lucene.
>
> Would it be possible to not use the MySQL fuzzy string matching feature for this?
>
> I would like to run your code, but it's difficult to scale the MySQL database in my scenario. I have lots of RAM, though, and believe the geonames dataset could fit into it to provide super fast lookups for me on my worker servers.
>
> Jörn
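To make the "fuzzy score plus threshold" idea above concrete, here is a minimal sketch, assuming a normalized Levenshtein similarity as the score. This is not the scorer used in the entity linker, and the thread does not say which metric was intended; the class name, method names and the 0.8 threshold in the usage note are illustrative assumptions. It runs purely in memory, so it does not depend on MySQL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FuzzyToponymScoreSketch {

    /** Levenshtein edit distance, computed with two rolling rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the latest row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive similarity in [0, 1]; 1.0 means identical strings. */
    public static double similarity(String mention, String candidate) {
        String a = mention.toLowerCase(Locale.ROOT);
        String b = candidate.toLowerCase(Locale.ROOT);
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    /** Keeps only the candidates whose similarity reaches the threshold. */
    public static List<String> filter(String mention, List<String> candidates,
                                      double threshold) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (similarity(mention, candidate) >= threshold) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With a threshold of 0.8, a mention such as "Kandahar" would keep the transliteration variant "Qandahar" (one edit over eight characters) and drop an unrelated candidate such as "Kabul". A looser search mode could return many candidates, and a score like this would then trim them.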
Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java
Hi guys

I would vote for Java 7, as well.

Thank you.

BR,
Ioan

On Wed, Oct 23, 2013 at 6:24 PM, Mark G giaconiam...@gmail.com wrote:

> agree, straight to 7 makes sense to me... try with resources, better collections support, switch on strings etc all new in 7
>
> On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann kottm...@gmail.com wrote:
>
>> On 10/23/2013 01:21 PM, Mark G wrote:
>>
>>> When will we move to 6?
>>
>> I don't have any strong opinions about moving forward. Some say it might be better to move directly to Java 7 or even wait for Java 8.
>>
>> There are not that many interesting new features in Java 6, that's why I believe it might be worth making a bigger step to avoid one or two versions.
>>
>> Any opinions? Do we still have a Java 5 user here?
>>
>> Jörn
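For anyone unfamiliar with the Java 7 features named above, here is a small sketch of switch on strings, try-with-resources and the diamond operator. The class, the method names and the mode labels ("boolean", "natural") are made up for illustration and are not taken from the OpenNLP code base.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class Java7FeaturesSketch {

    /** Switch on strings: before Java 7 this needed an if/else chain. */
    public static String describeSearchMode(String mode) {
        switch (mode) {
            case "boolean":
                return "literal match";
            case "natural":
                return "fuzzy match";
            default:
                return "unknown mode";
        }
    }

    /** try-with-resources: the reader is closed automatically, even on error. */
    public static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>(); // diamond operator, also new in Java 7
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked for simplicity.
            throw new IllegalStateException(e);
        }
        return lines;
    }
}
```

On Java 5 or 6 the same readLines method would need an explicit finally block to close the reader.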
Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java
+1 to move to 6 or 7 for the next major release. We can ask what our users think of it.

On Wednesday, October 23, 2013, Ioan Barbulescu wrote:

> Hi guys
>
> I would vote for Java 7, as well.
>
> Thank you.
>
> BR,
> Ioan
>
> On Wed, Oct 23, 2013 at 6:24 PM, Mark G giaconiam...@gmail.com wrote:
>
>> agree, straight to 7 makes sense to me... try with resources, better collections support, switch on strings etc all new in 7
>>
>> On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann kottm...@gmail.com wrote:
>>
>>> On 10/23/2013 01:21 PM, Mark G wrote:
>>>
>>>> When will we move to 6?
>>>
>>> I don't have any strong opinions about moving forward. Some say it might be better to move directly to Java 7 or even wait for Java 8.
>>>
>>> There are not that many interesting new features in Java 6, that's why I believe it might be worth making a bigger step to avoid one or two versions.
>>>
>>> Any opinions? Do we still have a Java 5 user here?
>>>
>>> Jörn

--
William Colen