Re: request for Input or ideas.... EntityLinker tickets

2013-10-23 Thread Mark G
I have never used UIMA, but I have heard good things. All the analytics
processes I run are in Hadoop MapReduce, with cascading jobs that do many
different things. Still, UIMA sounds like a good fit for a solution
wrapper, and I understand and agree with your concern about creating
classes which combine components.
I would like to try it in UIMA. Where in the UIMA project do I start?


On Tue, Oct 22, 2013 at 2:29 PM, Jörn Kottmann kottm...@gmail.com wrote:

 On 10/05/2013 11:58 PM, Mark G wrote:

 4. provide a solution wrapper for the Geotagging capability

 In order to make the GeoTagging a bit more functional out of the box, I
 was thinking of creating a class with a single method, find(MaxentModel, doc,
 sentencedetector, EntityLinkerProperties), that abstracts the current impl. I
 know this is not standard practice; I just want to see what you all think.
 This would make it easier to get this thing running.



 What do you think about using a solution like UIMA to do this? I am not
 sure how you intend to run your NLP pipelines, but in my experience UIMA
 has worked out really well. It can help to solve some production problems
 like scalability, error handling, etc.

 If you are interested in this you could write an Analysis Engine for the
 Entity Linker and add it to opennlp-uima.

 I still believe it is not a good idea to make classes which combine
 components to use them out of the box, because that never really suits
 all of our users, and it is easy to implement inside a user project.

 Anyway we should add command line support and implement a class which can
 demonstrate how the entity linker works in a similar fashion as our other
 command line tools.

 Jörn



Re: request for Input or ideas.... EntityLinker tickets

2013-10-23 Thread Mark G
The database is only about 3GB of storage right now. Since I used pure JDBC
and JDBC-style stored proc calls, it can run with any JDBC driver, and all
the connection props are in the EntityLinkerProperties file, so it can run
on other database engines. Using the MySQL fuzzy string matching is
currently optional: all one has to do to turn it off is change the stored
proc to boolean mode rather than natural language mode. If you are really
asking whether we have to use MySQL FULL TEXT *INDEXING*, then no, but with
around 10 million toponyms it provides super fast lookups without consuming
a lot of memory. If I were running the OpenNLP GeoEntityLinker in, say,
MapReduce, with multiple tasks on each node, I would not want to pull 3GB
into memory for each task. The way it is now, one could distribute MySQL to
each node via something like Puppet, and it would serve requests from the
tasks on that node. Or, with a beefy server, one could run one large
instance of MySQL and have each node in the cluster connect to it.
All that being said, it is totally possible to run an in-memory version of
the gazetteer. Personally, I like the DB approach; it provides a lot of
flexibility and power.
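
As a rough illustration of the in-memory option, the sketch below hides the
lookup behind a small interface so that a JDBC-backed and a heap-backed
gazetteer could be swapped. The names Gazetteer, Toponym, and
InMemoryGazetteer are hypothetical and are not part of OpenNLP; exact,
case-insensitive name matching is an assumption made to keep the example
short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup abstraction; not an OpenNLP interface. */
interface Gazetteer {
    List<Toponym> find(String name);
}

/** Hypothetical gazetteer entry: a place name with its coordinates. */
final class Toponym {
    final String name;
    final double latitude;
    final double longitude;

    Toponym(String name, double latitude, double longitude) {
        this.name = name;
        this.latitude = latitude;
        this.longitude = longitude;
    }
}

/** Keeps every entry on the heap, keyed by the trimmed, lower-cased name. */
final class InMemoryGazetteer implements Gazetteer {
    private final Map<String, List<Toponym>> byName = new HashMap<>();

    void add(Toponym toponym) {
        String key = normalize(toponym.name);
        List<Toponym> bucket = byName.get(key);
        if (bucket == null) {
            bucket = new ArrayList<>();
            byName.put(key, bucket);
        }
        bucket.add(toponym);
    }

    @Override
    public List<Toponym> find(String name) {
        List<Toponym> hits = byName.get(normalize(name));
        if (hits == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(hits);
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}

final class GazetteerDemo {
    public static void main(String[] args) {
        InMemoryGazetteer gazetteer = new InMemoryGazetteer();
        gazetteer.add(new Toponym("Paris", 48.8566, 2.3522));   // France
        gazetteer.add(new Toponym("Paris", 33.6609, -95.5555)); // Texas
        // Both entries share a name, so both come back as candidates.
        System.out.println(gazetteer.find("paris").size());
    }
}
```

A database-backed implementation of the same interface would let each
deployment choose between memory use and lookup infrastructure without
touching the calling code.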


On Tue, Oct 22, 2013 at 2:39 PM, Jörn Kottmann kottm...@gmail.com wrote:

 On 10/05/2013 11:58 PM, Mark G wrote:

 3. fuzzy string matching should be part of the scoring, this would allow
 mysql fuzzy search to return more candidate toponyms.

 Currently, the search into the MySQL gazetteers uses boolean mode, and
 each NER result is passed in as a literal string. If I implement a score
 based on fuzzy string matching (do we have one?), the user could turn on
 natural language mode in MySQL, and we could then generate a score and a
 threshold to allow for more recall on transliterated names, etc.
 I would also like to use proximity to the majority of points in the
 document as a disambiguation criterion.
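
One possible shape for such a score, sketched under assumptions: the thread
does not name a metric, so this example uses Levenshtein edit distance
normalized to the range 0..1 and compares it against a caller-supplied
threshold. The class and method names are hypothetical and not part of
OpenNLP.

```java
import java.util.Locale;

/** Hypothetical fuzzy name score; Levenshtein is an assumed choice of metric. */
final class FuzzyNameScore {

    /** Classic two-row dynamic-programming edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Case-insensitive similarity: 1.0 for identical names, 0.0 for no overlap. */
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / longest;
    }

    /** Keeps a gazetteer candidate only if it scores at or above the threshold. */
    static boolean accept(String mention, String candidate, double threshold) {
        return similarity(mention, candidate) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Kiev", "Kyiv"));
        System.out.println(accept("Kiev", "Kyiv", 0.8));
    }
}
```

Lowering the threshold trades precision for recall, which is the knob the
natural language mode search would need once it returns more candidates.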


 It would probably be nice if this would work with other databases too,
 e.g. Apache Derby,
 or some in-memory database, maybe even Lucene.

 Would it be possible to not use the MySQL fuzzy string matching feature
 for this?

 I would like to run your code, but it's difficult to scale the MySQL
 database in my scenario. I have lots of RAM, though, and believe the
 geonames dataset could fit into it to provide super fast lookups on my
 worker servers.

 Jörn



Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-23 Thread Ioan Barbulescu
Hi guys

I would vote for Java 7 as well.

Thank you.

BR,
Ioan


On Wed, Oct 23, 2013 at 6:24 PM, Mark G giaconiam...@gmail.com wrote:

 agree, straight to 7 makes sense to me... try-with-resources, better
 collections support, switch on strings, etc. are all new in 7
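
For readers who have not used them, here is a small sketch of two of the
Java 7 features named above, try-with-resources and switch on strings,
plus the diamond operator, which also arrived in Java 7. The class name
and the entity type labels are made up for illustration and do not come
from the OpenNLP code base.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo class; requires Java 7 or later to compile. */
final class Java7Features {

    /** Switch on strings: maps an illustrative entity type label to a short code. */
    static String typeCode(String entityType) {
        switch (entityType) {
            case "location":
                return "LOC";
            case "person":
                return "PER";
            default:
                return "MISC";
        }
    }

    /** Try-with-resources: the reader is closed automatically, even on error. */
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>(); // diamond operator
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Wrapped so callers of this demo need no checked exception handling.
            throw new IllegalStateException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(typeCode("location"));
        System.out.println(readLines("Berlin\nParis").size());
    }
}
```

Under Java 5 or 6 the same code needs an if/else chain on equals(), an
explicit finally block to close the reader, and repeated type arguments.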


 On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann kottm...@gmail.com wrote:

  On 10/23/2013 01:21 PM, Mark G wrote:
 
  When will we move to 6?
 
 
  I don't have any strong opinions about moving forward. Some say it might
  be better to move directly to Java 7 or even wait for Java 8.
  
  There are not that many interesting new features in Java 6, which is why
  I believe it might be worth taking a bigger step and skipping one or two
  versions.
 
  Any opinions? Do we still have a Java 5 user here?
 
  Jörn
 



Re: svn commit: r1534864 - /opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/entitylinker/GeoHashBinScorer.java

2013-10-23 Thread William Colen
+1 to moving to 6 or 7 for the next major release. We can ask what our users
think of it.

On Wednesday, October 23, 2013, Ioan Barbulescu wrote:

 Hi guys

 I would vote for Java 7 as well.

 Thank you.

 BR,
 Ioan


 On Wed, Oct 23, 2013 at 6:24 PM, Mark G giaconiam...@gmail.com wrote:

  agree, straight to 7 makes sense to me... try-with-resources, better
  collections support, switch on strings, etc. are all new in 7
 
 
  On Wed, Oct 23, 2013 at 8:36 AM, Jörn Kottmann kottm...@gmail.com wrote:
 
   On 10/23/2013 01:21 PM, Mark G wrote:
  
   When will we move to 6?
  
  
   I don't have any strong opinions about moving forward. Some say it might
   be better to move directly to Java 7 or even wait for Java 8.
   
   There are not that many interesting new features in Java 6, which is why
   I believe it might be worth taking a bigger step and skipping one or two
   versions.
  
   Any opinions? Do we still have a Java 5 user here?
  
   Jörn
  
 



-- 
William Colen