Hi Benedikt!
Thanks for bootstrapping the project :)
> - spellchecker package: nice idea, which I haven't thought about before.
> Furthermore I could imagine a hyphenation package. Both should be locale
> dependent.
Great idea, this hyphenation package, +1. I already have a use case for it in a
production tool that parses some crazy PDFs and uses OpenNLP. We have cases
where we need to check hyphenation before tagging the words, and at the moment
our solution is not very elegant.

If we manage to create a common API for spell checking we could, perhaps, create
implementations that call Hunspell and Jazzy, which have artifacts in Maven
Central and are fairly easy to use. I'm not sure whether those implementations
fit in [text]; maybe only the common interface does.
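
A minimal sketch of what such a common interface might look like. Every name
here (SpellChecker, WordListSpellChecker, isCorrect, suggest) is hypothetical
and not taken from Hunspell, Jazzy or any existing code; the word-list class
only stands in for a real adapter, and its suggestion logic is deliberately
naive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical common interface; locale dependent, as suggested in the thread.
interface SpellChecker {
    Locale getLocale();
    boolean isCorrect(String word);
    List<String> suggest(String word);
}

// Toy implementation backed by a word list. A real implementation would
// delegate to a library such as Hunspell or Jazzy instead.
final class WordListSpellChecker implements SpellChecker {
    private final Locale locale;
    private final Set<String> words = new HashSet<>();

    WordListSpellChecker(Locale locale, List<String> knownWords) {
        this.locale = locale;
        for (String w : knownWords) {
            words.add(w.toLowerCase(locale));
        }
    }

    @Override
    public Locale getLocale() {
        return locale;
    }

    @Override
    public boolean isCorrect(String word) {
        return words.contains(word.toLowerCase(locale));
    }

    // Naive suggestions: known words of the same length that differ from
    // the input in exactly one position, in alphabetical order.
    @Override
    public List<String> suggest(String word) {
        String w = word.toLowerCase(locale);
        List<String> result = new ArrayList<>();
        for (String candidate : words) {
            if (candidate.length() == w.length() && mismatches(candidate, w) == 1) {
                result.add(candidate);
            }
        }
        Collections.sort(result);
        return result;
    }

    private static int mismatches(String a, String b) {
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                count++;
            }
        }
        return count;
    }
}
```

Callers would depend only on SpellChecker, so the backing library could be
swapped without touching their code.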

> - Looking at EditDistance [3] I'm not sure we need T extends Number, if the
> only possible values for T are Integer and Double. Maybe we only need an
> IntegerEditDistance and a DoubleEditDistance.
Could be. As the code was written quickly as a proof of concept for a
customer, there are definitely parts that need further thinking. I'd be fine with
either T extends Number or IntegerEditDistance and DoubleEditDistance.
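
To make the two options concrete, here is a rough sketch. The method name
compare and the class LevenshteinDistance are assumptions for illustration
only; they are not copied from the EditDistance code in [3] or from [lang].

```java
// Option A: one generic interface, with the result type as a type parameter.
interface EditDistance<T extends Number> {
    T compare(CharSequence left, CharSequence right);
}

// Option B: two specialised interfaces that return primitives.
interface IntegerEditDistance {
    int compare(CharSequence left, CharSequence right);
}

interface DoubleEditDistance {
    double compare(CharSequence left, CharSequence right);
}

// Plain Levenshtein distance (no threshold), written against option B.
final class LevenshteinDistance implements IntegerEditDistance {
    @Override
    public int compare(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int[] curr = new int[right.length() + 1];
        // Distance from the empty prefix of left to each prefix of right.
        for (int j = 0; j <= right.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[right.length()];
    }
}
```

With option B the result stays a primitive int. If the generic form is still
wanted somewhere, a method reference can adapt it, for example
`EditDistance<Integer> boxed = new LevenshteinDistance()::compare;`.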

> Regarding the last point: I'm currently not fond that there is a common
> interface for EditingDistance algorithms. For example Levenshtein has the
> optional threshold parameter, which Jaro-Winkler has not (at least judging
> from the implementation in [lang]). Fuzzy Distance needs a locale for
> uncapitalizing. I think finding an interface that fits them all will be
> difficult to accomplish... But we'll see :-)
I had thought about just a marker interface, so that I could write some code to
scan the classpath for implementations of this interface and let the user
decide which one to use for their data quality job (regardless of the different
parameters used in each algorithm).
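
A small sketch of the marker-interface idea. The implementing classes and the
StringMetricScanner helper are hypothetical, and instead of a real classpath
scan the helper just filters a list of candidate classes that is handed to it;
java.util.ServiceLoader or a scanning library could supply the candidates in
practice.

```java
import java.util.ArrayList;
import java.util.List;

// Marker interface: declares no methods, it only tags a class as a metric.
interface StringMetric {
}

// Hypothetical implementations; each could take its own parameters.
final class LevenshteinMetric implements StringMetric {
}

final class JaroWinklerMetric implements StringMetric {
}

// An unrelated class that the filter should skip.
final class NotAMetric {
}

final class StringMetricScanner {
    private StringMetricScanner() {
    }

    // Keeps only the concrete classes that carry the marker interface.
    static List<Class<? extends StringMetric>> filter(List<Class<?>> candidates) {
        List<Class<? extends StringMetric>> found = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            if (StringMetric.class.isAssignableFrom(candidate) && !candidate.isInterface()) {
                found.add(candidate.asSubclass(StringMetric.class));
            }
        }
        return found;
    }
}
```

The resulting list could then be shown to the user, who picks an algorithm and
supplies whatever parameters that particular implementation needs.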

I shamelessly stole the name StringMetric from this Wikipedia article [1], but 
maybe we could find a better name for it?
Thanks again Benedikt!
Bruno
[1] http://en.wikipedia.org/wiki/String_metric

 
From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Wednesday, November 12, 2014 10:34 AM
Subject: [text] Incorporating Bruno Kinoshita's work

Hi,

the git repo for [text] is ready and I've done the initial bootstrapping
already. I've also created a new component in the SANDBOX jira project. The
first issue is to extract algorithms from [lang] [1]. I remember people
saying that there is code in codec too. Please feel free to create
tickets for this.

Bruno already has some code that may fit into [text] [2]. I've given it a
brief review and here are a few things which caught my eye:

- Inclusion of Talend code into [text] is not possible (the code is
licensed by www.talend.com)
- spellchecker package: nice idea, which I haven't thought about before.
Furthermore I could imagine a hyphenation package. Both should be locale
dependent.
- Looking at EditDistance [3] I'm not sure we need T extends Number, if the
only possible values for T are Integer and Double. Maybe we only need an
IntegerEditDistance and a DoubleEditDistance.

Regarding the last point: I'm currently not fond that there is a common
interface for EditingDistance algorithms. For example Levenshtein has the
optional threshold parameter, which Jaro-Winkler has not (at least judging
from the implementation in [lang]). Fuzzy Distance needs a locale for
uncapitalizing. I think finding an interface that fits them all will be
difficult to accomplish... But we'll see :-)

Regards,
Benedikt

[1] https://issues.apache.org/jira/browse/SANDBOX-483
[2]
https://github.com/kinow/text/tree/master/src/main/java/text/string_metric
[3]
https://github.com/kinow/text/blob/master/src/main/java/text/string_metric/EditDistance.java

-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter


   
 
