Hi Benedikt! Thanks for bootstrapping the project :)

> - spellchecker package: nice idea, which I haven't thought about before.
> Furthermore I could imagine a hyphenation package. Both should be locale
> dependent.

Great idea, this hyphenation package, +1. I already have a use case in a production tool that parses some crazy PDFs and uses OpenNLP. We have cases where we need to check hyphenation before tagging the words, and at the moment what we have is a not-so-elegant solution.

If we manage to create a common API for spellcheckers, we could perhaps create implementations that call hunspell and jazzy, which have artefacts in Maven Central and are fairly easy to use. Not sure if that fits in [text]; maybe only the common interface does.

> - Looking at EditDistance [3] I'm not sure we need T extends Number, if the
> only possible values for T are Integer and Double. Maybe we only need an
> IntegerEditDistance and a DoubleEditDistance.

Could be. As the code was quickly written for a proof of concept for a customer, there are definitely parts that need further thinking. I'd be fine with either T extends Number or IntegerEditDistance and DoubleEditDistance.

> Regarding the last point: I'm currently not fond of there being a common
> interface for EditingDistance algorithms. For example Levenshtein has the
> optional threshold parameter, which Jaro-Winkler has not (at least judging
> from the implementation in [lang]). Fuzzy Distance needs a locale for
> uncapitalizing. I think finding an interface that fits them all will be
> difficult to accomplish... But we'll see :-)

I had thought about just a marker interface. That way I could write some code to scan the classpath looking for implementations of this interface and let the user decide which one to use for his data quality job (regardless of the different parameters used in each algorithm). I shamelessly stole the name StringMetric from this Wikipedia article [1], but maybe we could find a better name for it?

Thanks again Benedikt!
Bruno

[1] http://en.wikipedia.org/wiki/String_metric

From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Wednesday, November 12, 2014 10:34 AM
Subject: [text] Incorporating Bruno Kinoshita's work

Hi,

the git repo for [text] is ready and I've done the initial bootstrapping already. I've also created a new component in the SANDBOX jira project. The first issue is to extract algorithms from [lang] [1].
I remember people saying that there is code in codec too. Please feel free to create tickets for this.

Bruno already has some code that may fit into [text] [2]. I've given it a brief review and here are a few things which caught my eye:

- Inclusion of Talend code into [text] is not possible (the code is licensed by www.talend.com)
- spellchecker package: nice idea, which I haven't thought about before. Furthermore I could imagine a hyphenation package. Both should be locale dependent.
- Looking at EditDistance [3] I'm not sure we need T extends Number, if the only possible values for T are Integer and Double. Maybe we only need an IntegerEditDistance and a DoubleEditDistance.

Regarding the last point: I'm currently not fond of there being a common interface for EditingDistance algorithms. For example Levenshtein has the optional threshold parameter, which Jaro-Winkler has not (at least judging from the implementation in [lang]). Fuzzy Distance needs a locale for uncapitalizing. I think finding an interface that fits them all will be difficult to accomplish... But we'll see :-)

Regards,
Benedikt

[1] https://issues.apache.org/jira/browse/SANDBOX-483
[2] https://github.com/kinow/text/tree/master/src/main/java/text/string_metric
[3] https://github.com/kinow/text/blob/master/src/main/java/text/string_metric/EditDistance.java

--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
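
To make the marker-interface idea from this thread more concrete, here is a minimal Java sketch. It is only an illustration under assumptions: apart from the name StringMetric, every name in it (LevenshteinSketch, PrefixSimilaritySketch, findMetrics) is hypothetical, the prefix metric is a toy stand-in rather than Jaro-Winkler, and a hard-coded candidate list stands in for real classpath scanning. It is not the code from [2] or from [lang].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StringMetricSketch {

    /** Marker interface: declares no methods, it only tags a class as a string metric. */
    interface StringMetric {}

    /** Hypothetical integer-valued metric with an optional threshold. */
    static final class LevenshteinSketch implements StringMetric {
        private final int threshold; // a negative value means "no threshold"

        LevenshteinSketch(int threshold) {
            this.threshold = threshold;
        }

        /** Returns the edit distance, or -1 if it exceeds the threshold. */
        int distance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; // swap rows so prev always holds the latest row
                prev = curr;
                curr = tmp;
            }
            int d = prev[b.length()];
            return (threshold >= 0 && d > threshold) ? -1 : d;
        }
    }

    /** Hypothetical double-valued metric with no parameters (toy example). */
    static final class PrefixSimilaritySketch implements StringMetric {
        /** Length of the common prefix divided by the longer string's length. */
        double similarity(String a, String b) {
            int max = Math.max(a.length(), b.length());
            if (max == 0) {
                return 1.0;
            }
            int n = Math.min(a.length(), b.length());
            int i = 0;
            while (i < n && a.charAt(i) == b.charAt(i)) {
                i++;
            }
            return (double) i / max;
        }
    }

    /** Keeps only concrete classes tagged with the marker interface. */
    static List<Class<?>> findMetrics(List<Class<?>> candidates) {
        List<Class<?>> found = new ArrayList<>();
        for (Class<?> c : candidates) {
            if (StringMetric.class.isAssignableFrom(c) && !c.isInterface()) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: this fixed list replaces a real classpath scan.
        List<Class<?>> candidates = Arrays.asList(
                String.class, LevenshteinSketch.class, PrefixSimilaritySketch.class);
        for (Class<?> metric : findMetrics(candidates)) {
            System.out.println("Available metric: " + metric.getSimpleName());
        }
    }
}
```

The two implementations deliberately have different method names, return types and constructor parameters. The marker only answers "is this a string metric?", so the caller still has to know how to configure and invoke whichever one the user picks, which is the trade-off raised above.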