On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
> 2013/7/15 Marcin Miłkowski <[email protected]>:
>> Hi Jaume,
>>
>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>> Hi, Marcin.
>>>
>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>> all the changes are there. Thank you.
>>
>> Great. We'll release 1.7.1, this is just a minor bug fix.
>>
>> BTW, when you see something you want to fix, just make a fork on github
>> to fix it, then file an issue, and then make a pull request associated
>> with that issue. That way, it will be much easier to develop the library
>> with your changes.
>
> I'll try to do it.
>
>> Also, it would be good if you found time to use a proper way of removing
>> duplicates (now we lose information from CandidateData that might be
>> significant for something - I know this is me being fussy, this is quite
>> clean).
>
> There are different ways to do it:
> - We could test for duplicates in addCandidate()...
> - "candidates" could be a Set, but then it needs to be converted to a
> List to be sorted...
Not really. We can use a TreeSet with a custom comparator:

http://stackoverflow.com/a/4165893

>
> If you want to keep the distance information outside Speller.java,
> that's a different matter.
>
>
> The next step for improving the suggestions would be to use a list of
> frequent words. I'm thinking of just a list of manually selected words
> or at most a few thousand words from a frequency dictionary.

Yes. Frequency dictionaries would be very useful. I think we can
represent frequency classes as ten ranges of percentages with 10 ASCII
characters (A-K), as this would be in the tradition of the fsa encoding.
So "A" would be the most common words (like 'the' and 'a' in English),
etc. I think we don't need to have a better resolution here.

Or we could simply use a numerical percentage in its decimal (rounded)
representation from 000 to 100. This, however, would make the dictionary
slightly bigger.

Regards,
Marcin

>
> Regards,
> Jaume
>
>
>> Regards,
>> Marcin
>>
>>>
>>> Now we need a release with the changes, and we'll be able to adapt the
>>> tests.
>>>
>>> Regards,
>>> Jaume
>>> Salutacions,
>>> Jaume Ortolà
>>> www.riuraueditors.cat
>>>
>>>
>>>
>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>> Thanks, Marcin.
>>>>>
>>>>> Some remarks. The improvements I sent to the list 15 days ago have not
>>>>> been added, and moreover I have found more bugs.
>>>> I'm really sorry but there are 200 mails from the mailing list over the
>>>> last two weeks and I have been away from my e-mail. Could you please add
>>>> your changes as issues on github for morfologik-stemming? This would
>>>> make it much easier for us to track these things.
>>>>
>>>>> I attach the code I'm using now and explain briefly the reasons for the
>>>>> changes.
>>>>>
>>>>> - In the getAllReplacements method we need to make sure that the
>>>>> replacements are done from left to right.
>>>>> We must complete the for-loop of the replacement pairs, choose the
>>>>> first possible replacement (from left to right) and then start the
>>>>> two new branches (with and without replacement). Otherwise, some
>>>>> replacements are not done.
>>>> OK, this sounds OK. I integrated your changes.
>>>>
>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>> uses a long string of s ("ssssssssss...") as input text, this can
>>>>> cause the method to consume all the memory, as the algorithm is
>>>>> exponential (2^(number of replacements)). This happened to us in an
>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>> Is that in getAllReplacements()?
>>>>
>>>>> - It is possible that different "words to check" give the same
>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>> at the end of findReplacements().
>>>> You are right. We could probably write the same code in a slightly more
>>>> elegant way, without converting this to a LinkedHashSet but simply by
>>>> adding to a set when iterating the list.
>>>>
>>>>> - The conditions around line 238 (current github version 1.7) are not
>>>>> correct. The first isInDictionary makes the lower case conversion
>>>>> useless:
>>>>>
>>>>> if (isInDictionary(wordChecked)
>>>>> && dictionaryMetadata.isConvertingCase()
>>>>> && isMixedCase(wordChecked)
>>>>> &&
>>>>> isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>>>>>
>>>>> I think they should be something like:
>>>>>
>>>>> if (isInDictionary(wordChecked)
>>>>> || (dictionaryMetadata.convertCase
>>>>> && isMixedCase(wordChecked)
>>>>> && isInDictionary(wordChecked
>>>>> .toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>> Fixed!
>>>>
>>>> I tried to add your fixes but your code is now quite far away from ours,
>>>> so diff does not give any meaningful output.
>>>> Please review the code on github, and if needed, file an issue for any
>>>> changes that still need to be made.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>>> Regards,
>>>>> Jaume Ortolà
>>>>> Salutacions,
>>>>> Jaume Ortolà
>>>>> www.riuraueditors.cat
>>>>>
>>>>>
>>>>>
>>>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go on
>>>>>>> and include a newer version in LT.
>>>>>>>
>>>>>>> The new version still does not support compounding but it has all the
>>>>>>> features required for getting better diacritic suggestions.
>>>>>> Here's the documentation:
>>>>>>
>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>
>>>>>> Best,
>>>>>> Marcin
>>>>>>
>>>>>>
>>>>>>> Best,
>>>>>>> Marcin
>>>>>>>
>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>> Hi Marcin,
>>>>>>>>>
>>>>>>>>> I have been using the still unreleased code of morfologik-stemming
>>>>>>>>> and I have made improvements to Speller.java for some previously
>>>>>>>>> unforeseen cases. See the attachment.
>>>>>>>>>
>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>> languages, perhaps we could temporarily include the morfologik module
>>>>>>>>> inside LanguageTool. This will make things easier. What do you think?
>>>>>>>> No. I should make a release, forking morfologik makes no sense to me.
>>>>>>>>
>>>>>>>> The only thing that stops me is the lack of time to work on compounds.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Marcin

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
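
A note on the TreeSet suggestion in the thread: the following is only a minimal sketch of how a TreeSet with a custom comparator could keep candidates both sorted and free of exact duplicates. Candidate is a hypothetical stand-in for CandidateData (the real class in Speller.java may look different), and the class and method names are made up for illustration. One caveat worth keeping in mind: a TreeSet treats any two elements that compare as 0 as the same element, so the comparator has to break ties on the word itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the speller's CandidateData; the real class may differ.
final class Candidate {
    final String word;
    final int distance;

    Candidate(String word, int distance) {
        this.word = word;
        this.distance = distance;
    }
}

final class CandidateSetSketch {
    // Order by distance first, then by word. Comparing by distance alone
    // would make the TreeSet drop distinct words that share a distance.
    static final Comparator<Candidate> BY_DISTANCE_THEN_WORD = (a, b) -> {
        int byDistance = Integer.compare(a.distance, b.distance);
        return byDistance != 0 ? byDistance : a.word.compareTo(b.word);
    };

    // Returns the candidate words in comparator order, exact duplicates removed.
    static List<String> sortedUniqueWords(List<Candidate> candidates) {
        TreeSet<Candidate> set = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        set.addAll(candidates);
        List<String> words = new ArrayList<>();
        for (Candidate c : set) {
            words.add(c.word);
        }
        return words;
    }
}
```

With this comparator the same word reached at two different distances would still appear twice, so a real implementation would probably need an extra check by word if that case can occur.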
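
A note on the two frequency encodings proposed in the thread: the sketch below shows one way each could look. Everything here is an assumption made for illustration: the input is taken to be a rounded percentage from 0 to 100 where 100 means "most frequent", the ten classes are mapped to the letters A through J (the message says ten characters but writes A-K, which spans eleven), and the names FrequencyClassSketch, encodeClass and encodeDecimal are hypothetical, not part of morfologik.

```java
import java.util.Locale;

final class FrequencyClassSketch {
    // Maps a percentage (0..100, 100 = most frequent, by assumption) to one of
    // ten single-letter classes, with 'A' for the most common words.
    static char encodeClass(int percent) {
        checkRange(percent);
        int bucket = Math.min(percent / 10, 9); // 0..9; 100 folds into the top bucket
        return (char) ('A' + (9 - bucket));
    }

    // Alternative encoding: the rounded percentage as three decimal digits.
    static String encodeDecimal(int percent) {
        checkRange(percent);
        return String.format(Locale.ROOT, "%03d", percent);
    }

    private static void checkRange(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent out of range: " + percent);
        }
    }
}
```

The single-letter form costs one character per entry and the decimal form three, which matches the remark that the second option would make the dictionary slightly bigger.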
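
A note on the memory problem with long runs such as "ssssssssss...": the sketch below illustrates the idea of taking the leftmost possible replacement, branching with and without it, and capping the recursion depth. It is not the actual getAllReplacements() code; ReplacementSketch, allReplacements and MAX_DEPTH are hypothetical names, and the cap of 4 is simply taken from the "4 or 5 levels" mentioned in the thread. As a simplification, when several keys match at the same position only one of them is tried.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReplacementSketch {
    // Assumed cap; with two branches per level this bounds the output at 2^4 = 16.
    static final int MAX_DEPTH = 4;

    static List<String> allReplacements(String word, Map<String, String> pairs) {
        List<String> out = new ArrayList<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, List<String> out) {
        // Find the leftmost occurrence of any key at or after position 'from'.
        int best = -1;
        String bestKey = null;
        for (String key : pairs.keySet()) {
            int idx = word.indexOf(key, from);
            if (idx >= 0 && (best < 0 || idx < best)) {
                best = idx;
                bestKey = key;
            }
        }
        // Stop when nothing is left to replace or the depth cap is reached.
        if (best < 0 || depth >= MAX_DEPTH) {
            out.add(word);
            return;
        }
        // Branch 1: leave this occurrence alone and continue to its right.
        expand(word, best + 1, depth + 1, pairs, out);
        // Branch 2: apply the replacement and continue after the inserted text.
        String replacement = pairs.get(bestKey);
        String replaced = word.substring(0, best) + replacement
                + word.substring(best + bestKey.length());
        expand(replaced, best + replacement.length(), depth + 1, pairs, out);
    }
}
```

The returned list can still contain the same string more than once, so duplicates would have to be removed afterwards, as discussed elsewhere in the thread.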
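
A note on removing duplicate suggestions "by adding to a set when iterating the list": a minimal sketch of that idea follows. It assumes suggestions are plain strings; DedupSketch and withoutDuplicates are hypothetical names and not the code in findReplacements().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    // Keeps the first occurrence of each suggestion and preserves list order.
    static List<String> withoutDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String s : suggestions) {
            // Set.add returns false when the element was already present.
            if (seen.add(s)) {
                unique.add(s);
            }
        }
        return unique;
    }
}
```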
