By the way, I could help with word frequencies for some languages, e.g. Portuguese, Spanish, Dutch.
Ruud

On 16-07-13 14:20, R.J. Baars wrote:
> Coding word frequencies as a character is fine. I think it would be
> classes, logarithmic as far as I am concerned.
>
> Ruud
>
>> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>> Hi Jaume,
>>>>
>>>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>>>> Hi, Marcin.
>>>>>
>>>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>>>> all the changes are there. Thank you.
>>>> Great. We'll release 1.7.1, this is just a minor bug fix.
>>>>
>>>> BTW, when you see something you want to fix, just make a fork on github
>>>> to fix it, then file an issue, and then make a pull request associated
>>>> with that issue. That way, it will be much easier to develop the library
>>>> with your changes.
>>> I'll try to do it.
>>>
>>>> Also, if you'll find time to use a proper way of removing duplicates
>>>> (now we lose information from CandidateData that might be significant
>>>> for something - I know this is me being fussy, this is quite clean).
>>> There are different ways to do it:
>>> - We could test for duplicates in addCandidate()...
>>> - "candidates" could be a Set, but then it needs to be converted to a
>>> List to be sorted...
>> Not really. We can use a TreeSet with a custom comparator:
>>
>> http://stackoverflow.com/a/4165893
>>
>>> If you want to keep the distance information outside Speller.java,
>>> that's a different matter.
>>>
>>> The next step for improving the suggestions would be to use a list of
>>> frequent words. I'm thinking of just a list of manually selected words
>>> or at most a few thousand words from a frequency dictionary.
>> Yes. Frequency dictionaries would be very useful.
>>
>> I think we can represent frequency classes as ten ranges of percentages
>> with 10 ASCII characters (A-K), as this would be in the tradition of the
>> fsa encoding.
>> So "A" would be the most common words (like 'the' and 'a'
>> in English), etc. I think we don't need to have a better resolution here.
>>
>> Or we could simply use a numerical percentage in its decimal (rounded)
>> representation from 000 to 100. This, however, would make the dictionary
>> slightly bigger.
>>
>> Regards,
>> Marcin
>>
>>> Regards,
>>> Jaume
>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>>> Now we need a release with the changes, and we'll be able to adapt the
>>>>> tests.
>>>>>
>>>>> Regards,
>>>>> Jaume
>>>>> Salutacions,
>>>>> Jaume Ortolà
>>>>> www.riuraueditors.cat
>>>>>
>>>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>>>> Thanks, Marcin.
>>>>>>>
>>>>>>> Some remarks. The improvements I sent to the list 15 days ago have not
>>>>>>> been added, and moreover I have found more bugs.
>>>>>> I'm really sorry but there are 200 mails from the mailing list over the
>>>>>> last two weeks and I have been away from my e-mail. Could you please add
>>>>>> your changes as issues on github for morfologik-stemming? This way it
>>>>>> would make it much easier for us to track these things.
>>>>>>
>>>>>>> I attach the code I'm using now and explain briefly the reasons for
>>>>>>> the changes.
>>>>>>>
>>>>>>> - In the getAllReplacements method we need to make sure that the
>>>>>>> replacements are done from left to right. We must complete the
>>>>>>> for-loop of the replacement pairs, choose the first possible
>>>>>>> replacement (from left to right) and then start the two new branches
>>>>>>> (with and without replacement). Otherwise, some replacements are not
>>>>>>> done.
>>>>>> OK, this sounds OK. I integrated your changes.
>>>>>>
>>>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>>>> uses a long string of s ("ssssssssss...") as input text, this can
>>>>>>> cause the method to consume all the memory, as the algorithm is
>>>>>>> exponential (2^(number of replacements)). This happened to us in an
>>>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>>>> Is that in getAllReplacements()?
>>>>>>
>>>>>>> - It is possible that different "words to check" give the same
>>>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>>>> at the end of findReplacements().
>>>>>> You are right. We could probably write the same code in a slightly more
>>>>>> elegant way, without converting this to a LinkedHashSet but simply by
>>>>>> adding to a set when iterating the list.
>>>>>>
>>>>>>> - The conditions around line 238 (current github version 1.7) are not
>>>>>>> correct. The first isInDictionary makes the lower case conversion
>>>>>>> useless:
>>>>>>>
>>>>>>> if (isInDictionary(wordChecked)
>>>>>>>     && dictionaryMetadata.isConvertingCase()
>>>>>>>     && isMixedCase(wordChecked)
>>>>>>>     && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>>>>>>>
>>>>>>> I think they should be something like:
>>>>>>>
>>>>>>> if (isInDictionary(wordChecked)
>>>>>>>     || (dictionaryMetadata.convertCase
>>>>>>>         && isMixedCase(wordChecked)
>>>>>>>         && isInDictionary(wordChecked
>>>>>>>             .toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>>>> Fixed!
>>>>>>
>>>>>> I tried to add your fixes but your code is now quite far away from ours,
>>>>>> so diff does not give any meaningful output. Please review the code on
>>>>>> github, and if needed, file an issue over changes that need to be done.
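The depth limit and the duplicate removal discussed in the points above could look roughly like the following. This is only a sketch under stated assumptions and not the morfologik Speller.java code: the class name ReplacementSketch, the method allReplacements, the MAX_DEPTH constant and the Map-based representation of the replacement pairs are all hypothetical, and keys are assumed to be non-empty. It picks the leftmost possible replacement, branches with and without it, stops recursing after MAX_DEPTH levels, and collects results in a LinkedHashSet so that duplicates are dropped as they are added.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the actual Speller.getAllReplacements().
final class ReplacementSketch {
    // Assumed limit, following the "4 or 5 levels at most" suggestion.
    static final int MAX_DEPTH = 4;

    // Returns the input word plus its variants, without duplicates,
    // in the order in which they were first produced.
    static List<String> allReplacements(String word, Map<String, List<String>> pairs) {
        Set<String> out = new LinkedHashSet<>();
        expand(word, 0, 0, pairs, out);
        return new ArrayList<>(out);
    }

    private static void expand(String word, int from, int depth,
                               Map<String, List<String>> pairs, Set<String> out) {
        out.add(word); // the set silently ignores duplicates
        if (depth >= MAX_DEPTH) {
            return; // bounds the call tree to at most 2^(MAX_DEPTH+1) - 1 nodes per single-valued key
        }
        // Complete the loop over all keys and keep the leftmost match at or after 'from'.
        int bestIdx = -1;
        String bestKey = null;
        for (String key : pairs.keySet()) {
            int idx = word.indexOf(key, from);
            if (idx >= 0 && (bestIdx < 0 || idx < bestIdx)) {
                bestIdx = idx;
                bestKey = key;
            }
        }
        if (bestKey == null) {
            return; // nothing left to replace
        }
        // Branch 1: leave this occurrence alone and continue to the right of it.
        expand(word, bestIdx + 1, depth + 1, pairs, out);
        // Branch 2: apply each replacement and continue after the inserted text.
        for (String rep : pairs.get(bestKey)) {
            String replaced = word.substring(0, bestIdx) + rep
                    + word.substring(bestIdx + bestKey.length());
            expand(replaced, bestIdx + rep.length(), depth + 1, pairs, out);
        }
    }
}
```

With a single pair mapping "ss" to one replacement, a very long run of "s" then yields at most 31 candidates instead of growing with the input length. Whether both branches should count towards the depth, or only the branch that replaces, is a design choice left open here.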
>>>>>>
>>>>>> Regards,
>>>>>> Marcin
>>>>>>
>>>>>>> Regards,
>>>>>>> Jaume Ortolà
>>>>>>> Salutacions,
>>>>>>> Jaume Ortolà
>>>>>>> www.riuraueditors.cat
>>>>>>>
>>>>>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go on
>>>>>>>>> and include a newer version in LT.
>>>>>>>>>
>>>>>>>>> The new version still does not support compounding but it has all the
>>>>>>>>> features required for getting better diacritic suggestions.
>>>>>>>> Here's the documentation:
>>>>>>>>
>>>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Marcin
>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>>>> Hi Marcin,
>>>>>>>>>>>
>>>>>>>>>>> I have been using the still unreleased code of morfologik-stemming
>>>>>>>>>>> and I have made improvements to Speller.java for some previously
>>>>>>>>>>> unforeseen cases. See the attachment.
>>>>>>>>>>>
>>>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>>>> languages, perhaps we could temporarily include the morfologik
>>>>>>>>>>> module inside LanguageTool. This will make things easier. What do
>>>>>>>>>>> you think?
>>>>>>>>>> No. I should make a release, forking morfologik makes no sense to me.
>>>>>>>>>>
>>>>>>>>>> The only thing that stops me is the lack of time to work on
>>>>>>>>>> compounds.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Marcin
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> This SF.net email is sponsored by Windows:
>>>>>>>>>>
>>>>>>>>>> Build for Windows Store.
>>>>>>>>>>
>>>>>>>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
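For the frequency classes discussed at the top of this thread, a logarithmic coding into single characters might be sketched as follows. Everything here is an assumption for illustration: the names FrequencyClassSketch and frequencyClass, the choice of ten classes (which takes the letters A to J), and the rule that each class covers a halving of the count relative to the most frequent word. The thread does not fix any of these details.

```java
// Hypothetical sketch of logarithmic frequency classes coded as one character.
final class FrequencyClassSketch {
    // Assumed number of classes; ten classes map to the letters 'A'..'J'.
    static final int CLASSES = 10;

    // count:    occurrences of the word in some corpus
    // maxCount: occurrences of the most frequent word in the same corpus
    // Returns 'A' for the most common words; each later letter stands for
    // words that are at least twice as rare as the previous class.
    static char frequencyClass(long count, long maxCount) {
        if (count <= 0 || maxCount <= 0 || count > maxCount) {
            throw new IllegalArgumentException("need 0 < count <= maxCount");
        }
        int idx = 0;
        long c = count;
        // Integer doubling avoids floating-point rounding at class boundaries.
        // c <= maxCount / 2 guarantees that c * 2 cannot overflow.
        while (idx < CLASSES - 1 && c <= maxCount / 2) {
            c *= 2;
            idx++;
        }
        return (char) ('A' + idx);
    }
}
```

Under these assumptions a word seen 500 times against a maximum of 1000 falls into class B, and anything more than 512 times rarer than the top word ends up in the last class, J. A base other than 2 would give coarser or finer classes.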
