By the way, I could help with word frequencies for some languages,
e.g. Portuguese, Spanish, Dutch.

Ruud

On 16-07-13 14:20, R.J. Baars wrote:
> Coding word frequencies as a character is fine. I think these should be
> frequency classes, on a logarithmic scale as far as I am concerned.
>
> Ruud
>
>> On 2013-07-16 00:03, Jaume Ortolà i Font wrote:
>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>> Hi Jaume,
>>>>
>>>> On 2013-07-15 21:16, Jaume Ortolà i Font wrote:
>>>>> Hi, Marcin.
>>>>>
>>>>> I have tested the current code (1.8.0-SNAPSHOT) and everything is OK,
>>>>> all the changes are there. Thank you.
>>>> Great. We'll release 1.7.1, this is just a minor bug fix.
>>>>
>>>> BTW, when you see something you want to fix, just make a fork on github
>>>> to fix it, then file an issue, and then make a pull request associated
>>>> with that issue. That way, it will be much easier to develop the library
>>>> with your changes.
>>> I'll try to do it.
>>>
>>>> Also, if you find the time, please use a proper way of removing duplicates
>>>> (right now we lose information from CandidateData that might be significant
>>>> for something; I know this is me being fussy, this is quite clean).
>>> There are different ways to do it:
>>> - We could test for duplicates in addCandidate()...
>>> - "candidates" could be a Set, but then it needs to be converted to a
>>> List to be sorted...
>> Not really. We can use a TreeSet with a custom comparator:
>>
>> http://stackoverflow.com/a/4165893
>>
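A minimal sketch of the TreeSet idea, assuming a hypothetical Candidate class that holds a word and an edit distance (the real CandidateData in Speller.java may look different). The names Candidate, CandidateSetSketch and sortedUniqueWords are made up for illustration. The set stays sorted at all times, and two candidates that compare as equal are stored only once, so no conversion to a List is needed for sorting:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the speller's candidate class; field names are assumptions.
final class Candidate {
    final String word;
    final int distance;

    Candidate(String word, int distance) {
        this.word = word;
        this.distance = distance;
    }
}

final class CandidateSetSketch {
    // Orders by edit distance first, then alphabetically. Two candidates with the
    // same word and the same distance compare as equal, so the set keeps one of them.
    static final Comparator<Candidate> BY_DISTANCE_THEN_WORD =
            Comparator.<Candidate>comparingInt(c -> c.distance)
                      .thenComparing(c -> c.word);

    static List<String> sortedUniqueWords(List<Candidate> found) {
        TreeSet<Candidate> candidates = new TreeSet<>(BY_DISTANCE_THEN_WORD);
        candidates.addAll(found);
        List<String> words = new ArrayList<>();
        for (Candidate c : candidates) {
            words.add(c.word);
        }
        return words;
    }
}
```

With this comparator, the same word reached at two different distances would still appear twice. Deduplicating by word alone would need a different key, for example a map from word to its smallest distance.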
>>> If you want to keep the distance information outside Speller.java,
>>> that's a different matter.
>>>
>>>
>>> The next step for improving the suggestions would be to use a list of
>>> frequent words. I'm thinking of just a list of manually selected words
>>> or at most a few thousand words from a frequency dictionary.
>> Yes. Frequency dictionaries would be very useful.
>>
>> I think we can represent frequency classes as ten ranges of percentages
>> with 10 ASCII characters (A-K), as this would be in the tradition of the
>> fsa encoding. So "A" would be the most common words (like 'the' and 'a'
>> in English), etc. I think we don't need to have a better resolution here.
>>
>> Or we could simply use a numerical percentage in its decimal (rounded)
>> representation from 000 to 100. This, however, would make the dictionary
>> slightly bigger.
>>
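One possible reading of "frequency classes as a single character", sketched under assumptions: the class is computed logarithmically (base 2) from the ratio between the count of the most frequent word and the count of the word in question, and it is capped at ten classes. Ten classes map to the letters A to J (the range A-K would span eleven letters). The names FrequencyClassSketch and frequencyClass, the base-2 binning and the cap are all illustrative choices, not something the thread settled on:

```java
final class FrequencyClassSketch {
    static final int CLASS_COUNT = 10; // classes 'A' (most frequent) to 'J'

    // Returns the class letter for a word seen `count` times, where `maxCount`
    // is the count of the most frequent word in the same corpus.
    static char frequencyClass(long count, long maxCount) {
        if (count <= 0 || maxCount < count) {
            throw new IllegalArgumentException("need 0 < count <= maxCount");
        }
        // Integer floor(log2(maxCount / count)): count how many times `count`
        // can be doubled without exceeding maxCount, capped at the last class.
        int index = 0;
        long doubled = count;
        while (index < CLASS_COUNT - 1 && doubled <= maxCount / 2) {
            doubled *= 2;
            index++;
        }
        return (char) ('A' + index);
    }
}
```

A word as frequent as the top word gets 'A', a word half as frequent gets 'B', and everything at least 512 times rarer than the top word ends up in 'J'.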
>> Regards,
>> Marcin
>>
>>> Regards,
>>> Jaume
>>>
>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>>> Now we need a release with the changes, and we'll be able to adapt the
>>>>> tests.
>>>>>
>>>>> Regards,
>>>>> Jaume
>>>>> Salutacions,
>>>>> Jaume Ortolà
>>>>> www.riuraueditors.cat
>>>>>
>>>>>
>>>>>
>>>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>>>> On 2013-07-15 12:41, Jaume Ortolà i Font wrote:
>>>>>>> Thanks, Marcin.
>>>>>>>
>>>>>>> Some remarks. The improvements I sent to the list 15 days ago have not
>>>>>>> been added, and moreover I have found more bugs.
>>>>>> I'm really sorry but there are 200 mails from the mailing list over the
>>>>>> last two weeks and I have been away from my e-mail. Could you please add
>>>>>> your changes as issues on github for morfologik-stemming? That would
>>>>>> make it much easier for us to track these things.
>>>>>>
>>>>>>> I attach the code I'm using now and explain briefly the reasons for
>>>>>>> the changes.
>>>>>>>
>>>>>>> - In the getAllReplacements method we need to make sure that the
>>>>>>> replacements are done from left to right. We must complete the
>>>>>>> for-loop of the replacement pairs, choose the first possible
>>>>>>> replacement (from left to right) and then start the two new branches
>>>>>>> (with and without replacement). Otherwise, some replacements are not
>>>>>>> done.
>>>>>> OK, this sounds OK. I integrated your changes.
>>>>>>
>>>>>>> - If there is "ss" as a key in the replacement pairs, and somebody
>>>>>>> uses a long string of s ("ssssssssss...") as input text, this can
>>>>>>> cause the method to consume all the memory, as the algorithm is
>>>>>>> exponential (2^(number of replacements)). This happened to us in an
>>>>>>> online server, and the LT server crashed. The depth of the recursive
>>>>>>> algorithm should be limited to 4 or 5 levels at most.
>>>>>> Is that in getAllReplacements()?
>>>>>>
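The two points above (leftmost replacement first, then a branch with and a branch without the replacement, with a cap on recursion depth) can be sketched roughly as follows. This is not the code from Speller.java; the class name, the method names and the MAX_DEPTH value are assumptions for illustration. With a depth cap of d, at most 2^d strings are produced, however long the input is:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BoundedReplacementsSketch {
    static final int MAX_DEPTH = 4; // assumed cap; at most 2^4 = 16 results

    static List<String> allReplacements(String word, Map<String, String> pairs) {
        List<String> out = new ArrayList<>();
        expand(word, 0, 0, pairs, out);
        return out;
    }

    private static void expand(String word, int from, int depth,
                               Map<String, String> pairs, List<String> out) {
        int bestIndex = -1;
        String bestKey = null;
        if (depth < MAX_DEPTH) {
            // Complete the loop over all pairs and pick the leftmost match.
            for (String key : pairs.keySet()) {
                if (key.isEmpty()) {
                    continue; // empty keys would match everywhere
                }
                int i = word.indexOf(key, from);
                if (i >= 0 && (bestIndex < 0 || i < bestIndex)) {
                    bestIndex = i;
                    bestKey = key;
                }
            }
        }
        if (bestKey == null) {
            out.add(word); // nothing left to replace, or depth cap reached
            return;
        }
        String replacement = pairs.get(bestKey);
        String replaced = word.substring(0, bestIndex) + replacement
                + word.substring(bestIndex + bestKey.length());
        // Branch 1: apply the replacement and continue to the right of it.
        expand(replaced, bestIndex + replacement.length(), depth + 1, pairs, out);
        // Branch 2: leave this occurrence alone and continue one character on.
        expand(word, bestIndex + 1, depth + 1, pairs, out);
    }
}
```

The result can contain the unchanged input and duplicate strings, so a deduplication step is still needed afterwards. If two keys match at the same leftmost position, this sketch simply takes the one it sees first.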
>>>>>>> - It is possible that different "words to check" give the same
>>>>>>> suggestion. So at some point we need to remove duplicates. I do this
>>>>>>> at the end of findReplacements().
>>>>>> You are right. We could probably write the same code in a slightly more
>>>>>> elegant way, without converting this to a LinkedHashSet but simply by
>>>>>> adding to a set when iterating the list.
>>>>>>
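A small sketch of "adding to a set when iterating the list", assuming the suggestions are plain strings. DedupSketch and withoutDuplicates are illustrative names, not part of morfologik. Set.add returns false for an element that is already present, which makes a single pass enough and keeps the original order:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    // Keeps the first occurrence of each suggestion and preserves list order.
    static List<String> withoutDuplicates(List<String> suggestions) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String s : suggestions) {
            if (seen.add(s)) { // false when s was already seen
                unique.add(s);
            }
        }
        return unique;
    }
}
```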
>>>>>>> - The conditions around line 238 (current github version 1.7) are not
>>>>>>> correct. The first isInDictionary makes the lower case conversion
>>>>>>> useless:
>>>>>>>
>>>>>>>     if (isInDictionary(wordChecked)
>>>>>>>             && dictionaryMetadata.isConvertingCase()
>>>>>>>             && isMixedCase(wordChecked)
>>>>>>>             && isInDictionary(wordChecked.toLowerCase(dictionaryMetadata.getLocale())))
>>>>>>>
>>>>>>> I think they should be something like:
>>>>>>>
>>>>>>>     if (isInDictionary(wordChecked)
>>>>>>>             || (dictionaryMetadata.convertCase
>>>>>>>             && isMixedCase(wordChecked)
>>>>>>>             && isInDictionary(wordChecked
>>>>>>>                 .toLowerCase(dictionaryMetadata.dictionaryLocale))))
>>>>>> Fixed!
>>>>>>
>>>>>> I tried to add your fixes but your code is now quite far away from ours,
>>>>>> so diff does not give any meaningful output. Please review the code on
>>>>>> github, and if needed, file an issue over changes that need to be done.
>>>>>>
>>>>>> Regards,
>>>>>> Marcin
>>>>>>
>>>>>>> Regards,
>>>>>>> Jaume Ortolà
>>>>>>> Salutacions,
>>>>>>> Jaume Ortolà
>>>>>>> www.riuraueditors.cat
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2013/7/15 Marcin Miłkowski <[email protected]>:
>>>>>>>> On 2013-07-15 10:56, Marcin Miłkowski wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Dawid just released morfologik 1.7 on Maven. So we can actually go on
>>>>>>>>> and include a newer version in LT.
>>>>>>>>>
>>>>>>>>> The new version still does not support compounding but it has all the
>>>>>>>>> features required for getting better diacritic suggestions.
>>>>>>>> Here's the documentation:
>>>>>>>>
>>>>>>>> http://wiki.languagetool.org/hunspell-support#toc5
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Marcin
>>>>>>>>
>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>> On 2013-07-02 08:59, Marcin Miłkowski wrote:
>>>>>>>>>> On 2013-07-02 01:11, Jaume Ortolà i Font wrote:
>>>>>>>>>>> Hi Marcin,
>>>>>>>>>>>
>>>>>>>>>>> I have been using the still unreleased code of morfologik-stemming and
>>>>>>>>>>> I have made improvements to Speller.java for some previously unforeseen
>>>>>>>>>>> cases. See the attachment.
>>>>>>>>>>>
>>>>>>>>>>> In order to complete the development, and test & debug with all
>>>>>>>>>>> languages, perhaps we could temporarily include the morfologik module
>>>>>>>>>>> inside LanguageTool. This will make things easier. What do you think?
>>>>>>>>>> No. I should make a release, forking morfologik makes no sense to me.
>>>>>>>>>>
>>>>>>>>>> The only thing that stops me is the lack of time to work on compounds.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Marcin
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> This SF.net email is sponsored by Windows:
>>>>>>>>>>
>>>>>>>>>> Build for Windows Store.
>>>>>>>>>>
>>>>>>>>>> http://p.sf.net/sfu/windows-dev2dev
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Languagetool-devel mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> See everything from the browser to the database with AppDynamics
>>>>>>>> Get end-to-end visibility with application monitoring from
>>>>>>>> AppDynamics
>>>>>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>>>>>> Start your free trial of AppDynamics Pro today!
>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>>>>>> _______________________________________________
>>>>>>>> Languagetool-devel mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
