[Languagetool] presentation
Hi,

I'm interested in using LT as an advanced spell-checker for a document editor I'm building (Eclipse RCP). I exchanged a few mails with Daniel Naber in the forum and he redirected me here.

I'm a fairly seasoned programmer with a good command of Italian and, to a lesser degree, English. As a programmer I have a solid background in C, C++ and Java, ranging from Linux kernel hacking to GUI building (mostly Eclipse RCP and Qt). I am Italian and my focus is primarily on the Italian language.

I am building a plugin providing ISpellingEngine and related classes, compatible with all Eclipse installations. To this end I've found LT to be very promising, if somewhat immature (as may be expected).

Problems I've found are:

1. Italian rules raise way too many false positives, especially the tense concordance rule (GR_10_001). I am willing to help refine the rules.
2. The spell checker has no way to implement "ignore word" and "add to user dictionary", which is essential for interactive use. I know Daniel is working on a kind of ignore list using a file deep in the file hierarchy; this doesn't help interactive usage and IMHO is not a solution for any serious use case. I plan to support three independent dictionaries for each language in my application: standard, user, document.
3. LT doesn't really understand Unicode (this prevents usage on htLaTeX-generated docs):
   1. it does not understand ligatures.
   2. it does not understand the typographic apostrophe.
   3. it does not understand other special characters.
4. LT's understanding of XML is very limited, as it does not understand &xxx; constructs (this prevents usage on my document source).

I haven't yet dug deep into the LT code, but I have some ideas I wish to share. To overcome the above difficulties I propose the following actions (but I'm open to other suggestions, of course):

1. I can help refine the rules for Italian, as a tester at the beginning, and more actively if and when I learn how to write efficient rules.
2. Give a configurable way to use a different engine for spell checking. The current means are really not useful for interactive use, and changing/upgrading dictionaries is not very straightforward. I propose specifically to:
   1. define a standard interface between LT and the underlying spellchecker.
   2. provide at least two spellcheckers (fsa and hunspell).
   3. make it possible to choose at runtime (via a Preference Page).
   4. decouple checking from suggestion generation, i.e. split the current check(document) function into review(document) and suggest(word). This would speed up hunspell verification enough to make it usable interactively, and could be trivially implemented by fsa doing everything in one step (as it currently does) and returning the data in two separate steps.
   5. provide a thin interface layer to control the underlying spellchecker, allowing things such as (multiple) dictionary selection and dictionary maintenance (add word). If the underlying engine is unable to perform the operation, the wrapper can simply return an error, letting the caller decide what to do.
3. Use transliteration to remove all characters not really present in the dictionary (e.g. see: http://unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml) using either ICU or GNU iconv. This would allow normalizing the input (after processing of escape sequences, see (4)) to what is actually supported by the dictionary.
4. Make sure the input structure is correctly detected and handled accordingly. This is not really interesting for interactive use, since it must be part of the calling application, but it could be very important for standalone usage. This usually boils down to two things:
   1. detect parts that are not to be checked (e.g. XML tags).
   2. replace escape sequences in sections that should be checked (e.g. &quot;).
   This is a kind of filter depending on the input structure; it could be made pluggable (to be future-proof) and could either be auto-detected (MIME type?), if possible, or specified via a command-line option/API.

I know Daniel is not fond of too many user configurations, but IMHO the needs LT can be called on to fulfill are so different that a one-size-fits-all strategy is very likely to disappoint almost everyone.

I am ready to stand corrected, but the above reflects my current needs and understanding of LT. I am also available to help, in my ample spare time ;)

I will do my best to answer any comments in the next few hours, but I will be out of town from tomorrow evening until Sept. 15th. After that I will be (more or less) available again.

Thanks for the good work and
Best Regards
Mauro Condarelli

-- Live Security Virtual Conference Exclusive live event will cover all the ways
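The review/suggest split and the thin dictionary-maintenance layer proposed above could look roughly like the following. This is only a sketch under stated assumptions: the interface SpellEngine, its three methods and the toy SetBackedEngine are hypothetical names made up for illustration and are not part of LanguageTool; a real implementation would wrap fsa or hunspell instead of an in-memory word set.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class SpellEngineSketch {

    /** Hypothetical interface, NOT an existing LanguageTool API. */
    interface SpellEngine {
        /** Fast pass: returns unknown words without computing suggestions. */
        List<String> review(String text);

        /** Slow pass: suggestions for one word, computed only on demand. */
        List<String> suggest(String word);

        /** Dictionary maintenance; returns false if the engine cannot do it. */
        boolean addWord(String word);
    }

    /** Toy engine backed by word sets, standing in for an fsa/hunspell wrapper. */
    static class SetBackedEngine implements SpellEngine {
        private final Set<String> standard = new TreeSet<>();
        private final Set<String> user = new TreeSet<>();

        SetBackedEngine(Collection<String> words) {
            for (String w : words) {
                standard.add(w.toLowerCase(Locale.ROOT));
            }
        }

        private boolean known(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            return standard.contains(w) || user.contains(w);
        }

        @Override
        public List<String> review(String text) {
            List<String> unknown = new ArrayList<>();
            // Split on anything that is not a letter; skip empty tokens.
            for (String token : text.split("[^\\p{L}]+")) {
                if (!token.isEmpty() && !known(token)) {
                    unknown.add(token);
                }
            }
            return unknown;
        }

        @Override
        public List<String> suggest(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            Set<String> result = new TreeSet<>();
            // Naive suggestion strategy: same length, exactly one letter differs.
            for (String candidate : standard) {
                if (oneSubstitutionApart(w, candidate)) result.add(candidate);
            }
            for (String candidate : user) {
                if (oneSubstitutionApart(w, candidate)) result.add(candidate);
            }
            return new ArrayList<>(result);
        }

        @Override
        public boolean addWord(String word) {
            user.add(word.toLowerCase(Locale.ROOT));
            return true; // an engine without a writable dictionary would return false
        }

        private static boolean oneSubstitutionApart(String a, String b) {
            if (a.length() != b.length()) return false;
            int diff = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) diff++;
            }
            return diff == 1;
        }
    }

    public static void main(String[] args) {
        SpellEngine engine = new SetBackedEngine(java.util.Arrays.asList("ciao", "mondo"));
        System.out.println(engine.review("Ciao mondp")); // unknown words only
        System.out.println(engine.suggest("mondp"));     // computed on request
        engine.addWord("mondp");                          // "add to user dictionary"
        System.out.println(engine.review("Ciao mondp"));
    }
}
```

The point of the split is that an interactive editor can call review() on every keystroke and pay for suggest() only when the user opens the context menu on a flagged word.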
[Languagetool] path changes in SVN
Hi,

to become a bit more Maven compatible, I'm going to move directories in SVN today:

src/test will become src/test/java
src/java will become src/main/java
src/dev will become src/main/dev
src/rules will become src/main/resources/rules
src/resource will become src/main/resources/resource

I will try to do that at 19:00 CET. You might want to commit any local changes before that or create a patch, as I'm not sure if merging in your changes will work when the paths change.

The long-term goal is to build LT with mvn and to also host it in Maven central. That will of course mean a different build process, and I don't know yet how difficult that will be to implement. Anyway, I'll try to do that step-by-step, keeping everything working all the time.

Regards
Daniel

--
http://www.danielnaber.de

-- Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
Re: [Languagetool] path changes in SVN
Hello Daniel,

that won't really make too much of a difference. I've set up a POM which takes into account the current project structure.

A better situation for Maven compatibility would split the project into three code modules:

- core
- dev
- openoffice

and possibly a number of language resource modules.

Maybe you want to start by having a look at the POM I crafted.

-- Richard
Re: [Languagetool] path changes in SVN
On 02.09.2012, 19:20:56 Richard Eckart de Castilho wrote:

Hi Richard,

> that won't really make too much of a difference. I've set up a POM
> which takes into account the current project structure.

I know it doesn't help with your short-term goal, but when we build LT with mvn, I think we should use the standard directory layout.

> Maybe you want to start by having a look at the pom I crafted.

Could you send the URL again? I think the one in your artifact repo looked almost empty, i.e. there were no dependencies... maybe I just looked at the wrong place.

Regards
Daniel

--
http://www.danielnaber.de
Re: [Languagetool] path changes in SVN
I have added a POM for the restructured trunk to the bug as well now.

Please mind that Eclipse can be quite confused if you use the m2e plugin to switch the project to a Maven project, because there is already Eclipse metadata present in the source code repository. On the command line, everything should be mostly fine - some tests fail in 1.8 and some more fail in trunk.

There is a spelling error in the package of morfologik-speller in the org.carrot artifact (it's morflogik instead of morfologik), so I had to actually change the imports in LanguageTool to match that before the stuff would compile.

Cheers,

-- Richard

On 02.09.2012 at 19:52, Richard Eckart de Castilho wrote:

> Hello Daniel,
>
>>> that won't really make too much of a difference. I've set up a POM
>>> which takes into account the current project structure.
>>
>> I know it doesn't help with your short-term goal, but when we build LT
>> with mvn, I think we should use the standard directory layout.
>
> That's a good idea. However, the standard layout only includes
> src/main/java and src/test/java. I think dev should best go to its own
> module. Maybe you are already planning that.
>
> At the risk of repeating things you have already thought about, I'll
> just mention what else I noticed when I built the POM:
>
> - the i18n properties files are in the regular source folder. In the
>   standard Maven layout, they should go to src/main/resources.
>
> - I think that rules and resource would best be kept somewhere under
>   an org/languagetool package to avoid any potential conflict with
>   other artifacts.
>
>>> Maybe you want to start by having a look at the pom I crafted.
>>
>> Could you send the URL again? I think the one in your artifact repo
>> looked almost empty, i.e. there were no dependencies... maybe I just
>> looked at the wrong place.
>
> I am not sure where you were looking. I attached a POM for 1.8 to this
> issue
>
> https://sourceforge.net/tracker/index.php?func=detail&aid=3564184&group_id=110216&atid=655717
>
> I'll also add one for trunk now to the same issue. Probably I'll go on
> with trying to get cjftransform and ictclas4j to Maven Central next.
>
> Cheers,
>
> -- Richard
Re: [Languagetool] path changes in SVN
On 02.09.2012, 18:00:41 Daniel Naber wrote:

> to become a bit more Maven compatible, I'm going to move directories in
> SVN today:

It would be nice if an Eclipse user could update the .project and .classpath files accordingly.

Regards
Daniel

--
http://www.danielnaber.de
Re: [Languagetool] path changes in SVN
On 02.09.2012 at 21:37, Daniel Naber wrote:

> On 02.09.2012, 18:00:41 Daniel Naber wrote:
>
>> to become a bit more Maven compatible, I'm going to move directories
>> in SVN today:
>
> It would be nice if an Eclipse user could update the .project and
> .classpath files accordingly.

If you plan to move to Maven, I'd recommend not keeping Eclipse metadata in the repository. That would only be useful for Eclipse users that do not use Maven. It would confuse things for Eclipse users that are actually using Maven.

-- Richard
Re: [Languagetool] presentation
On 02.09.2012, 15:42:26 Mauro Condarelli wrote:

Hi Mauro,

> I'm interested in using LT as an advanced spell-checker for a document
> editor I'm building (eclipse RCP).

welcome to the list and thanks for your ideas! I'm just trying to add some additional points to what Dominique already mentioned.

> I know Daniel is working on a kind of ignore using a file deep into the
> file hierarchy; this doesn't help an interactive usage and IMHO is not
> a solution for any serious use-case.

It's for a different use case: our rules might make suggestions that are so specific that the spell checker doesn't know them, and it would be unfortunate if we corrected to something that the spell checker then complains about. From the user's point of view it is indeed something different from "ignore word" (which isn't implemented yet - help is welcome).

> 3. LT doesn't really understand Unicode (prevents usage on
>    htLaTeX-generated docs):
>    1. it does not understand ligatures.
>    2. it does not understand special apostrophe.
>    3. it does not understand other special chars.

There's http://en.wikipedia.org/wiki/Unicode_equivalence and we might want to use that.

> 4. LT understanding of XML is very limited as it does not understand
>    &xxx; constructs (prevents usage on my document source).

This should be done outside of LT, we should basically only work on plain text.

> 1. I can help refining rules for Italian, as tester at beginning, more
>    active if and when I will learn how to write efficient rules.

That's great!

> 1. define a standard interface between LT and the underlying
>    spellchecker.

Everything that detects an error in LT is a subclass of Rule, and for spell checking we use SpellingCheckRule, which already has two subclasses (for Hunspell and for Morfologik).

> 3. make it possible to choose at runtime (via Preference Page).

As I mentioned, I'd rather prefer no configuration, as this is something too complex for the user to decide.

> 3. Use Transliteration to remove all characters not really present in
>    dictionary (e.g. see: ...)

This seems to be a lossy step, so Unicode normalization (see above) might be more appropriate.

> I know Daniel is not fond of too many user configurations, but IMHO the
> needs LT can be called to fulfill are so different that a
> one-size-fits-all strategy is very likely to disappoint almost
> everyone.

People embedding LT into their own applications already have full freedom, e.g. they can create their own rules and deactivate ours. I think this helps a lot. LT is basically (at least) two things: a Java library and an application. If we manage to put those two in their own Maven modules, this could help us to get a clearer picture of what needs to be done where.

So again, welcome to the list. You might want to check out http://www.languagetool.org/development/ for getting started with writing rules.

Regards
Daniel

--
http://www.danielnaber.de
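As an illustration of the Unicode normalization idea mentioned above, here is a minimal sketch of a pre-processing step that a calling application could run before handing plain text to a spell checker. It uses only java.text.Normalizer from the JDK; the class name InputNormalizer and the choice of NFKC plus an explicit apostrophe mapping are assumptions made for this example, not something LT does.

```java
import java.text.Normalizer;

public class InputNormalizer {

    /**
     * Hypothetical helper (not part of LanguageTool): maps input text to
     * the characters a plain dictionary is likely to contain.
     */
    static String normalize(String input) {
        // NFKC applies compatibility mappings (e.g. the "fi" ligature U+FB01
        // becomes the two letters "fi") and then composes base letters with
        // combining accents into single code points.
        String s = Normalizer.normalize(input, Normalizer.Form.NFKC);
        // The typographic apostrophe U+2019 has no Unicode decomposition,
        // so normalization leaves it alone; map it explicitly.
        return s.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        // "l'ufficio" written with a typographic apostrophe and an fi ligature
        System.out.println(normalize("l\u2019uf\uFB01cio"));
    }
}
```

Note that the result can be longer or shorter than the input (a ligature expands to two letters, a letter plus combining accent shrinks to one), so a caller that reports error positions would have to map offsets back to the original text. NFKC also folds some other compatibility characters, so it is not lossless either, just narrower than a Latin-to-ASCII transliteration.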
Re: [Languagetool] path changes in SVN
On 02.09.2012, 23:59:18 Daniel Naber wrote:

> Sorry for the hassle, but I need to move rules and resources again.

This should be finished now, please don't forget to update. If something seems messed up you might want to consider a fresh re-checkout of https://languagetool.svn.sourceforge.net/svnroot/languagetool/trunk/JLanguageTool

Regards
Daniel

--
http://www.danielnaber.de