[Languagetool] presentation

2012-09-02 Thread Mauro Condarelli
Hi,
I'm interested in using LT as an advanced spell-checker for a document
editor I'm building (Eclipse RCP).
I exchanged a few mails with Daniel Naber in the forum and he
redirected me here.

I'm a fairly seasoned programmer with a good command of Italian and,
to a lesser degree, English.
As a programmer I have a solid background in C, C++ and Java, ranging
from Linux kernel hacking to GUI building (mostly Eclipse RCP and Qt).

I am Italian and my focus is primarily on the Italian language.

I am building a plugin providing ISpellingEngine and related classes,
compatible with all Eclipse installations.

To this end I've found LT to be very promising, if somewhat immature
(as may be expected).
The problems I've found are:

 1. The Italian rules raise far too many false positives, especially
    the tense concordance rule (GR_10_001). I am willing to help
    refine the rules.
 2. The spell checker has no way to implement "ignore word" and "add
    to user dictionary", which is essential for interactive use. I
    know Daniel is working on a kind of ignore list using a file deep
    in the file hierarchy; this doesn't help interactive usage and
    IMHO is not a solution for any serious use case. I plan to support
    three independent dictionaries per language in my application:
    standard, user, document.
 3. LT doesn't really understand Unicode (this prevents usage on
    htLaTeX-generated docs):
     1. it does not understand ligatures.
     2. it does not understand special apostrophes.
     3. it does not understand other special characters.
 4. LT's understanding of XML is very limited, as it does not
    understand &xxx; constructs (this prevents usage on my document
    source).

I haven't (yet) dug deep into the LT code, but I have some ideas I wish
to share. To overcome the above difficulties I propose the following
actions (but I'm open to other suggestions, of course):

 1. I can help refine the rules for Italian, as a tester at the
    beginning, and more actively if and when I learn how to write
    efficient rules.
 2. Offer a configurable choice of spell-checking engine. The current
    means are really not useful for interactive use, and
    changing/upgrading dictionaries is not very straightforward.
    Specifically, I propose to:
     1. define a standard interface between LT and the underlying
        spellchecker.
     2. provide at least two spellcheckers (fsa and hunspell).
     3. make it possible to choose at runtime (via a Preference Page).
     4. decouple checking from suggestion generation, i.e. split the
        current check(document) function into review(document) and
        suggest(word). This would speed up hunspell verification
        enough to make it usable interactively, and fsa could
        implement it trivially by doing everything in one step (as it
        currently does) and returning the data in two separate steps.
     5. provide a thin interface layer to control the underlying
        spellchecker, allowing things such as (multiple) dictionary
        selection and dictionary maintenance (add word). If the
        underlying engine is unable to perform the operation, the
        wrapper can simply return an error, letting the caller decide
        what to do.
 3. Use transliteration to remove all characters not actually present
    in the dictionary (e.g. see:
    http://unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml)
    using either ICU or GNU iconv. This would normalize the input
    (after the escape-sequence processing described in (4)) to what
    the dictionary actually supports.
 4. Make sure the input structure is detected correctly and handled
    accordingly. This is not really relevant for interactive use,
    where it must be part of the calling application, but could be
    very important for standalone usage. It usually boils down to two
    things:
     1. detect parts that are not to be checked (e.g. XML tags).
     2. replace escape sequences in sections that should be checked
        (e.g. &quot;).

    This is a kind of filter that depends on the input structure; it
    could be made pluggable (to be future-proof) and could be either
    auto-detected (MIME type?), if possible, or specified via a
    command-line option or the API.
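
To make the review/suggest split from point 2 more concrete, here is a
minimal sketch of what such an interface might look like. Everything in
it is hypothetical: SpellBackend, SetBackend and the method names are
invented for illustration and are not existing LT classes, and the toy
backend uses an in-memory word set rather than fsa or hunspell.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface (not an existing LT API): detection and
// suggestion generation are separate calls.
interface SpellBackend {
    /** Returns the unknown words in the text; computes no suggestions. */
    List<String> review(String text);

    /** Returns replacement candidates for one word (possibly slow). */
    List<String> suggest(String word);

    /** Adds a word to the dictionary; false means it was not added. */
    boolean addWord(String word);
}

// Toy backend over an in-memory word set, for illustration only.
class SetBackend implements SpellBackend {
    private final Set<String> words = new HashSet<>();

    SetBackend(String... known) {
        for (String w : known) {
            words.add(w.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public List<String> review(String text) {
        List<String> unknown = new ArrayList<>();
        // Split on anything that is neither a letter nor an apostrophe.
        for (String token : text.split("[^\\p{L}']+")) {
            if (!token.isEmpty() && !words.contains(token.toLowerCase(Locale.ROOT))) {
                unknown.add(token);
            }
        }
        return unknown;
    }

    @Override
    public List<String> suggest(String word) {
        // Naive candidate search: known words of the same length that
        // differ from the input in exactly one position.
        String w = word.toLowerCase(Locale.ROOT);
        List<String> candidates = new ArrayList<>();
        for (String known : words) {
            if (known.length() == w.length() && mismatches(known, w) == 1) {
                candidates.add(known);
            }
        }
        Collections.sort(candidates);
        return candidates;
    }

    @Override
    public boolean addWord(String word) {
        return words.add(word.toLowerCase(Locale.ROOT));
    }

    private static int mismatches(String a, String b) {
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                count++;
            }
        }
        return count;
    }
}

class SpellBackendDemo {
    public static void main(String[] args) {
        SpellBackend backend = new SetBackend("il", "gatto", "dorme");
        for (String unknown : backend.review("Il gatto dorne")) {
            // Suggestions are requested only for words the user asks about.
            System.out.println(unknown + " -> " + backend.suggest(unknown));
        }
    }
}
```

The point of the split is that an editor can call review() on every
keystroke and defer suggest() until the user opens the context menu on
a flagged word.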

I know Daniel is not fond of too many user configurations, but IMHO the
needs LT may be called on to fulfill are so different that a
one-size-fits-all strategy is very likely to disappoint almost everyone.
I am ready to stand corrected, but the above reflects my current needs
and understanding of LT.
I am also available to help, in my ample spare time ;)

I will do my best to answer any comments in the next few hours, but I
will be out of town from tomorrow evening until Sept. 15th. After that
I will be (more or less) available again.

Thanks for the good work and
Best Regards
Mauro Condarelli

[Languagetool] path changes in SVN

2012-09-02 Thread Daniel Naber
Hi,

to become a bit more Maven compatible, I'm going to move directories in SVN 
today:

src/test will become src/test/java
src/java will become src/main/java
src/dev will become src/main/dev

src/rules will become src/main/resources/rules
src/resource will become src/main/resources/resource

I will try to do that at 19:00 CET. You might want to commit any local 
changes before that or create a patch, as I'm not sure if merging in your 
changes will work when the paths change.

The long-term goal is to build LT with mvn and to also host it in Maven 
central. That will of course mean a different build process, and I don't 
know yet how difficult that will be to implement. Anyway, I'll try to do that 
step-by-step, keeping everything working all the time.

Regards
 Daniel

-- 
http://www.danielnaber.de


--
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: [Languagetool] path changes in SVN

2012-09-02 Thread Richard Eckart de Castilho
Hello Daniel,

that won't really make too much of a difference. I've set up a POM which takes 
into account the current project structure.

A better setup for Maven compatibility would be to split the project into
three code modules:

- core
- dev
- openoffice

and possibly a number of language resource modules.

Maybe you want to start by having a look at the pom I crafted.

-- Richard





Re: [Languagetool] path changes in SVN

2012-09-02 Thread Daniel Naber
On 02.09.2012, 19:20:56 Richard Eckart de Castilho wrote:

Hi Richard,

 that won't really make too much of a difference. I've set up a POM which
 takes into account the current project structure.

I know it doesn't help with your short-term goal, but when we build LT with 
mvn, I think we should use the standard directory layout.

 Maybe you want to start by having a look at the pom I crafted.

Could you send the URL again? I think the one in your artifact repo looked 
almost empty, i.e. there were no dependencies... maybe I just looked at the 
wrong place.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: [Languagetool] path changes in SVN

2012-09-02 Thread Richard Eckart de Castilho
I have now added a POM for the restructured trunk to the bug as well. Please 
note that Eclipse can get quite confused if you use the m2e plugin to switch 
the project to a Maven project, because there is already Eclipse metadata 
present in the source code repository. On the command line, everything should 
be mostly fine - some tests fail in 1.8 and some more fail in trunk.

There is a spelling error in the package of morfologik-speller in the 
org.carrot artifact (it's morflogik instead of morfologik), so I had to 
actually change the imports in LanguageTool to match that before the stuff 
would compile.

Cheers,

-- Richard

On 02.09.2012 at 19:52, Richard Eckart de Castilho wrote:

 Hello Daniel,
 
 that won't really make too much of a difference. I've set up a POM which
 takes into account the current project structure.
 
 I know it doesn't help with your short-term goal, but when we build LT with 
 mvn, I think we should use the standard directory layout.
 
 That's a good idea. However, the standard layout only includes src/main/java 
 and src/test/java. I think dev should best go to its own module. Maybe you 
 are already planning that. At the risk of repeating things you have already 
 thought about, I'll just mention what else I noticed when I built the POM:
 
 - the i18n properties files are in the regular source folder. In the 
 standard Maven layout, they should go to src/main/resources. 
 - I think that rules and resource would best be kept somewhere under an 
 org/languagetool package to avoid any potential conflict with other 
 artifacts.
 
 Maybe you want to start by having a look at the pom I crafted.
 
 Could you send the URL again? I think the one in your artifact repo looked 
 almost empty, i.e. there were no dependencies... maybe I just looked at the 
 wrong place.
 
 I am not sure where you were looking. I attached a POM for 1.8 to this issue
 
   
 https://sourceforge.net/tracker/index.php?func=detail&aid=3564184&group_id=110216&atid=655717
 
 I'll also add one for trunk now to the same issue.
 
 Probably I'll go on with trying to get cjftransform and ictclas4j to Maven 
 Central next.
 
 Cheers,
 
 -- Richard




Re: [Languagetool] path changes in SVN

2012-09-02 Thread Daniel Naber
On 02.09.2012, 18:00:41 Daniel Naber wrote:

 to become a bit more Maven compatible, I'm going to move directories in
 SVN  today:

It would be nice if an Eclipse user could update the .project and .classpath 
files accordingly.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: [Languagetool] path changes in SVN

2012-09-02 Thread Richard Eckart de Castilho
On 02.09.2012 at 21:37, Daniel Naber wrote:

 On 02.09.2012, 18:00:41 Daniel Naber wrote:
 
 to become a bit more Maven compatible, I'm going to move directories in
 SVN  today:
 
 It would be nice if an Eclipse user could update the .project and .classpath 
 files accordingly.

If you plan to move to Maven, I'd recommend not keeping Eclipse metadata in
the repository. That would only be useful for Eclipse users that do not use
Maven. It would confuse things for Eclipse users that are actually using
Maven.

-- Richard




Re: [Languagetool] presentation

2012-09-02 Thread Daniel Naber
On 02.09.2012, 15:42:26 Mauro Condarelli wrote:

Hi Mauro,

 I'm interested in using LT as an advanced spell-checker for a document
 editor I'm building (eclipse RCP).

welcome to the list and thanks for your ideas!

I'm just trying to add some additional points to what Dominique already 
mentioned.

 know Daniel is working on a kind of ignore using a file deep into
 the file hierarchy; this doesn't help an interactive usage and IMHO
 is not a solution for any serious use-case.

It's for a different use case: our rules might make suggestions that are so 
specific that the spell checker doesn't know them, and it would be 
unfortunate if we corrected to something that the spell checker then 
complains about. From the user's point of view it is indeed something 
different from "ignore word" (which isn't implemented yet - help is welcome).
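
As a starting point for discussion, a user-facing "ignore word" could
be layered over the three dictionaries proposed in the original mail
(standard, user, document). The sketch below is only an assumption
about how that might be structured; LayeredDictionary and its methods
are hypothetical names, not existing LT classes.

```java
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not an existing LT class.
class LayeredDictionary {
    enum Layer { STANDARD, USER, DOCUMENT }

    private final Map<Layer, Set<String>> layers = new EnumMap<>(Layer.class);

    LayeredDictionary() {
        for (Layer layer : Layer.values()) {
            layers.put(layer, new HashSet<String>());
        }
    }

    /** "Add to dictionary" would target USER, "ignore word" DOCUMENT. */
    void add(Layer layer, String word) {
        layers.get(layer).add(word.toLowerCase(Locale.ROOT));
    }

    /** A word is accepted if any layer contains it. */
    boolean accepts(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        for (Set<String> words : layers.values()) {
            if (words.contains(key)) {
                return true;
            }
        }
        return false;
    }

    /** Empties one layer, e.g. DOCUMENT when a document is closed. */
    void clear(Layer layer) {
        layers.get(layer).clear();
    }
}

class LayeredDictionaryDemo {
    public static void main(String[] args) {
        LayeredDictionary dict = new LayeredDictionary();
        dict.add(LayeredDictionary.Layer.STANDARD, "casa");
        dict.add(LayeredDictionary.Layer.DOCUMENT, "htLaTeX");
        System.out.println(dict.accepts("htLaTeX"));
        dict.clear(LayeredDictionary.Layer.DOCUMENT);
        System.out.println(dict.accepts("htLaTeX"));
    }
}
```

Keeping the layers separate means an ignore list can be discarded with
its document without touching the user's own dictionary.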

  3. LT doesn't really understand Unicode (prevents usage on
 htLaTeX-generated docs):
  1. it does not understand ligatures.
  2. it does not understand special apostrophe.
  3. it does not understand other special chars.

There's http://en.wikipedia.org/wiki/Unicode_equivalence and we might want 
to use that. 
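
A minimal sketch of what using Unicode equivalence could look like with
the JDK's java.text.Normalizer. The class and method names are made up
for illustration, and mapping the apostrophe U+2019 to the ASCII one is
an assumption about what the dictionaries expect; compatibility
normalization (NFKC) handles ligatures but leaves that apostrophe
untouched.

```java
import java.text.Normalizer;

// Hypothetical pre-processing step, not an existing LT class.
class UnicodeNormalizeSketch {
    static String normalizeForDictionary(String input) {
        // NFKC replaces compatibility characters such as the ligature
        // U+FB01 with "fi" and composes base letter + combining accent.
        String s = Normalizer.normalize(input, Normalizer.Form.NFKC);
        // Assumption: the dictionary uses the ASCII apostrophe, so the
        // right single quotation mark (U+2019) is mapped explicitly.
        return s.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        System.out.println(normalizeForDictionary("\uFB01ne dell\u2019anno"));
    }
}
```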

  4. LT understanding of XML is very limited as it does not understand
 &xxx; constructs (prevents usage on my document source).

This should be done outside of LT, we should basically only work on plain 
text.
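
For callers who need to do this step themselves, a rough sketch of such
a pre-filter is shown here. It is an illustration only: the class name
is hypothetical, the regex-based tag stripping is far cruder than a
real XML parser, and only the five predefined XML entities plus numeric
character references are decoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-filter run by the calling application, not by LT.
class MarkupToPlainText {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern ENTITY =
            Pattern.compile("&(#x?[0-9A-Fa-f]+|[A-Za-z]+);");

    static String toPlainText(String xml) {
        // Replace each tag with a space so adjacent words stay separated.
        String noTags = TAG.matcher(xml).replaceAll(" ");
        Matcher m = ENTITY.matcher(noTags);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character, or the original text if unknown.
    private static String decode(String name, String original) {
        switch (name) {
            case "quot": return "\"";
            case "apos": return "'";
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            default:     break;
        }
        try {
            if (name.startsWith("#x")) {
                int cp = Integer.parseInt(name.substring(2), 16);
                return new String(Character.toChars(cp));
            }
            if (name.startsWith("#")) {
                int cp = Integer.parseInt(name.substring(1));
                return new String(Character.toChars(cp));
            }
        } catch (IllegalArgumentException e) {
            // Malformed number or invalid code point: leave it as is.
            return original;
        }
        return original;
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Say &quot;hi&quot; &amp; go</p>"));
    }
}
```

Note that stripping markup shifts character offsets, so an application
that wants to highlight matches in the original source would also have
to keep a mapping back to the original positions.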

  1. I can help refining rules for Italian, as tester at beginning, more
 active if and when I will learn how to write efficient rules.

That's great!

  1. define a standard interface between LT and the underlying
 spellchecker.

Everything that detects an error in LT is a subclass of Rule, and for spell 
checking we use SpellingCheckRule, which already has two subclasses (for 
Hunspell and for Morfologik).

  3. make it possible to choose at runtime (via Preference Page).

As I mentioned, I'd prefer no configuration, as this is something too 
complex for the user to decide.

  3. Use Transliteration to remove all characters not really present in
 dictionary (e.g. see:

This seems to be a lossy step, so Unicode normalization (see above) might 
be more appropriate.
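
To illustrate why ASCII folding is lossy while canonical normalization
is not, here is a small comparison. It is a stand-in only: stripping
combining marks after NFD approximates, but is not the same as, the
CLDR Latin-ASCII transform mentioned above, and the class and method
names are hypothetical.

```java
import java.text.Normalizer;

// Illustration only; neither method is an existing LT API.
class LossyVsNormalized {
    // Rough ASCII folding: decompose, then drop all combining marks.
    static String foldToAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    // Canonical composition: one code point per accented letter,
    // without discarding the accent.
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0300"; // "e" followed by combining grave
        System.out.println(foldToAscii(decomposed));
        System.out.println(canonical(decomposed));
    }
}
```

In Italian the folded form would make "è" (is) indistinguishable from
"e" (and), which is exactly the kind of information a grammar rule may
need.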

 I know Daniel is not fond of too many user configurations, but IMHO the
 needs LT can be called to fulfill are so different one size-fits-all
 strategy is very likely to disappoint almost anyone.

People embedding LT into their own applications already have full freedom, 
e.g. they can create their own rules and deactivate ours. I think this 
helps a lot.

LT is basically (at least) two things: a Java library and an application. 
If we manage to put those two in their own maven modules, this could help 
us to get a clearer picture of what needs to be done where.


So again, welcome to the list. You might want to check out 
http://www.languagetool.org/development/ for getting started with writing 
rules.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: [Languagetool] path changes in SVN

2012-09-02 Thread Daniel Naber
On 02.09.2012, 23:59:18 Daniel Naber wrote:

 Sorry for the hassle, but I need to move rules and resources again.

This should be finished now, please don't forget to update. If something 
seems messed up you might want to consider a fresh re-checkout of 
https://languagetool.svn.sourceforge.net/svnroot/languagetool/trunk/JLanguageTool

Regards
 Daniel

-- 
http://www.danielnaber.de

