Hi all,

I have been away for a long time, but that's another story, one that must
be proofread with the latest version of LT.

I'm back again, and there is a lot of work to be done before the Spanish
module can be considered "stable".

So I want to share with you my view on the roadmap for Spanish.

*1st and foremost: Disambiguator*

Developing a disambiguator is a harder endeavour than I initially thought.
A small change has a large impact on rule triggering, creating a "butterfly
effect" that spreads across the language rules. It can boost performance or
make it plummet.

So it is critical to develop the disambiguator to a high standard from
minute one, because the accuracy and complexity of the rules in the
grammar.xml file are very sensitive to minor disambiguator changes.

Disambiguation changes the strategy of rule design, so the rule set should
not grow too much until effective disambiguation is in service.

Thank you very much, Marcin, for the useful disambiguator logging.

My current strategy for disambiguation is to start with the longer
constructions and then work down to two-token constructions. Both positive
and negative examples should be included.
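
To make the "longest constructions first" idea concrete, here is a minimal Python sketch. It is not how LT's disambiguator is implemented: the pattern table, the tag names (PREP, DET, ADV and so on) and the function `disambiguate` are all hypothetical and only illustrate the ordering. A three-token construction claims its tokens before any two-token pattern gets to look at them.

```python
from typing import Dict, List, Set, Tuple

# A token is (surface form, set of candidate POS tags).
Token = Tuple[str, Set[str]]

# Hypothetical toy patterns: lowercase surface forms mapped to the tag
# chosen for each position. These are not real LT rules or real LT tags.
PATTERNS: Dict[Tuple[str, ...], Tuple[str, ...]] = {
    ("a", "la", "vez"): ("ADV", "ADV", "ADV"),
    ("la", "vez"): ("DET", "NOUN"),
}


def disambiguate(tokens: List[Token]) -> List[Token]:
    """Apply patterns longest first; shorter ones only see what is left."""
    result = list(tokens)
    done = [False] * len(tokens)
    for pattern in sorted(PATTERNS, key=len, reverse=True):
        size = len(pattern)
        tags = PATTERNS[pattern]
        for start in range(len(tokens) - size + 1):
            span = range(start, start + size)
            # Skip spans already claimed by a longer construction.
            if any(done[i] for i in span):
                continue
            words = tuple(tokens[i][0].lower() for i in span)
            if words != pattern:
                continue
            # Only choose a tag that is among the token's candidates.
            if all(tags[k] in tokens[start + k][1] for k in range(size)):
                for k, i in enumerate(span):
                    result[i] = (tokens[i][0], {tags[k]})
                    done[i] = True
    return result
```

A positive example would be "a la vez", where every token ends up with the single tag ADV. A negative example would be "otra vez", which matches no pattern and must keep its candidate tags untouched.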

*2nd stage: Dictionary*

I've noticed that several rules trigger because of an incorrect POS tag
returned by the fsa dictionary. These issues can be solved, but there are
others that cannot. Some pronouns are attached to verbs, and they need to be
identified to get a correct POS tag.
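
As an illustration of the attached-pronoun problem, the following Python sketch peels enclitic pronouns off the end of a word and checks whether what remains is a known verb form. It is only a sketch under assumptions: `VERB_FORMS` is a hypothetical mini lexicon standing in for the real dictionary lookup, `split_clitics` and `strip_acute` are made-up helper names, and spelling changes such as "vámonos" (vamos + nos) are ignored.

```python
import unicodedata
from typing import List, Optional, Tuple

# Pronouns that Spanish can attach to the end of a verb form.
CLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "os", "los", "las", "les")

# Hypothetical mini lexicon standing in for the real dictionary lookup.
VERB_FORMS = {"da", "dando", "compra"}


def strip_acute(word: str) -> str:
    """Remove acute accents (á -> a) while keeping ñ and ü intact."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def split_clitics(word: str, max_clitics: int = 2) -> Optional[Tuple[str, List[str]]]:
    """Return (verb form, clitics) if word is verb + pronouns, else None."""

    def search(rest: str, found: List[str]) -> Optional[Tuple[str, List[str]]]:
        if found:
            # The attached pronouns often add a written accent to the verb,
            # so try the remainder both as written and without the accent.
            for candidate in (rest, strip_acute(rest)):
                if candidate in VERB_FORMS:
                    return candidate, found
        if len(found) == max_clitics:
            return None
        for clitic in CLITICS:
            if rest.endswith(clitic) and len(rest) > len(clitic):
                hit = search(rest[: -len(clitic)], [clitic] + found)
                if hit:
                    return hit
        return None

    return search(word.lower(), [])
```

With this toy lexicon, `split_clitics("dámelo")` yields the verb form "da" plus the pronouns "me" and "lo", while a plain noun such as "casa" yields `None`.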

*3rd stage: Rules*

The aim for Spanish is, and always has been, to create a small rule set of
meaningful rules.

Pick rules for common mistakes.
Use inexpensive regular expressions.
Build simpler, more general rules from similar rules when possible.
Use synthesis for suggestions when possible.

A rule for a mistake that is seldom found in common texts, but that is
expensive to run, should be disabled by default.

Rules are grouped by categories and by rule groups. It's important to put
the rules where they belong so they are easy to find.

*Helper tools*

To ensure the quality of the rules, I developed a set of tools that combine
bash scripts and graphical diff tools with a varied corpus.

They are basically self-contained in one folder, but they need access to the
deployed command-line version of LT.
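
The core of such a regression check can be sketched in a few lines of Python. This is not the actual tool set, which is built from bash scripts and graphical diff tools. It assumes that the LT output for the corpus has already been saved as text once before and once after a change, and the function name `diff_reports` is hypothetical.

```python
import difflib
from typing import List


def diff_reports(before: str, after: str) -> List[str]:
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    lines = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes unchanged lines with two spaces and hint lines with
    # "? ", so keeping "- " and "+ " leaves just the real differences.
    return [line for line in lines if line.startswith(("- ", "+ "))]
```

An empty result means the change did not alter any rule match on the corpus. Every remaining line is a match that disappeared or appeared and is worth a manual look.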

I am keen to share them, but I don't want to taint the LT code. Should I do
that in a separate GitHub project? It's just an idea.

Comments are welcome.

Best regards,
Juan Martorell
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel