On 2015-03-03 at 22:59, Daniel Naber wrote:
> On 2015-03-03 14:36, Andriy Rysin wrote:
>
>> I installed jdk1.7.0_75 and German tests pass with it so it's java 8
>> which makes it fail.
>
> I did some debugging and the problem is caused by the elements in
> Unifier that we iterate over but that have no guaranteed order, like
> Maps and Sets. By mechanically replacing them with classes that have an
> order (e.g. ConcurrentHashMap -> LinkedHashMap), I could make Java 7 and
> Java 8 behave the same way. Actually the wrong way, because then the
> unification fails under Java 7, too. So we'll need to change the
> algorithm so it doesn't depend on the order of elements. I'll share my
> "debugging" branch. Warning: it's full of System.out.println and changes
> just for debugging. Run a check on "Die diplomatischen Beziehungen" with
> Java 7 and Java 8 and that branch and you'll see the differences in the
> output.

Actually, I think I have found something close to the cause of the bug: 
some readings are assigned attribute values that they don't really have. 
For example, in Java 8, the reading "PRO:DEM:NOM:SIN:FEM" of "Die" is 
assigned both the "singular" and the "plural" value of the "number" 
attribute. Unification, as far as I can see, works fine afterwards; it's 
just that in Java 8 the unordered elements in the Map no longer stop 
the algorithm from going wrong.

I think we are getting close to the point where we should add a generic 
attribute-value interface to our AnalyzedTokens. The Unifier is so 
complex because it does two things at the same time:

- checking token attribute values by using regexes (which is 
computationally costly, and is done many times);
- running unification on the values.

If we could move the first part of the code to another class, which 
would analyze POS tags to get the proper attribute values, the code 
would be cleaner and faster. The basic attribute-value class could 
contain several default attributes, such as number, case, gender, and 
tense (they probably need to be addressable by Strings so that 
subclasses for new languages and new tagsets can extend them easily). 
Not all languages need to have such attribute values in their tagsets, 
but they need to implement a POS tag analyzer if they want to use these 
attributes.
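
To make the idea more concrete, here is a minimal sketch of what such an 
attribute-value class might look like. The class name TokenAttributes 
and its methods are hypothetical; nothing like this exists in 
LanguageTool yet, and the real design may well differ.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class, not existing LanguageTool API:
// maps an attribute name to the set of values a token has for it.
class TokenAttributes {
  // Default attribute names are plain Strings, so a language-specific
  // subclass can add its own (e.g. "reflexivity") without schema changes.
  static final String NUMBER = "number";
  static final String CASE = "case";
  static final String GENDER = "gender";
  static final String TENSE = "tense";

  // LinkedHashMap/LinkedHashSet keep insertion order, so iterating over
  // the attributes gives the same result on every Java version.
  private final Map<String, Set<String>> values = new LinkedHashMap<>();

  void add(String attribute, String value) {
    Set<String> set = values.get(attribute);
    if (set == null) {
      set = new LinkedHashSet<>();
      values.put(attribute, set);
    }
    set.add(value);
  }

  // Returns an empty set for attributes the token does not have.
  Set<String> get(String attribute) {
    Set<String> set = values.get(attribute);
    if (set == null) {
      return Collections.<String>emptySet();
    }
    return Collections.unmodifiableSet(set);
  }

  boolean has(String attribute, String value) {
    return get(attribute).contains(value);
  }
}
```

An attribute that is never set simply yields an empty set, which is how 
languages without such values in their tagsets would behave.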

Another advantage of this setup would be that we could easily use 
computationally cheap tests in our grammar rules, for example by having:

<token><attribute id="reflexivity"><value id="reflexive"/></attribute></token>

for language-dependent attributes (not defined in our XML schema).

And, more tersely, for default attributes:

<token number="singular"/>

Because these values would be precomputed, no regex would be evaluated, 
just a very quick equals() test on the AnalyzedToken.
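
Roughly, the difference between the two kinds of test could be sketched 
as follows. The class AttributeCheck, its method names, and the regex 
are made up for illustration only; they are not taken from the Unifier 
or from any existing rule.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper contrasting the two kinds of check.
class AttributeCheck {
  // Regex-based check: the pattern is matched against the POS tag
  // every time the test runs. The pattern itself is an assumption.
  private static final Pattern SINGULAR = Pattern.compile(".*:SIN(:.*)?");

  static boolean isSingularByRegex(String posTag) {
    return SINGULAR.matcher(posTag).matches();
  }

  // Precomputed check: the attribute values were stored once when the
  // token was analyzed, so the test is only a map lookup plus a set
  // membership test (hashCode()/equals() on Strings).
  static boolean hasValue(Map<String, Set<String>> attributes,
                          String attribute, String value) {
    Set<String> values = attributes.get(attribute);
    return values != null && values.contains(value);
  }
}
```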

All we need for this is:

- an attribute-value class that would be a member of the AnalyzedToken, 
probably a Map from attribute names to Sets of values;
- a POS tag analyzer class, which would assign empty attributes in the 
generic version, and would be subclassed by all languages that have 
tagsets; for most positional tagsets, we don't need regexes to parse the 
tags, so this could be really fast (for determiners in German, for 
example, we simply need to split the string by ":", and read the String 
at a given constant position in the array).
- some trivial extensions in the Element class and the PatternRuleLoader.
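
The analyzer part could be sketched like this. All class names are 
hypothetical, and the positions of case, number, and gender are assumed 
only from the single example tag "PRO:DEM:NOM:SIN:FEM" above, so they 
would have to be checked against the real German tagset. The sketch 
returns the raw tag values ("SIN", not "singular"), since the mapping to 
readable names is not decided yet, and it returns one value per 
attribute for a single tag; merging several readings into Sets is left 
out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generic analyzer: assigns no attributes at all.
class PosTagAnalyzer {
  Map<String, String> analyze(String posTag) {
    return Collections.emptyMap();
  }
}

// Hypothetical subclass for positional tags like "PRO:DEM:NOM:SIN:FEM".
class GermanDeterminerTagAnalyzer extends PosTagAnalyzer {
  // Assumed positions, inferred from one example tag only.
  private static final int CASE_INDEX = 2;
  private static final int NUMBER_INDEX = 3;
  private static final int GENDER_INDEX = 4;

  @Override
  Map<String, String> analyze(String posTag) {
    // Split the tag by ":" and read fixed positions of the array.
    String[] parts = posTag.split(":");
    if (parts.length <= GENDER_INDEX) {
      // Tag is too short to carry these attributes.
      return Collections.emptyMap();
    }
    Map<String, String> result = new LinkedHashMap<>();
    result.put("case", parts[CASE_INDEX]);
    result.put("number", parts[NUMBER_INDEX]);
    result.put("gender", parts[GENDER_INDEX]);
    return result;
  }
}
```

With this, each reading gets exactly the values found in its own tag, so 
a "SIN" reading could not end up with a plural value as well.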

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
