On 2015-03-03 at 22:59, Daniel Naber wrote:
> On 2015-03-03 14:36, Andriy Rysin wrote:
>
>> I installed jdk1.7.0_75 and German tests pass with it so it's java 8
>> which makes it fail.
>
> I did some debugging and the problem is caused by the elements in
> Unifier that we iterate over but that have no guaranteed order, like
> Maps and Sets. By mechanically replacing them with classes that have an
> order (e.g. ConcurrentHashMap -> LinkedHashMap), I could make Java 7 and
> Java 8 behave the same way. Actually the wrong way, because then the
> unification fails under Java 7, too. So we'll need to change the
> algorithm so it doesn't depend on the order of elements. I'll share my
> "debugging" branch. Warning: it's full of System.out.println and changes
> just for debugging. Run a check on "Die diplomatischen Beziehungen" with
> Java 7 and Java 8 and that branch and you'll see the differences in the
> output.

Actually, I think I have found something close to the cause of the bug:
some readings are assigned attribute values that they don't really have.
For example, in Java 8, the reading "PRO:DEM:NOM:SIN:FEM" of "Die" is
assigned both the "singular" and the "plural" value of the "number"
attribute. Unification, as far as I can see, works fine afterwards; it's
just that in Java 8 the order of elements in the Map no longer happens to
hide the error in the algorithm.

I think we are getting close to the point where we should add a generic
attribute-value interface to our AnalyzedTokens. The Unifier is so complex
because it does two things at the same time:

- checking token attribute values by using regexes (which is
  computationally costly, and is done many times);
- running unification on the values.

If we could move the first part of the code to another class, which would
analyze POS tags to get the proper attribute values, the code would be
cleaner and faster. The basic attribute-value class could contain several
default attributes, such as number, case, gender, and tense. They probably
need to be addressable by Strings so that subclasses can easily extend
them for new languages and new tagsets. Not all languages need to have
such attribute values in their tagsets, but they need to implement a POS
tag analyzer if they want to use these attributes.

Another advantage of this setup would be that we could easily use
computationally cheap tests in our grammar rules, for example by having:

  <token><attribute id="reflexivity"><value id="reflexive"/></attribute></token>

for language-dependent attributes (not defined in our XML schema), and a
terser form for default attributes:

  <token number="singular"/>

Because these values would be precomputed, no regex would be evaluated,
just a very quick equals() test on the AnalyzedToken.
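
To make the idea a bit more concrete, here is a rough sketch of what
precomputing the values could look like. Everything in it is made up for
illustration: the class name AttributeSketch, the methods analyze() and
hasValue(), and the assumed tag layout TYPE:SUBTYPE:CASE:NUMBER:GENDER are
not existing LanguageTool code. It keeps the raw tag values ("SIN") rather
than mapping them to names like "singular", which a real analyzer would
probably do.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names exist in LanguageTool.
class AttributeSketch {

  // Assumed positional layout: TYPE:SUBTYPE:CASE:NUMBER:GENDER
  private static final int CASE_POS = 2;
  private static final int NUMBER_POS = 3;
  private static final int GENDER_POS = 4;

  /** Splits a positional tag once and stores each value under its attribute name. */
  static Map<String, Set<String>> analyze(String posTag) {
    Map<String, Set<String>> attributes = new LinkedHashMap<>();
    String[] parts = posTag.split(":");
    put(attributes, "case", parts, CASE_POS);
    put(attributes, "number", parts, NUMBER_POS);
    put(attributes, "gender", parts, GENDER_POS);
    return attributes;
  }

  private static void put(Map<String, Set<String>> attributes, String name,
                          String[] parts, int pos) {
    // Shorter tags simply leave the attribute unset.
    if (pos < parts.length) {
      attributes.computeIfAbsent(name, k -> new LinkedHashSet<>()).add(parts[pos]);
    }
  }

  /** The cheap test a rule could run: a set lookup instead of a regex match. */
  static boolean hasValue(Map<String, Set<String>> attributes, String name, String value) {
    return attributes.getOrDefault(name, Collections.<String>emptySet()).contains(value);
  }

  public static void main(String[] args) {
    Map<String, Set<String>> die = analyze("PRO:DEM:NOM:SIN:FEM");
    System.out.println(hasValue(die, "number", "SIN")); // prints "true"
    System.out.println(hasValue(die, "number", "PLU")); // prints "false"
  }
}
```

With values stored per reading like this, the reading "PRO:DEM:NOM:SIN:FEM"
can only ever carry one value for "number", regardless of the order in
which anything is iterated.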

All we need for this is:

- an attribute-value class that would be a member of the AnalyzedToken,
  probably a Map to a Set;
- a POS tag analyzer class, which would assign empty attributes in the
  generic version, and would be subclassed by all languages that have
  tagsets; for most positional tagsets, we don't need regexes to parse
  the tags, so this could be really fast (for determiners in German, for
  example, we simply need to split the string by ":" and read the String
  at a given constant position in the array);
- some trivial extensions in the Element class and the PatternRuleLoader.

Regards,
Marcin