Hi,

I realized that the current implementations of JapaneseWordTokenizer and JapaneseTagger work in quite an odd way.
Because the tagger library they use (called 'sen') does tokenization and tagging in one step, the two steps cannot be separated as cleanly as the interfaces used in LT require.

The current implementation works as follows:

1. JapaneseWordTokenizer calls sen's analyze method, which tokenizes the text and adds POS tags.
2. JapaneseWordTokenizer concatenates the basic form, POS tag, and surface form of each token with a ' ' (space char) and returns these concatenated strings in a List<String>.
3. When JapaneseTagger is called with this List<String> as input, it just splits each string on ' ' and uses the resulting parts to build the necessary AnalyzedTokenReadings.

That means we are using a String to pass structured data around, and we concatenate and split each token unnecessarily. This is less than ideal.

My proposal would be to avoid this issue by working around the current interface as follows:

1. JapaneseWordTokenizer calls sen's analyze method, which tokenizes the text and adds POS tags. We save the tokenized and tagged items in a private "analyzedTokens" field of JapaneseWordTokenizer.
2. JapaneseWordTokenizer just returns null (or an empty List<String>).
3. When JapaneseTagger is called with this (null/empty) List<String> as input, we ignore the input parameter. Instead we get the "analyzedTokens" field directly from the JapaneseWordTokenizer (a reference to which we keep in the JapaneseTagger) and build the needed AnalyzedTokenReadings directly.

That way we violate some separation-of-concerns principles (JapaneseTagger gets the "analyzedTokens" from JapaneseWordTokenizer), but we avoid having to concatenate and then split each input token String for nothing.

Before working further on the implementation of these changes, I wanted to ask whether you think this is the way to go or whether we should stick to the current behavior.
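To make the comparison concrete, here is a rough, self-contained sketch of both variants. It does not use the real sen or LT classes: TaggedToken, FakeAnalyzer, CurrentApproach, ProposedTokenizer, and ProposedTagger are hypothetical stand-ins I made up for illustration, the analyzer output is hard-coded with romanized placeholder values, and a plain TaggedToken takes the place of AnalyzedTokenReadings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Small demo entry point; all classes in this file are hypothetical stand-ins.
final class SketchDemo {
    public static void main(String[] args) {
        ProposedTokenizer tokenizer = new ProposedTokenizer();
        ProposedTagger tagger = new ProposedTagger(tokenizer);
        List<String> placeholder = tokenizer.tokenize("some text");
        for (TaggedToken t : tagger.tag(placeholder)) {
            System.out.println(t.basicForm + "/" + t.posTag + "/" + t.surface);
        }
    }
}

// Stand-in for one analyzed token (NOT sen's real token class).
final class TaggedToken {
    final String basicForm;
    final String posTag;
    final String surface;

    TaggedToken(String basicForm, String posTag, String surface) {
        this.basicForm = basicForm;
        this.posTag = posTag;
        this.surface = surface;
    }
}

// Stand-in for sen's combined tokenize-and-tag step; the result is hard-coded.
final class FakeAnalyzer {
    static List<TaggedToken> analyze(String text) {
        return Arrays.asList(
            new TaggedToken("taberu", "verb", "tabe"),
            new TaggedToken("ta", "auxiliary", "ta"));
    }
}

// Current behavior: structured data is flattened to a String and parsed again.
final class CurrentApproach {
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (TaggedToken t : FakeAnalyzer.analyze(text)) {
            out.add(t.basicForm + " " + t.posTag + " " + t.surface);
        }
        return out;
    }

    static List<TaggedToken> tag(List<String> tokens) {
        List<TaggedToken> out = new ArrayList<>();
        for (String s : tokens) {
            String[] parts = s.split(" ");
            out.add(new TaggedToken(parts[0], parts[1], parts[2]));
        }
        return out;
    }
}

// Proposed behavior: the tokenizer keeps the analyzed tokens in a field.
final class ProposedTokenizer {
    private List<TaggedToken> analyzedTokens = new ArrayList<>();

    List<String> tokenize(String text) {
        analyzedTokens = FakeAnalyzer.analyze(text);
        return new ArrayList<>(); // empty placeholder instead of joined strings
    }

    List<TaggedToken> getAnalyzedTokens() {
        return analyzedTokens;
    }
}

// Proposed tagger: ignores its input and reads the tokenizer's saved state.
final class ProposedTagger {
    private final ProposedTokenizer tokenizer;

    ProposedTagger(ProposedTokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    List<TaggedToken> tag(List<String> ignoredTokens) {
        return tokenizer.getAnalyzedTokens();
    }
}
```

One thing the sketch makes visible: in the proposed variant the tagger only returns sensible results if tokenize() was called on the same tokenizer instance for the same text immediately beforehand, so the tokenizer becomes stateful. The current variant has no such ordering dependency, but it would break if any of the three fields ever contained a space.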
Cheers,
Silvan