Hi

I realized that the current implementations of JapaneseWordTokenizer
and JapaneseTagger work in quite an odd way.

Because the tagger library used by them (called 'sen') does the
tokenization and tagging in one step, these two steps cannot be separated
as cleanly as required by the interfaces used in LT.

The current implementation works as follows.

1. JapaneseWordTokenizer calls sen's analyze method which tokenizes the
   text and adds POS tags.
2. The JapaneseWordTokenizer concatenates the basic form, POS tag, and
   surface form with a ' ' (space char) and returns these concatenated
   strings in a List<String>.
3. When the JapaneseTagger is called with the above List<String> as
   input, it just splits each string on ' ' and uses the resulting strings
   to build the necessary AnalyzedTokenReadings.
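
To make the round trip concrete, here is a minimal sketch of steps 2 and 3.
The Token class and the tokenize/tag method names are hypothetical stand-ins,
not the actual sen or LT classes, and the sketch assumes that none of the
three fields contains a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only: no real sen or LanguageTool classes are used.
final class StringRoundTripSketch {

    /** One token as the analyzer would deliver it, already tagged. */
    static final class Token {
        final String basicForm;
        final String posTag;
        final String surface;

        Token(String basicForm, String posTag, String surface) {
            this.basicForm = basicForm;
            this.posTag = posTag;
            this.surface = surface;
        }
    }

    /** Step 2: flatten each tagged token into one space-separated string. */
    static List<String> tokenize(List<Token> analyzed) {
        List<String> result = new ArrayList<>();
        for (Token t : analyzed) {
            result.add(t.basicForm + " " + t.posTag + " " + t.surface);
        }
        return result;
    }

    /** Step 3: split each string again to recover the same three fields. */
    static List<Token> tag(List<String> tokens) {
        List<Token> result = new ArrayList<>();
        for (String s : tokens) {
            String[] parts = s.split(" ");
            result.add(new Token(parts[0], parts[1], parts[2]));
        }
        return result;
    }
}
```

Calling tag(tokenize(x)) only rebuilds what the analyzer had already
produced, which is the redundant work described here.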

That means we are using a String to pass structured data around, and we
concatenate and split each token unnecessarily. This is less than ideal.

My proposal would be to avoid this issue by working around the current
interface as follows.

1. JapaneseWordTokenizer calls sen's analyze method which tokenizes the
   text and adds POS tags. We save the tokenized and tagged items in a
   private "analyzedTokens" field of JapaneseWordTokenizer.
2. The JapaneseWordTokenizer just returns null (or an empty List<String>).
3. When the JapaneseTagger is called with the above (null/empty)
   List<String> as input, we ignore the input parameter. Instead we get the
   "analyzedTokens" field directly from the JapaneseWordTokenizer
   (a reference to which we saved within the JapaneseTagger)
   and build the needed AnalyzedTokenReadings directly.
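
A rough sketch of how the proposed steps could fit together is below. All
names in it (TaggedToken, Tokenizer, Tagger, getAnalyzedTokens) are
hypothetical, and the Function passed to the tokenizer only stands in for
sen's analyze call; this is not the real sen or LT API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins only: no real sen or LanguageTool classes are used.
final class SharedFieldSketch {

    /** One tokenized and tagged item. */
    static final class TaggedToken {
        final String basicForm;
        final String posTag;
        final String surface;

        TaggedToken(String basicForm, String posTag, String surface) {
            this.basicForm = basicForm;
            this.posTag = posTag;
            this.surface = surface;
        }
    }

    /** Stand-in for JapaneseWordTokenizer. */
    static final class Tokenizer {
        // Assumption: this function plays the role of sen's analyze method.
        private final Function<String, List<TaggedToken>> analyzer;
        private List<TaggedToken> analyzedTokens = Collections.emptyList();

        Tokenizer(Function<String, List<TaggedToken>> analyzer) {
            this.analyzer = analyzer;
        }

        /** Steps 1 and 2: keep the tagged items, return an empty list. */
        List<String> tokenize(String text) {
            analyzedTokens = new ArrayList<>(analyzer.apply(text));
            return Collections.emptyList();
        }

        List<TaggedToken> getAnalyzedTokens() {
            return Collections.unmodifiableList(analyzedTokens);
        }
    }

    /** Stand-in for JapaneseTagger; holds a reference to the tokenizer. */
    static final class Tagger {
        private final Tokenizer tokenizer;

        Tagger(Tokenizer tokenizer) {
            this.tokenizer = tokenizer;
        }

        /** Step 3: ignore the parameter and read the tokenizer's field. */
        List<TaggedToken> tag(List<String> ignoredTokens) {
            return tokenizer.getAnalyzedTokens();
        }
    }
}
```

In this sketch the tagger's result depends on the most recent tokenize call
on the same tokenizer instance, so the two calls have to happen in that
order and the pair should not be shared between threads.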

That way we violate some of the separation-of-concerns principles
(by getting the "analyzedTokens" for the JapaneseTagger from the
JapaneseWordTokenizer) but avoid having to concatenate-and-split each
input token String for nothing.

Before working on the implementation of these changes further I wanted
to ask whether you think this is the way to go or if we should stick to
the current behavior.


Cheers,

Silvan


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel