Re: [RFC]Japanese tokenization/tagging restructuring proposal
At Sun, 24 Aug 2014 14:21:52 +0200, Silvan Jegen wrote:

> Because the tagger library used by them (called 'sen') does the
> tokenization and tagging in one step, these two steps cannot be
> separated as cleanly as required by the interfaces used in LT.

Yes, almost all Japanese morphological analysis systems behave this way; it comes from a characteristic of the Japanese language. Japanese sentences have no separation between words. To analyze a sentence, a morphological analysis system computes over a dictionary containing a word list, POS tags, and a kind of score. The POS information is the key to determining where one word ends and the next begins.

> My proposal would be to avoid this issue by working around the current
> interface as follows.
>
> 1. JapaneseWordTokenizer calls sen's analyze method which tokenizes the
>    text and adds POS tags. We save the tokenized and tagged items in a
>    private analyzedTokens field of JapaneseWordTokenizer.
>
> 2. The JapaneseWordTokenizer just returns null (or an empty List<String>).
>
> 3. When the JapaneseTagger is called with the above (null/empty)
>    List<String> as input we ignore the input parameter. Instead we get the
>    analyzedTokens field directly from the JapaneseWordTokenizer (a
>    reference to which we saved within the JapaneseTagger) and build the
>    needed AnalyzedTokenReadings directly.

I think it would make sense.

> Before working on the implementation of these changes further I wanted
> to ask whether you think this is the way to go or if we should stick to
> the current behavior.

Maybe the change has no side effects.

--
Slashdot TV. Video for Nerds. Stuff that matters. http://tv.slashdot.org/
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
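[Editor's note: a minimal, runnable sketch of the side channel the quoted proposal describes. All class, record, and method names here are illustrative stand-ins; the real LanguageTool Tokenizer/Tagger interfaces and AnalyzedTokenReadings type differ.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the proposed workaround: the tokenizer keeps the combined
// analysis result in a private field and returns an empty list; the tagger
// ignores its input and reads that field through a saved reference.
public class SideChannelSketch {

    // Stand-in for one analyzed item (the real type is AnalyzedTokenReadings).
    record AnalyzedToken(String surface, String posTag) {}

    static class ProposedTokenizer {
        private List<AnalyzedToken> analyzedTokens = new ArrayList<>();

        // Step 1: run the combined analysis and keep the result privately.
        List<String> tokenize(String text) {
            analyzedTokens = analyze(text); // stand-in for sen's analyze()
            // Step 2: return an empty list instead of flattened strings.
            return Collections.emptyList();
        }

        List<AnalyzedToken> getAnalyzedTokens() {
            return analyzedTokens;
        }

        // Dummy analysis so the sketch is self-contained and runnable.
        private static List<AnalyzedToken> analyze(String text) {
            return List.of(new AnalyzedToken(text, "名詞"));
        }
    }

    static class ProposedTagger {
        private final ProposedTokenizer tokenizer; // saved reference (step 3)

        ProposedTagger(ProposedTokenizer tokenizer) {
            this.tokenizer = tokenizer;
        }

        // Step 3: ignore the (empty) input and read the tokenizer's field.
        List<AnalyzedToken> tag(List<String> ignored) {
            return tokenizer.getAnalyzedTokens();
        }
    }

    public static void main(String[] args) {
        ProposedTokenizer tokenizer = new ProposedTokenizer();
        ProposedTagger tagger = new ProposedTagger(tokenizer);
        List<String> tokens = tokenizer.tokenize("今日");
        System.out.println(tokens.isEmpty());                    // true
        System.out.println(tagger.tag(tokens).get(0).surface()); // 今日
    }
}
```

The coupling is visible in the constructor: the tagger only works if it is wired to the same tokenizer instance that just ran, which is the separation-of-concerns cost discussed in this thread.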
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On 2014-08-24 14:21, Silvan Jegen wrote:

> 3. When the JapaneseTagger is called with the above (null/empty)
>    List<String> as input we ignore the input parameter. Instead we get the
>    analyzedTokens field directly from the JapaneseWordTokenizer (a
>    reference to which we saved within the JapaneseTagger) and build the
>    needed AnalyzedTokenReadings directly.

While I agree that the current situation isn't exactly elegant, I think this would be equally confusing.

Regards
 Daniel
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On 2014-08-25 11:05, Daniel Naber wrote:

> On 2014-08-24 14:21, Silvan Jegen wrote:
>
>> 3. When the JapaneseTagger is called with the above (null/empty)
>>    List<String> as input we ignore the input parameter. Instead we get
>>    the analyzedTokens field directly from the JapaneseWordTokenizer (a
>>    reference to which we saved within the JapaneseTagger) and build the
>>    needed AnalyzedTokenReadings directly.
>
> While I agree that the current situation isn't exactly elegant, I think
> this would be equally confusing.

I agree that it would be about equally confusing (and inelegant), but at least it would save some unnecessary work for LT. If we document the reasoning behind the new behavior, I think the approach I suggest would be preferable.

Should I open a pull request on GitHub when I am done, or just leave it be?

Cheers,

Silvan
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On 2014-08-25 12:27, Silvan Jegen wrote:

> I agree that it would be about equally confusing (and inelegant), but at
> least it would save some unnecessary work for LT.

I don't think we should argue from performance unless there's a real-world use case that's actually too slow and we can show that the new solution is significantly faster.

Regards
 Daniel
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On Mon, Aug 25, 2014 at 12:47:06PM +0200, Daniel Naber wrote:

> On 2014-08-25 12:27, Silvan Jegen wrote:
>
>> I agree that it would be about equally confusing (and inelegant), but
>> at least it would save some unnecessary work for LT.
>
> I don't think we should argue from performance unless there's a
> real-world use case that's actually too slow and we can show that the
> new solution is significantly faster.

I don't know about a real-world use case, but I tested both implementations using languagetool-standalone.jar on a 114 MB text file. I ran both versions ten times; on average the suggested one was about 15% faster (note that the testing was not very rigorous, and the difference between runs was surprisingly high at times).

This simple testing also highlighted an oversight of mine: if the tokenized List<String> result is ignored, the replaceSoftHyphens function won't have anything to work with. That means at least some of the speed gain is due to this function not being run. Not handling soft hyphens does make sense for Japanese, since they are only very rarely used. They do seem to be allowed according to 3.1.10f in http://www.w3.org/TR/2009/NOTE-jlreq-20090604/, though.

Cheers,

Silvan
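[Editor's note: for readers unfamiliar with the step being skipped, here is a hypothetical illustration of soft-hyphen handling of the kind replaceSoftHyphens performs. This is a stand-in written for this note, not LT's actual implementation; it only shows the idea of removing U+00AD so it never reaches the tagger as part of a word.]

```java
// Hypothetical stand-in for the soft-hyphen handling that is skipped when
// the tokenizer returns an empty list; not LT's real replaceSoftHyphens.
public class SoftHyphenSketch {

    static String stripSoftHyphens(String s) {
        // U+00AD (SOFT HYPHEN) only marks an optional line-break point,
        // so it is removed before the token is processed further.
        return s.replace("\u00AD", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSoftHyphens("gram\u00ADmar")); // grammar
    }
}
```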
[RFC]Japanese tokenization/tagging restructuring proposal
Hi,

I realized that the current implementations of JapaneseWordTokenizer and JapaneseTagger work in quite an odd way. Because the tagger library they use (called 'sen') does the tokenization and tagging in one step, these two steps cannot be separated as cleanly as required by the interfaces used in LT.

The current implementation works as follows.

1. JapaneseWordTokenizer calls sen's analyze method, which tokenizes the text and adds POS tags.

2. The JapaneseWordTokenizer concatenates the basic form, POS tag, and surface form with a ' ' (space char) and returns these concatenated strings in a List<String>.

3. When the JapaneseTagger is called with the above List<String> as input, it just splits each string on ' ' and uses the resulting strings to build the necessary AnalyzedTokenReadings.

That means we are using a String to pass structured data around, concatenating and splitting each token unnecessarily. This is less than ideal.

My proposal would be to avoid this issue by working around the current interface as follows.

1. JapaneseWordTokenizer calls sen's analyze method, which tokenizes the text and adds POS tags. We save the tokenized and tagged items in a private analyzedTokens field of JapaneseWordTokenizer.

2. The JapaneseWordTokenizer just returns null (or an empty List<String>).

3. When the JapaneseTagger is called with the above (null/empty) List<String> as input, we ignore the input parameter. Instead we get the analyzedTokens field directly from the JapaneseWordTokenizer (a reference to which we saved within the JapaneseTagger) and build the needed AnalyzedTokenReadings directly.

That way we violate some separation-of-concerns principles (by getting the analyzedTokens for the JapaneseTagger from the JapaneseWordTokenizer) but avoid having to concatenate and split each input token String for nothing.
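[Editor's note: for concreteness, the concatenate-and-split round trip of the current implementation can be sketched as below. The class, record, and method names are illustrative stand-ins, not the actual LT or sen API.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the current round trip: structured data is flattened into a
// space-joined String by the tokenizer, then immediately split apart again
// by the tagger.
public class StringRoundTrip {

    // Stand-in for one item produced by sen's combined analysis.
    record Morpheme(String surface, String basicForm, String posTag) {}

    // Step 2: the tokenizer joins basic form, POS tag, and surface form
    // with a space and returns the flattened strings.
    static List<String> tokenize(List<Morpheme> analyzed) {
        List<String> out = new ArrayList<>();
        for (Morpheme m : analyzed) {
            out.add(m.basicForm() + " " + m.posTag() + " " + m.surface());
        }
        return out;
    }

    // Step 3: the tagger splits each string on ' ' to rebuild the fields.
    static Morpheme tag(String token) {
        String[] parts = token.split(" ");
        return new Morpheme(parts[2], parts[0], parts[1]);
    }

    public static void main(String[] args) {
        Morpheme m = new Morpheme("食べた", "食べる", "動詞");
        String flattened = tokenize(List.of(m)).get(0);
        // The round trip reproduces the original data, but only by doing
        // a concatenate and a split per token for nothing.
        System.out.println(m.equals(tag(flattened))); // true
    }
}
```

A further weakness of this scheme, beyond the wasted work: splitting on ' ' silently misparses any field that itself contains a space, which is a general hazard of passing structured data through a delimited String.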
Before working further on the implementation of these changes, I wanted to ask whether you think this is the way to go or whether we should stick to the current behavior.

Cheers,

Silvan