Re: [RFC]Japanese tokenization/tagging restructuring proposal
At Sun, 24 Aug 2014 14:21:52 +0200, Silvan Jegen wrote:

> Because the tagger library used by them (called 'sen') does the
> tokenization and tagging in one step, these two steps cannot be
> separated as cleanly as required by the interfaces used in LT.

Yes, almost all Japanese morphological analysis systems behave this way; it comes from a characteristic of the Japanese language. Japanese sentences have no separation between words. To analyze a sentence, a morphological analysis system computes over a dictionary containing a word list, POS tags, and a kind of score. The POS information is the key to determining where one word ends and the next begins.

> My proposal would be to avoid this issue by working around the current
> interface as follows.
>
> 1. JapaneseWordTokenizer calls sen's analyze method which tokenizes the
>    text and adds POS tags. We save the tokenized and tagged items in a
>    private analyzedTokens field of JapaneseWordTokenizer.
>
> 2. The JapaneseWordTokenizer just returns null (or an empty List<String>).
>
> 3. When the JapaneseTagger is called with the above (null/empty)
>    List<String> as input we ignore the input parameter. Instead we get the
>    analyzedTokens field directly from the JapaneseWordTokenizer (a
>    reference to which we saved within the JapaneseTagger) and build the
>    needed AnalyzedTokenReadings directly.

I think it would make sense.

> Before working on the implementation of these changes further I wanted
> to ask whether you think this is the way to go or if we should stick to
> the current behavior.

Maybe the change has no side effects.

--
Slashdot TV. Video for Nerds. Stuff that matters. http://tv.slashdot.org/
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
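[Editor's note: a minimal, runnable sketch of the side channel the quoted proposal describes. All class, record, and method names here are illustrative stand-ins; the real LanguageTool Tokenizer/Tagger interfaces and AnalyzedTokenReadings type differ.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the proposed workaround: the tokenizer keeps the combined
// analysis result in a private field and returns an empty list; the tagger
// ignores its input and reads that field through a saved reference.
public class SideChannelSketch {

    // Stand-in for one analyzed item (the real type is AnalyzedTokenReadings).
    record AnalyzedToken(String surface, String posTag) {}

    static class ProposedTokenizer {
        private List<AnalyzedToken> analyzedTokens = new ArrayList<>();

        // Step 1: run the combined analysis and keep the result privately.
        List<String> tokenize(String text) {
            analyzedTokens = analyze(text); // stand-in for sen's analyze()
            // Step 2: return an empty list instead of flattened strings.
            return Collections.emptyList();
        }

        List<AnalyzedToken> getAnalyzedTokens() {
            return analyzedTokens;
        }

        // Dummy analysis so the sketch is self-contained and runnable.
        private static List<AnalyzedToken> analyze(String text) {
            return List.of(new AnalyzedToken(text, "名詞"));
        }
    }

    static class ProposedTagger {
        private final ProposedTokenizer tokenizer; // saved reference (step 3)

        ProposedTagger(ProposedTokenizer tokenizer) {
            this.tokenizer = tokenizer;
        }

        // Step 3: ignore the (empty) input and read the tokenizer's field.
        List<AnalyzedToken> tag(List<String> ignored) {
            return tokenizer.getAnalyzedTokens();
        }
    }

    public static void main(String[] args) {
        ProposedTokenizer tokenizer = new ProposedTokenizer();
        ProposedTagger tagger = new ProposedTagger(tokenizer);
        List<String> tokens = tokenizer.tokenize("今日");
        System.out.println(tokens.isEmpty());                    // true
        System.out.println(tagger.tag(tokens).get(0).surface()); // 今日
    }
}
```

The coupling is visible in the constructor: the tagger only works if it is wired to the same tokenizer instance that just ran, which is the separation-of-concerns cost discussed in this thread.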
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On 2014-08-24 14:21, Silvan Jegen wrote:

> 3. When the JapaneseTagger is called with the above (null/empty)
>    List<String> as input we ignore the input parameter. Instead we get the
>    analyzedTokens field directly from the JapaneseWordTokenizer (a
>    reference to which we saved within the JapaneseTagger) and build the
>    needed AnalyzedTokenReadings directly.

While I agree that the current situation isn't exactly elegant, I think this would be equally confusing.

Regards
 Daniel
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On 2014-08-25 11:05, Daniel Naber wrote:

> On 2014-08-24 14:21, Silvan Jegen wrote:
>
>> 3. When the JapaneseTagger is called with the above (null/empty)
>>    List<String> as input we ignore the input parameter. Instead we get
>>    the analyzedTokens field directly from the JapaneseWordTokenizer (a
>>    reference to which we saved within the JapaneseTagger) and build the
>>    needed AnalyzedTokenReadings directly.
>
> While I agree that the current situation isn't exactly elegant, I think
> this would be equally confusing.

I agree that it would be about equally confusing (and inelegant), but at least it would save some unnecessary work for LT. If we document the reasoning behind the new behavior, I think the approach I suggest would be preferable.

Should I open a pull request on GitHub when I am done, or just leave it be?

Cheers,

Silvan
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On 2014-08-25 12:27, Silvan Jegen wrote:

> I agree that it would be about equally confusing (and inelegant), but at
> least it would save some unnecessary work for LT.

I don't think we should argue from performance unless there's a real-world use case that's actually too slow and we can show that the new solution is significantly faster.

Regards
 Daniel
Re: [RFC]Japanese tokenization/tagging restructuring proposal
On Mon, Aug 25, 2014 at 12:47:06PM +0200, Daniel Naber wrote:

> On 2014-08-25 12:27, Silvan Jegen wrote:
>
>> I agree that it would be about equally confusing (and inelegant), but
>> at least it would save some unnecessary work for LT.
>
> I don't think we should argue from performance unless there's a
> real-world use case that's actually too slow and we can show that the
> new solution is significantly faster.

I don't know about a real-world use case, but I tested both implementations using languagetool-standalone.jar on a 114 MB text file. I ran both versions ten times; on average the suggested one was about 15% faster (note that the testing was not very rigorous, and the difference between runs was surprisingly high at times).

This simple testing also highlighted an oversight of mine: if the tokenized List<String> result is ignored, the replaceSoftHyphens function won't have anything to work with. That means at least some of the speed gain is due to this function not being run. Not handling soft hyphens does make sense for Japanese, since they are only very rarely used. They do seem to be allowed according to 3.1.10f in http://www.w3.org/TR/2009/NOTE-jlreq-20090604/, though.

Cheers,

Silvan
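[Editor's note: for readers unfamiliar with the step being skipped, here is a hypothetical illustration of soft-hyphen handling of the kind replaceSoftHyphens performs. This is a stand-in written for this note, not LT's actual implementation; it only shows the idea of removing U+00AD so it never reaches the tagger as part of a word.]

```java
// Hypothetical stand-in for the soft-hyphen handling that is skipped when
// the tokenizer returns an empty list; not LT's real replaceSoftHyphens.
public class SoftHyphenSketch {

    static String stripSoftHyphens(String s) {
        // U+00AD (SOFT HYPHEN) only marks an optional line-break point,
        // so it is removed before the token is processed further.
        return s.replace("\u00AD", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSoftHyphens("gram\u00ADmar")); // grammar
    }
}
```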
[RFC]Japanese tokenization/tagging restructuring proposal
Hi,

I realized that the current implementations of JapaneseWordTokenizer and JapaneseTagger work in quite an odd way. Because the tagger library they use (called 'sen') does the tokenization and tagging in one step, these two steps cannot be separated as cleanly as required by the interfaces used in LT.

The current implementation works as follows.

1. JapaneseWordTokenizer calls sen's analyze method, which tokenizes the text and adds POS tags.

2. The JapaneseWordTokenizer concatenates the basic form, POS tag, and surface form with a ' ' (space char) and returns these concatenated strings in a List<String>.

3. When the JapaneseTagger is called with the above List<String> as input, it just splits each string on ' ' and uses the resulting strings to build the necessary AnalyzedTokenReadings.

That means we are using a String to pass structured data around, concatenating and splitting each token unnecessarily. This is less than ideal.

My proposal would be to avoid this issue by working around the current interface as follows.

1. JapaneseWordTokenizer calls sen's analyze method, which tokenizes the text and adds POS tags. We save the tokenized and tagged items in a private analyzedTokens field of JapaneseWordTokenizer.

2. The JapaneseWordTokenizer just returns null (or an empty List<String>).

3. When the JapaneseTagger is called with the above (null/empty) List<String> as input, we ignore the input parameter. Instead we get the analyzedTokens field directly from the JapaneseWordTokenizer (a reference to which we saved within the JapaneseTagger) and build the needed AnalyzedTokenReadings directly.

That way we violate some separation-of-concerns principles (by getting the analyzedTokens for the JapaneseTagger from the JapaneseWordTokenizer) but avoid having to concatenate and split each input token String for nothing.
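[Editor's note: for concreteness, the concatenate-and-split round trip of the current implementation can be sketched as below. The class, record, and method names are illustrative stand-ins, not the actual LT or sen API.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the current round trip: structured data is flattened into a
// space-joined String by the tokenizer, then immediately split apart again
// by the tagger.
public class StringRoundTrip {

    // Stand-in for one item produced by sen's combined analysis.
    record Morpheme(String surface, String basicForm, String posTag) {}

    // Step 2: the tokenizer joins basic form, POS tag, and surface form
    // with a space and returns the flattened strings.
    static List<String> tokenize(List<Morpheme> analyzed) {
        List<String> out = new ArrayList<>();
        for (Morpheme m : analyzed) {
            out.add(m.basicForm() + " " + m.posTag() + " " + m.surface());
        }
        return out;
    }

    // Step 3: the tagger splits each string on ' ' to rebuild the fields.
    static Morpheme tag(String token) {
        String[] parts = token.split(" ");
        return new Morpheme(parts[2], parts[0], parts[1]);
    }

    public static void main(String[] args) {
        Morpheme m = new Morpheme("食べた", "食べる", "動詞");
        String flattened = tokenize(List.of(m)).get(0);
        // The round trip reproduces the original data, but only by doing
        // a concatenate and a split per token for nothing.
        System.out.println(m.equals(tag(flattened))); // true
    }
}
```

A further weakness of this scheme, beyond the wasted work: splitting on ' ' silently misparses any field that itself contains a space, which is a general hazard of passing structured data through a delimited String.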
Before working further on the implementation of these changes, I wanted to ask whether you think this is the way to go or whether we should stick to the current behavior.

Cheers,

Silvan