Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread NOKUBI Takatsugu
At Sun, 24 Aug 2014 14:21:52 +0200, Silvan Jegen wrote:
> Because the tagger library used by them (called 'sen') does the tokenization and tagging in one step, these two steps cannot be separated as cleanly as required by the interfaces used in LT.
Yes, almost Japanese morphological analysis

Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread Daniel Naber
On 2014-08-24 14:21, Silvan Jegen wrote:
> 3. When the JapaneseTagger is called with the above (null/empty) List<String> as input we ignore the input parameter. Instead we get the analyzedTokens field directly from the JapaneseWordTokenizer (a reference to which we saved within

Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread Silvan Jegen
On 2014-08-25 11:05, Daniel Naber wrote:
> On 2014-08-24 14:21, Silvan Jegen wrote:
>> 3. When the JapaneseTagger is called with the above (null/empty) List<String> as input we ignore the input parameter. Instead we get the analyzedTokens field directly from the JapaneseWordTokenizer

Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread Daniel Naber
On 2014-08-25 12:27, Silvan Jegen wrote:
> I agree that it would be about equally confusing (and inelegant) but at least it would save some unnecessary work for LT.
I don't think we should argue with performance unless there's a real-world use case that's actually too slow and we can show that

Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread Silvan Jegen
On Mon, Aug 25, 2014 at 12:47:06PM +0200, Daniel Naber wrote:
> On 2014-08-25 12:27, Silvan Jegen wrote:
>> I agree that it would be about equally confusing (and inelegant) but at least it would save some unnecessary work for LT.
> I don't think we should argue with performance unless there's a

[RFC]Japanese tokenization/tagging restructuring proposal

2014-08-24 Thread Silvan Jegen
Hi, I realized that the current implementations of the JapaneseWordTokenizer and JapaneseTagger work in quite an odd way. Because the tagger library used by them (called 'sen') does the tokenization and tagging in one step, these two steps cannot be separated as cleanly as required by the
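
The idea quoted as step 3 in the replies (the tagger ignores its input list and reads the analyzed tokens that the tokenizer saved) can be illustrated with a small Java sketch. Everything here is an assumption made for illustration: `ToyAnalyzer`, `CachingTokenizer`, `CacheReadingTagger` and `AnalyzedToken` are hypothetical names, the toy analyzer only splits on whitespace, and none of it reproduces the real 'sen' API or the actual LT interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OneStepAnalysisSketch {

    /** A token together with its tag, as a one-step analyzer would return it. */
    static final class AnalyzedToken {
        final String surface;
        final String tag;

        AnalyzedToken(String surface, String tag) {
            this.surface = surface;
            this.tag = tag;
        }
    }

    /** Toy stand-in for a one-step analyzer; NOT the real 'sen' API. */
    static final class ToyAnalyzer {
        List<AnalyzedToken> analyze(String text) {
            List<AnalyzedToken> result = new ArrayList<>();
            for (String part : text.trim().split("\\s+")) {
                if (part.isEmpty()) {
                    continue; // happens only for blank input
                }
                // Tokenization and tagging happen in the same pass.
                String tag = part.chars().allMatch(Character::isDigit) ? "NUM" : "WORD";
                result.add(new AnalyzedToken(part, tag));
            }
            return result;
        }
    }

    /** Hypothetical tokenizer that keeps the full analysis around for later. */
    static final class CachingTokenizer {
        private final ToyAnalyzer analyzer = new ToyAnalyzer();
        private List<AnalyzedToken> analyzedTokens = Collections.emptyList();

        List<String> tokenize(String text) {
            analyzedTokens = analyzer.analyze(text);
            List<String> surfaces = new ArrayList<>();
            for (AnalyzedToken token : analyzedTokens) {
                surfaces.add(token.surface);
            }
            return surfaces;
        }

        List<AnalyzedToken> getAnalyzedTokens() {
            return analyzedTokens;
        }
    }

    /** Hypothetical tagger that ignores its input and reads the cached analysis. */
    static final class CacheReadingTagger {
        private final CachingTokenizer tokenizer;

        CacheReadingTagger(CachingTokenizer tokenizer) {
            this.tokenizer = tokenizer;
        }

        List<AnalyzedToken> tag(List<String> ignoredInput) {
            // The input list is deliberately unused, as in the quoted step 3.
            return tokenizer.getAnalyzedTokens();
        }
    }

    public static void main(String[] args) {
        CachingTokenizer tokenizer = new CachingTokenizer();
        CacheReadingTagger tagger = new CacheReadingTagger(tokenizer);
        tokenizer.tokenize("LT checks 42 rules");
        for (AnalyzedToken token : tagger.tag(Collections.<String>emptyList())) {
            System.out.println(token.surface + "/" + token.tag);
        }
    }
}
```

The sketch also makes the inelegance discussed in the thread visible: the tagger only returns sensible results if the tokenizer was called first on the same text, and that ordering dependency is not expressed anywhere in the method signatures.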