[RFC][PATCH] Add analyzed token readings to failed bad sentence test output

2014-08-10 Thread Silvan Jegen
Signed-off-by: Silvan Jegen s.je...@gmail.com --- Hi, I had difficulties when creating Japanese rules because the mecab program I used to determine the tokenization of the example phrases produced different tokens than the tokenization library used in LanguageTool. It took me quite a while to find

Re: [RFC][PATCH] Add analyzed token readings to failed bad sentence test output

2014-08-11 Thread Silvan Jegen
On 2014-08-11 01:07, Daniel Naber wrote: On 2014-08-10 17:37, Silvan Jegen wrote: If including the analyzed token readings is useful in other assertion messages as well, it may also be better to refactor the token reading code into its own function and make it less ad hoc. What do you

Re: [RFC][PATCH] Add analyzed token readings to failed bad sentence test output

2014-08-11 Thread Silvan Jegen
On 2014-08-11 10:18, Daniel Naber wrote: On 2014-08-11 09:01, Silvan Jegen wrote: Maybe it would be best to automatically generate a mail to this dev list whenever a GitHub issue has been opened... I agree. Do you know an easy way to set that up, or do we need to create a fake user

[RFC]Japanese tokenization/tagging restructuring proposal

2014-08-24 Thread Silvan Jegen
Hi, I realized that the current implementations of the JapaneseWordTokenizer and JapaneseTagger work in quite an odd way. Because the tagger library used by them (called 'sen') does the tokenization and tagging in one step, these two steps cannot be separated as cleanly as required by the

Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread Silvan Jegen
On 2014-08-25 11:05, Daniel Naber wrote: On 2014-08-24 14:21, Silvan Jegen wrote: 3. When the JapaneseTagger is called with the above (null/empty) List<String> as input we ignore the input parameter. Instead we get the analyzedTokens field directly from the JapaneseWordTokenizer

Re: [RFC]Japanese tokenization/tagging restructuring proposal

2014-08-25 Thread Silvan Jegen
On Mon, Aug 25, 2014 at 12:47:06PM +0200, Daniel Naber wrote: On 2014-08-25 12:27, Silvan Jegen wrote: I agree that it would be about equally confusing (and inelegant) but at least it would save some unnecessary work for LT. I don't think we should argue with performance unless there's

Re: Good way to handle invalid prolonged sound mark

2015-05-13 Thread Silvan Jegen
On Wed, May 13, 2015 at 05:16:10PM +0900, NOKUBI Takatsugu wrote: On Wed, 13 May 2015 09:12:14 +0200 Daniel Naber daniel.na...@languagetool.org wrote: <token regexp="yes">[\u3040-\u309F]+</token> <token>X</token> There are some exceptions, like the long consonant character (っ), but it is not bad to

Re: Good way to handle invalid prolonged sound mark

2015-05-13 Thread Silvan Jegen
Hi, thanks for considering writing a grammar rule for Japanese! On 2015-05-13 07:43, Takatsugu Nokubi wrote: I am considering writing a grammar rule for Japanese. ー (prolonged sound mark) is a popular symbol in Japanese. And the rule itself is simple: The symbol is placed after Hiragana or

Re: Possible bug in Japanese rule

2016-01-20 Thread Silvan Jegen
Heyho On Wed, Jan 20, 2016 at 03:09:29PM -0800, Rick Genter wrote: > There is a rule in the Japanese grammar.xml that says this: > > > > かっこいい > > 誤変換です。恰好良いの間違いです。 > あの人はかっこいい。 > > > A Japanese colleague of mine says that the suggestion is using the wrong > first character:

Re: probability theory code review?

2016-05-06 Thread Silvan Jegen
Hi, sadly, my math is weak but I will give it a try. Just make sure to re-check :) On Thu, Aug 06, 2015 at 11:29:05AM +0200, Daniel Naber wrote: > we're using a bit of probability theory to calculate ngram probabilities. > This way we can decide which word of a homophone pair like there/their > is