Thanks Furkan, this is the exact tool I am using, albeit in my code. I have tried all three tokenizer modes, e.g.

    new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
            JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
    new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
            JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
    new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
            JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())

and none of them tokenize the words as I want, so I was wondering if there is some way for me to actually "update" the dictionary/corpus so that these slang terms are caught by the tokenizer as single words.

My example text has been scraped from an "adult" website, so it might be offensive, and I apologize for that. A small excerpt from that website:

"裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"

On tokenizing, I get the list of tokens below. My problem is that, as per my in-house Japanese language expert, this list breaks the word "無臭正" into 無臭 and 正, whereas it should be caught as a single word:

裏 びでお 無料 無臭 正 動画 無料 無料 a 動画 裏 びでお 無料 無臭 正 動画 無料 無料 a 動画 se く くすい 動画 無料 裏 ビデオ ヘンリ 塚本 ウラビデライフ 無料 動画 セッ く 動画 無料

Thanks,
Rahul

On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <[email protected]> wrote:

> Hi;
>
> Here is the page of it that has an online Kuromoji tokenizer and
> information: http://www.atilika.org/ It may help you.
>
> Thanks;
> Furkan KAMACI
>
>
> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <[email protected]>:
>
> > I am trying to analyze some Japanese web pages for the presence of
> > slang/adult phrases in them using lucene-analyzers-kuromoji-4.6.0.jar.
> > While the tokenizer breaks up the text into proper words, I am more
> > interested in catching the slang terms, which seem to result from
> > combining various "safe" words.
> >
> > A few examples of words that, as per our in-house Japanese language
> > expert (I have no knowledge of Japanese whatsoever), are slang and
> > should be caught "unbroken":
> >
> > 無臭正 - is a bad word and we want to catch it as is, but the tokenizer
> > breaks it up into 無臭 and 正, which are both apparently safe.
> >
> > ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own
> > but bad when combined.
> >
> > 中出し - broken into 中 and 出し, but it should have been left as is, as
> > it represents a bad phrase.
> >
> > Any help on how I can use the kuromoji tokenizer or any alternatives
> > would be greatly appreciated.
> >
> > Thanks.
> >
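[Editor's note] The "update the dictionary" hook Rahul is asking about is the second JapaneseAnalyzer constructor argument, which is passed as null in the snippets above: it takes an org.apache.lucene.analysis.ja.dict.UserDictionary built from CSV entries of the form surface,segmentation,readings,part-of-speech. Keeping surface and segmentation identical (one segment) should force the whole phrase out as a single token. The sketch below illustrates this against Lucene 4.6; the katakana readings, the カスタム名詞 POS tag, and the class/field names are illustrative placeholders, not values from the thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class UserDictSketch {

    // Build an analyzer whose user dictionary forces the listed phrases
    // to come out as single tokens. Entry format:
    //   surface,segmentation,readings,part-of-speech
    // One segment (surface == segmentation) means "never split this".
    static JapaneseAnalyzer buildAnalyzer() throws IOException {
        String entries =
                "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"   // readings/POS are placeholders
              + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
              + "中出し,中出し,ナカダシ,カスタム名詞\n";
        UserDictionary userDict = new UserDictionary(new StringReader(entries));
        return new JapaneseAnalyzer(
                Version.LUCENE_46,
                userDict,                       // this was null in the code above
                JapaneseTokenizer.Mode.SEARCH,
                JapaneseAnalyzer.getDefaultStopSet(),
                JapaneseAnalyzer.getDefaultStopTags());
    }

    // Collect the analyzed tokens for a piece of text.
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<String>();
        TokenStream ts = buildAnalyzer().tokenStream("body", text);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            tokens.add(term.toString());
        }
        ts.end();
        ts.close();
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // With the user dictionary, 無臭正 should survive as one token
        // instead of being split into 無臭 + 正.
        System.out.println(tokenize("裏びでお無料・無臭正 動画無料"));
    }
}
```

The same CSV file can be maintained on disk and loaded with a FileReader, so the in-house expert can keep adding phrases without code changes.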
