Worked perfectly for Japanese. I have the same issue with the Chinese analyzer: I am using SmartChinese (lucene-analyzers-smartcn-4.6.0.jar), but I don't see an interface similar to the Japanese analyzer's. Is there an easy way to implement the same thing for Chinese?
On Mon, Mar 10, 2014 at 3:26 PM, Rahul Ratnakar <[email protected]> wrote:

> Thanks Robert. This was exactly what I was looking for; I will try this.
>
> On Mon, Mar 10, 2014 at 3:13 PM, Robert Muir <[email protected]> wrote:
>
>> You can pass a UserDictionary with your own entries to do this.
>>
>> On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar <[email protected]> wrote:
>>
>>> Thanks Furkan, this is the exact tool that I am using, albeit in my
>>> code. I have tried all search modes, e.g.
>>>
>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
>>>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>
>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
>>>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>
>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
>>>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>
>>> None of them seem to tokenize the words as I want, so I was wondering
>>> whether there is some way for me to actually "update" the dictionary/corpus
>>> so that these slang terms are caught by the tokenizer as single words.
>>>
>>> My example text has been scraped from an "adult" website, so it might be
>>> offensive, and I apologize for that. A small excerpt from that website:
>>>
>>> "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
>>> 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
>>>
>>> On tokenizing I get the list of tokens below. My problem is that, as per
>>> my in-house Japanese language expert, this list breaks up the word "無臭正"
>>> into 無臭 and 正, whereas it should be caught as a single word:
>>>
>>> 裏, びでお, 無料, 無臭, 正, 動画, 無料, 無料, a, 動画, 裏, びでお, 無料,
>>> 無臭, 正, 動画, 無料, 無料, a, 動画, se, く, くすい, 動画, 無料, 裏,
>>> ビデオ, ヘンリ, 塚本, ウラビデライフ, 無料, 動画, セッ, く, 動画, 無料
>>>
>>> Thanks,
>>>
>>> Rahul
>>>
>>> On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <[email protected]> wrote:
>>>
>>>> Hi;
>>>>
>>>> Here is the page that has an online Kuromoji tokenizer and more
>>>> information: http://www.atilika.org/ It may help you.
>>>>
>>>> Thanks;
>>>> Furkan KAMACI
>>>>
>>>> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <[email protected]>:
>>>>
>>>>> I am trying to analyze some Japanese web pages for the presence of
>>>>> slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
>>>>> tokenizer breaks the text up into proper words, I am more interested in
>>>>> catching the slang terms, which seem to result from combining various
>>>>> "safe" words.
>>>>>
>>>>> A few examples of words that, as per our in-house Japanese language
>>>>> expert (I have no knowledge of Japanese whatsoever), are slang and
>>>>> should be caught "unbroken":
>>>>>
>>>>> 無臭正 - a bad word that we want to catch as is, but the tokenizer
>>>>> breaks it up into 無臭 and 正, which are both apparently safe.
>>>>>
>>>>> ハメ撮り - broken into ハメ and 撮り, again both safe on their own but
>>>>> bad when combined.
>>>>>
>>>>> 中出し - broken into 中 and 出し, but should have been left as is, as it
>>>>> represents a bad phrase.
>>>>>
>>>>> Any help on how I can use the Kuromoji tokenizer, or any alternatives,
>>>>> would be greatly appreciated.
>>>>>
>>>>> Thanks.
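[Editor's note: Robert's UserDictionary suggestion can be sketched as below, against the Lucene 4.6 API the thread uses. Each CSV line is surface,segmentation,readings,part-of-speech; keeping the phrase as a single segment makes the tokenizer emit it unbroken. The readings and the POS tag カスタム名詞 here are illustrative, not prescribed.]

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class UserDictDemo {
    public static void main(String[] args) throws IOException {
        // One entry per line: surface,segmentation,readings,part-of-speech.
        // Each phrase is a single segment, so it is emitted as one token.
        String entries =
            "無臭正,無臭正,ムシュウセイ,カスタム名詞\n" +
            "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n" +
            "中出し,中出し,ナカダシ,カスタム名詞\n";
        UserDictionary userDict = new UserDictionary(new StringReader(entries));

        JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
            Version.LUCENE_46,
            userDict,                       // instead of null
            JapaneseTokenizer.Mode.SEARCH,
            JapaneseAnalyzer.getDefaultStopSet(),
            JapaneseAnalyzer.getDefaultStopTags());

        // Tokenize a snippet of the sample text and print each token;
        // 無臭正 should now survive as a single token.
        TokenStream ts = analyzer.tokenStream("body", "裏びでお無料・無臭正 動画無料");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}
```

In production you would typically load the entries from a UTF-8 dictionary file via a Reader rather than a string literal.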
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
