Thanks Robert. This is exactly what I was looking for; I will try it.
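
For the archives, here is a rough sketch of what I plan to try, based on Robert's suggestion: build a UserDictionary whose entries keep the slang terms whole, and pass it to JapaneseAnalyzer in place of the null argument I was using. The katakana readings and the カスタム名詞 part-of-speech tag are my own guesses (our language expert will need to confirm them), and the public UserDictionary(Reader) constructor is what I believe the 4.6 API exposes, so treat this as an untested sketch rather than a verified fix:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SlangDictionaryTest {
    public static void main(String[] args) throws IOException {
        // One user-dictionary entry per line:
        //   surface,segmentation,reading,part-of-speech
        // Readings and the part-of-speech tag below are guesses to be confirmed.
        String entries =
              "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"
            + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
            + "中出し,中出し,ナカダシ,カスタム名詞\n";

        // Lucene 4.6 exposes a public Reader constructor; newer releases
        // use UserDictionary.open(Reader) instead.
        UserDictionary userDict = new UserDictionary(new StringReader(entries));

        // Same constructor as before, but with the user dictionary instead of null.
        Analyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_46, userDict,
                JapaneseTokenizer.Mode.SEARCH,
                JapaneseAnalyzer.getDefaultStopSet(),
                JapaneseAnalyzer.getDefaultStopTags());

        // Tokenize a snippet of the problem text and print the tokens;
        // the dictionary entries should now come through unbroken.
        TokenStream ts = analyzer.tokenStream("body", "裏びでお無料・無臭正動画無料・ハメ撮り・中出し");
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}

In a real setup I would load the entries from a UTF-8 file rather than a hard-coded string, but the format is the same: surface, segmentation, reading, part-of-speech, one entry per line.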
On Mon, Mar 10, 2014 at 3:13 PM, Robert Muir <[email protected]> wrote:
> You can pass a UserDictionary with your own entries to do this.
>
> On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar <[email protected]> wrote:
> > Thanks Furkan. This is the exact tool that I am using, albeit in my own
> > code, and I have tried all of the search modes, e.g.:
> >
> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
> >
> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
> >
> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
> >
> > None of them tokenizes the words as I want, so I was wondering whether there
> > is some way for me to actually "update" the dictionary/corpus so that these
> > slang terms are caught by the tokenizer as single words.
> >
> > My example text has been scraped from an "adult" website, so it might be
> > offensive, and I apologize for that. A small excerpt from that website:
> >
> > "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
> > 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
> >
> > On tokenizing I get the list of tokens below. My problem is that, according
> > to my in-house Japanese language expert, this list breaks up the word "無臭正"
> > into 無臭 and 正, whereas it should be caught as a single word:
> >
> > 裏, びでお, 無料, 無臭, 正, 動画, 無料, 無料, a, 動画, 裏, びでお, 無料, 無臭, 正,
> > 動画, 無料, 無料, a, 動画, se, く, くすい, 動画, 無料, 裏, ビデオ, ヘンリ, 塚本,
> > ウラビデライフ, 無料, 動画, セッ, く, 動画, 無料
> >
> > Thanks,
> > Rahul
> >
> > On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <[email protected]> wrote:
> >> Hi;
> >>
> >> Here is its page, which has an online Kuromoji tokenizer and more
> >> information: http://www.atilika.org/ It may help you.
> >>
> >> Thanks;
> >> Furkan KAMACI
> >>
> >> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <[email protected]>:
> >>
> >> > I am trying to analyze some Japanese web pages for the presence of
> >> > slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
> >> > tokenizer breaks the text up into proper words, I am more interested in
> >> > catching the slang terms, which seem to result from combining various
> >> > "safe" words.
> >> >
> >> > A few examples of words that, according to our in-house Japanese language
> >> > expert (I have no knowledge of Japanese whatsoever), are slang and should
> >> > be caught "unbroken":
> >> >
> >> > 無臭正 - a bad word we want to catch as is, but the tokenizer breaks it up
> >> > into 無臭 and 正, which are both apparently safe.
> >> >
> >> > ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad
> >> > when combined.
> >> >
> >> > 中出し - broken into 中 and 出し, but should have been left as is, as it
> >> > represents a bad phrase.
> >> >
> >> > Any help on how I can use the Kuromoji tokenizer, or any alternatives,
> >> > would be greatly appreciated.
> >> >
> >> > Thanks.
