You need to call setOutputUnigrams(false) on the ShingleFilter (it defaults to true), with something like:
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source);
tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);
TokenFilter sf = new ShingleFilter(tokenStream, 3, 3);
((ShingleFilter) sf).setOutputUnigrams(false);
sf = new StopFilter(Version.LUCENE_43, sf, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
return new Analyzer.TokenStreamComponents(source, sf);
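I'm not sure the StopFilter will do you any good after the ShingleFilter if you're extracting only trigrams: it compares each whole token against the stop set, so a three-word shingle such as "the users get" will never match a stop word.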
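To see why the flag matters, here is a minimal plain-Java sketch. It does not use Lucene at all: ShingleSketch, shingles and removeStopWords are hypothetical names made up for illustration, and the loops only imitate what the two filters do conceptually, so treat the exact ordering of the output as an assumption rather than Lucene's real behaviour. It shows that with unigrams enabled, the stop filter removes only the single-word stop words and leaves every other single word in the stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShingleSketch {

    // For each token position, emit the single token (only when outputUnigrams
    // is true), then the shingle of `size` tokens starting there, if one fits.
    public static List<String> shingles(List<String> tokens, int size, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + size <= tokens.size()) {
                out.add(String.join(" ", tokens.subList(i, i + size)));
            }
        }
        return out;
    }

    // Drop only the terms that are exactly equal to a stop word; a multi-word
    // shingle is a different string, so it always survives.
    public static List<String> removeStopWords(List<String> terms, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (!stopWords.contains(term)) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "users", "get", "program", "in");
        Set<String> stop = new HashSet<>(Arrays.asList("the", "in"));

        // Unigrams on: "the" and "in" are filtered, the other single words remain.
        // prints [the users get, users, users get program, get, get program in, program]
        System.out.println(removeStopWords(shingles(tokens, 3, true), stop));

        // Unigrams off: only trigrams, and the stop filter changes nothing.
        // prints [the users get, users get program, get program in]
        System.out.println(removeStopWords(shingles(tokens, 3, false), stop));
    }
}
```

The first line of output has the same shape as the result in the original message: trigrams mixed with single words, minus the single words that happen to be stop words.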
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Malgorzata Urbanska
Sent: Thursday, July 18, 2013 6:02 PM
To: [email protected]
Subject: ShingleFilter
Hello,
For some time I have been trying to apply ShingleFilter. I have a string:
"The users get program in the User RPC API in Apache Rave"
and I would like to get:
[the users get] [users get program] [get program in] [program in the]
[in the user] [the user rpc] [user rpc api] [rpc api in] [api in apache]
[in apache rave] [apache rave 0.11]
However, I'm getting:
[the users get] [users] [users get program] [get] [get program in]
[program] [program in the] [in the user] [the user rpc] [user]
[user rpc api] [rpc] [rpc api in] [api] [api in apache] [in apache rave]
[apache] [apache rave 0.11] [rave]
Part of my code:
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source);
tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);
tokenStream = new ShingleFilter(tokenStream, 3, 3);
tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
return new TokenStreamComponents(source, tokenStream);
}
Could somebody please explain why I'm getting single-word shingles
when I set the min size to 3?
Thanks,
--
gosia
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]