Hi Bill,
I can think of two possible interpretations of "removing filler tokens":
1. Don't create shingles across stopwords, e.g. for text "one two three four
five" and stopword "three", bigrams only, you'd get ("one two", "four five"),
instead of the current ("one two", "two _", "_ four", "four five").
2. Create shingles as if the stopwords were never there, e.g. for the same text
and stopword, bigrams only, you'd get ("one two", "two four", "four five").
Which one did you have in mind? #2 can be achieved by adding PositionFilter
after StopFilter and before ShingleFilter. I think #1 requires ShingleFilter
modifications.
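To make the two interpretations concrete, here is a small Python sketch that
models the behaviour with plain lists. It is not Lucene code: the function
names are made up for illustration, and it assumes a simplified model in which
each token carries a position increment and every skipped position becomes one
"_" filler slot.

```python
FILLER = "_"

def remove_stopwords(tokens, stopwords):
    """Drop stopwords but remember the hole: return (term, position_increment)
    pairs, where a removed word widens the increment of the next kept term."""
    stream = []
    gap = 1
    for term in tokens:
        if term in stopwords:
            gap += 1
            continue
        stream.append((term, gap))
        gap = 1
    return stream

def shingles_with_fillers(stream, size=2):
    """Model of the current output: each skipped position becomes a filler
    slot, then a sliding window of `size` slots is joined into a shingle."""
    slots = []
    for term, increment in stream:
        slots.extend([FILLER] * (increment - 1))
        slots.append(term)
    return [" ".join(slots[i:i + size]) for i in range(len(slots) - size + 1)]

def shingles_not_crossing_gaps(stream, size=2):
    """Interpretation #1: discard every shingle that spans a removed stopword."""
    return [s for s in shingles_with_fillers(stream, size)
            if FILLER not in s.split(" ")]

def collapse_gaps(stream):
    """Interpretation #2: reset every increment to 1, as if the stopwords had
    never been in the text, before building shingles."""
    return [(term, 1) for term, _ in stream]

tokens = "one two three four five".split()
stream = remove_stopwords(tokens, {"three"})

current = shingles_with_fillers(stream)
option_1 = shingles_not_crossing_gaps(stream)
option_2 = shingles_with_fillers(collapse_gaps(stream))

print(current)   # ['one two', 'two _', '_ four', 'four five']
print(option_1)  # ['one two', 'four five']
print(option_2)  # ['one two', 'two four', 'four five']
```

In this model #2 is just a preprocessing step on the token stream, which is
why it can be done with a filter placed in front of the shingling step, while
#1 has to change what the shingling step itself emits.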
Steve
> -----Original Message-----
> From: William Koscho [mailto:[email protected]]
> Sent: Wednesday, May 11, 2011 12:05 AM
> To: [email protected]
> Subject: Can I omit ShingleFilter's filler tokens
>
> Hi,
>
> Can I remove the filler token _ from the n-gram-tokens that are generated by
> a ShingleFilter?
>
> I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter,
> and ShingleFilter to create phrase n-grams. The ShingleFilter inserts
> FILLER_TOKENs in place of the stopwords, but I don't want them.
>
> How can I omit the filler tokens?
>
> thanks
> Bill