I think of stopwords as an on/off scoring feature: the word is either there or
it is not. IDF is a proportional scoring feature: stopwords count less than
more meaningful words.

When using IDF, we don’t need to also use stopwords.

Also, IDF automatically finds the common words. Does the word “copyright”
appear in every single indexed document? If so, IDF will score that word as
stopword-like.
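
To make that concrete, here is a minimal sketch of the idea. It assumes the
BM25-style IDF formula, ln(1 + (N - df + 0.5) / (df + 0.5)), naive whitespace
tokenization, and a made-up three-document collection; none of that comes from
a real index.

```python
import math
from collections import Counter

def bm25_idf(doc_freq, num_docs):
    """IDF in the BM25 form: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical collection; every document carries the word "copyright".
docs = [
    "copyright acme search engine manual",
    "copyright acme relevance tuning guide",
    "copyright acme index replication notes",
]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

idf = {term: bm25_idf(df, len(docs)) for term, df in doc_freq.items()}

for term in ("copyright", "relevance"):
    print(f"{term}: df={doc_freq[term]} idf={idf[term]:.3f}")
```

“copyright” is in all three documents, so it gets a much smaller weight than
“relevance”, which is in only one. Nobody had to put it on a list.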

Yes, there can still be performance issues with very common words that are also 
used in queries.

If you do want to use stopwords, I’d first index without removing any, then
look at the words with the lowest IDF to make the list.
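
A rough sketch of that procedure, again assuming the BM25-style IDF formula
and whitespace tokenization. The function name lowest_idf_terms and the sample
documents are hypothetical. With a real index you would read document
frequencies from the index rather than recount them from raw text.

```python
import math
from collections import Counter

def lowest_idf_terms(docs, k):
    """Return the k terms with the lowest IDF as (term, idf) pairs."""
    num_docs = len(docs)
    # Count each term once per document to get document frequency.
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    scored = [
        (term, math.log(1 + (num_docs - df + 0.5) / (df + 0.5)))
        for term, df in doc_freq.items()
    ]
    # Lowest IDF first; ties broken alphabetically so the result is stable.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]

# Hypothetical documents, indexed with no stopword removal at all.
sample_docs = [
    "the copyright notice for the search manual",
    "the copyright notice for relevance tuning",
    "the copyright page of the replication guide",
    "the copyright statement and index notes",
]

candidates = [term for term, _ in lowest_idf_terms(sample_docs, 3)]
print(candidates)
```

The output is only a candidate list. Review it by hand before using it, since
a word can be very common in the collection and still matter in queries.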

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)


> On Sep 4, 2016, at 8:11 AM, Erick Erickson <[email protected]> wrote:
> 
> I can argue both ways as usual. Stopwords may have started as a way to help 
> deal with limited space/memory, but are things really any different now? We 
> just shove more and more data into the system and still have hardware 
> constraints to deal with that can be helped by squeezing out stopwords.
> 
> OTOH, how much time and energy do we spend trying to support them? Hmmm, 
> maybe the right thing to do is reconsider how they work. It seems like the 
> pain of supporting them is a consequence of them being a filter, then we get 
> into whether to preserve pos info and the like. Would it be easier if we 
> thought of them as pre-processing before any analysis chain even saw them? It 
> sure would be easier to explain as "it's as if they never existed" than the 
> present "it depends". This would certainly change behavior though....
> 
> 
> On Aug 29, 2016 18:36, "Walter Underwood" <[email protected]> wrote:
> I’ve never removed stopwords and I started working on search in 1996 at 
> Infoseek.
> 
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
> 
>> On Aug 29, 2016, at 6:32 PM, Alexandre Rafalovitch <[email protected]> wrote:
>> 
>> On 30 August 2016 at 08:18, Walter Underwood <[email protected]>
>> wrote (on Solr users list):
>>> Stop word removal is a hack left over from when we were running search 
>>> engines in 64 kbytes of memory.
>> 
>> If this is a leftover hack, should we start removing it from the
>> official examples?
>> 
>> Or do they still have value even with latest ranking algorithms?
>> 
>> Regards,
>>   Alex.
>> ----
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>> 
> 
