Re: Stopwords not working as expected

Grant Ingersoll Sun, 03 Jan 2010 07:24:57 -0800

On Jan 3, 2010, at 9:13 AM, Bogdan Vatkov wrote:

> Unfortunately it is all classified data that I cannot share. I will try to
> debug.


Can you reproduce it with generic documents?
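
For a generic reproduction, here is a minimal sketch of the ordering effect described further down the thread (stopword removal running before stemming). It is only an illustration under stated assumptions: the stopword set is hypothetical, the function names are made up for this example, and toy_stem is a crude suffix stripper standing in for the Snowball stemmer, not the real algorithm. Lowercasing up front loosely mimics ignoreCase="true".

```python
# Toy model of the analyzer order in the quoted schema: stop filter, then stemmer.
# Assumption: STOPWORDS stands in for a few hypothetical stopwords.txt entries.
STOPWORDS = {"use", "the"}


def toy_stem(token):
    """Crude suffix stripper; a stand-in for Snowball, NOT the real algorithm."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, stop_first=True):
    """Tokenize on whitespace, then apply stop filtering and stemming in the given order."""
    tokens = text.lower().split()
    if stop_first:
        # Same order as the schema: stopwords are removed before stemming.
        kept = [t for t in tokens if t not in STOPWORDS]
        return [toy_stem(t) for t in kept]
    # Reversed order, for comparison only: stem first, then remove stopwords.
    stemmed = [toy_stem(t) for t in tokens]
    return [t for t in stemmed if t not in STOPWORDS]


print(analyze("She uses the index"))                    # stop filter first
print(analyze("She uses the index", stop_first=False))  # stemmer first
```

With the stop filter first, "uses" is not in the stopword set, survives, and is then stemmed to "use", so a stopword form still ends up in the output. If something like this is happening, the indexed terms would contain stemmed forms that look like stopwords even though stopwords.txt was applied.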

> 
> On Sun, Jan 3, 2010 at 4:10 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> Is there any way you could zip up a small document set and your Solr home
>> and post it somewhere?
>> 
>> On Jan 3, 2010, at 9:08 AM, Bogdan Vatkov wrote:
>> 
>>> Yesterday I had issues with mapping cluster results to dictionary entries:
>>> it turned out that I was using a different dictionary, so the resulting
>>> clusters showed really strange results.
>>> But once I fixed all the commands, input/output files, etc., I got very good
>>> results from a clustering POV (I mean the clusters are quite correct given
>>> the input documents). Unfortunately, the clusters contained mostly
>>> words that I would like to stop, and which I placed in the
>>> stopwords.txt in Solr (re-indexed, restarted Solr, etc.).
>>> 
>>> Where do you suggest I debug the vector creation? It seems Solr respects the
>>> stopwords, but the vector creation (and then the clustering) does not.
>>> 
>>> On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <[email protected]> wrote:
>>> 
>>>> 
>>>> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
>>>> 
>>>>> I have a stopwords.txt file with 1200+ words. I did not understand the
>>>>> point about stemming: do you mean my stopwords are somehow ignored due to
>>>>> stemming?
>>>> 
>>>> No, stopword removal happens before stemming, so it is possible that a word
>>>> that was not stopped was then stemmed to a stopword.
>>>> 
>>>> I thought you said yesterday you got it straightened out.
>>>> 
>>>>> 
>>>>> On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <[email protected]> wrote:
>>>>> 
>>>>>> Are you sure you have stopwords and it is not the result of stemming some
>>>>>> other word?
>>>>>> 
>>>>>> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
>>>>>> 
>>>>>>> My Solr config is like the default one:
>>>>>>> 
>>>>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
>>>>>>>        stored="true"/>
>>>>>>>
>>>>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>>>>>   <analyzer type="index">
>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>     <filter class="solr.StopFilterFactory"
>>>>>>>             ignoreCase="true"
>>>>>>>             words="stopwords.txt"
>>>>>>>             enablePositionIncrements="true"
>>>>>>>             />
>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>>>             protected="protwords.txt"/>
>>>>>>>   </analyzer>
>>>>>>>   <analyzer type="query">
>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>             ignoreCase="true" expand="true"/>
>>>>>>>     <filter class="solr.StopFilterFactory"
>>>>>>>             ignoreCase="true"
>>>>>>>             words="stopwords.txt"
>>>>>>>             enablePositionIncrements="true"
>>>>>>>             />
>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>>>             protected="protwords.txt"/>
>>>>>>>   </analyzer>
>>>>>>> </fieldType>
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best regards,
>>>>> Bogdan
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Best regards,
>>> Bogdan
>> 
>> 
> 
> 
> -- 
> Best regards,
> Bogdan

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
