Is there any way you could zip up a small document set and your Solr home and post them somewhere?
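
In the meantime, here is a minimal sketch (plain Python, not Solr or Mahout code) of two ways a stopword can still end up in the indexed terms when the stop filter runs first, as it does in the index-time chain quoted below. One is a word that was not stopped being stemmed to a stopword. The other is a token such as "the," passing the stop filter with its comma attached, since the tokenizer splits on whitespace only, and losing the comma later. The function names, the two-word stopword set, and the plural-stripping "stemmer" are hypothetical stand-ins for illustration; whether either effect is what happens in your index is an assumption to verify.

```python
import re

# Hypothetical stand-in for stopwords.txt.
STOPWORDS = {"the", "thing"}


def stop_filter(tokens, stopwords):
    """Drop tokens that match a stopword, ignoring case (cf. ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in stopwords]


def strip_punctuation(tokens):
    """Toy stand-in for a word-delimiter step: keep only alphanumeric runs."""
    out = []
    for t in tokens:
        out.extend(re.findall(r"[A-Za-z0-9]+", t))
    return out


def toy_stem(token):
    """Toy stand-in for a stemmer: strip one trailing 's'. This is NOT Snowball."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text):
    """Same filter order as the quoted index analyzer: stop, delimit, lowercase, stem."""
    tokens = text.split()                    # whitespace tokenizer
    tokens = stop_filter(tokens, STOPWORDS)  # stop filter sees raw tokens like "the,"
    tokens = strip_punctuation(tokens)
    tokens = [t.lower() for t in tokens]
    return [toy_stem(t) for t in tokens]


def analyze_with_late_stop(text):
    """Variant that applies the stop filter once more after stemming."""
    return stop_filter(analyze(text), STOPWORDS)


if __name__ == "__main__":
    sample = "the things, the, end"
    print(analyze(sample))                 # stopwords leak through
    print(analyze_with_late_stop(sample))  # second pass removes them
```

If the real index behaves like this, the analysis page in the Solr admin UI should show the tokens for msg_body after each filter, which would tell you whether the stopwords survive in Solr itself or only show up later in the vector creation.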
On Jan 3, 2010, at 9:08 AM, Bogdan Vatkov wrote:

> Yesterday I had issues with mapping cluster results to dictionary entries -
> it happened that I was using different dictionary - therefore the result
> clusters shown really strange results.
> But once I fixed all the commands, input/output files, etc. I got very good
> result from clusterization POV (I mean clusters are quite correct having in
> mind the input documents) but unfortunately the clusters contained mostly
> words which I would like to stop - and which words I placed in the
> stopwords.txt in Solr (re-indexed, restarted Solr, etc.).
>
> Where do you suggest I debug the vector creation? Seems Solr respects the
> stopwords but not the vector creation (then clustering).
>
> On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
>>
>>> I have stopwords.txt file with 1200+ words, i did not understand this with
>>> the stemming - you mean my stopwords are somehow ignored due to some
>>> stemming or ?
>>
>> No, stopword removal happens before stemming so it is possible that a word
>> that was not stopped was then stemmed to a stopword.
>>
>> I thought you said yesterday you got it straightened out.
>>
>>>
>>> On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> Are you sure you have stopwords and it is not the result of stemming some
>>>> other word?
>>>>
>>>> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
>>>>
>>>>> my Solr config is like the default one:
>>>>>
>>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
>>>>>        stored="true"/>
>>>>>
>>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>>>   <analyzer type="index">
>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>     <filter class="solr.StopFilterFactory"
>>>>>             ignoreCase="true"
>>>>>             words="stopwords.txt"
>>>>>             enablePositionIncrements="true"
>>>>>     />
>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>             protected="protwords.txt"/>
>>>>>   </analyzer>
>>>>>   <analyzer type="query">
>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>             ignoreCase="true" expand="true"/>
>>>>>     <filter class="solr.StopFilterFactory"
>>>>>             ignoreCase="true"
>>>>>             words="stopwords.txt"
>>>>>             enablePositionIncrements="true"
>>>>>     />
>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>             protected="protwords.txt"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Bogdan
>>
>>
>
>
> --
> Best regards,
> Bogdan
