I assume you re-indexed, then used the VectorDumper (along with the dictionary) to dump the converted vectors and verified that no stop words appear in them?
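
One rough way to run that check directly against the dictionary written by --dictOut is sketched below. It is only a sketch under assumptions: the function name find_stopwords_in_dict is made up for illustration, and it assumes each dictionary line starts with the term followed by whitespace-separated columns, and that stopwords.txt holds one word per line with # comment lines. Dictionary terms have already been lowercased and stemmed by the analyzer chain, so a plain string comparison like this can miss stop words whose stemmed form differs.

```shell
#!/bin/bash
# Hypothetical helper (not part of Mahout or Solr): print every dictionary
# term that also appears in the stop word list.
# Assumptions: the term is the first whitespace-separated field of each
# dictionary line; stopwords.txt has one word per line, '#' starts a comment line.
find_stopwords_in_dict() {
  local stopwords_file="$1"
  local dict_file="$2"
  awk '
    # First file: load the stop words, skipping comments and blank lines.
    FILENAME == ARGV[1] {
      if ($0 ~ /^[ \t]*#/) next
      gsub(/^[ \t]+|[ \t]+$/, "")
      if ($0 != "") stop[tolower($0)] = 1
      next
    }
    # Second file: report dictionary terms that match a stop word.
    (tolower($1) in stop) { print $1 }
  ' "$stopwords_file" "$dict_file"
}
```

Calling it as find_stopwords_in_dict <path to stopwords.txt> <path to the dict file> prints one matching term per line; empty output means none of the listed stop words survived into the dictionary.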

On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:

> this is my Solr config:
>
> <field name="msg_body" type="text" termVectors="true" indexed="true"
> stored="true"/>
>
> and the type text is as configured by default:
>
> <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>     synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
>     <!-- Case insensitive stop word removal.
>       add enablePositionIncrements=true in both the index and query
>       analyzers to leave a 'gap' for more accurate phrase queries.
>     -->
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"
>             />
>     <filter class="solr.WordDelimiterFilterFactory"
>     generateWordParts="1" generateNumberParts="1" catenateWords="1"
>     catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>     protected="protwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>     ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"
>             />
>     <filter class="solr.WordDelimiterFilterFactory"
>     generateWordParts="1" generateNumberParts="1" catenateWords="0"
>     catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>     protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
>
> and I have entered quite some stopwords in the stopwords.txt file
>
> my SolrToMahout.sh file:
>
> #!/bin/bash
> set -x
> cd /store/dev/inst/mahout-0.2
> java -classpath
> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g')
> org.apache.mahout.utils.vectors.lucene.Driver --dir
> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
> --field msg_body --dictOut
> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>
> Best regards,
> Bogdan
>
> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
>
>> What do the relevant pieces of your Solr setup look like and how are you
>> invoking the Lucene driver?
>>
>> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
