This is my Solr config:
<field name="msg_body" type="text" termVectors="true" indexed="true"
stored="true"/>
The field type "text" is as configured by default:
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
</analyzer>
</fieldType>
I have also added quite a few stopwords to the stopwords.txt file.
My SolrToMahout.sh file:
#!/bin/bash
set -x
cd /store/dev/inst/mahout-0.2
java -classpath \
  /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$(echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --field msg_body \
  --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
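In case the classpath expression in the script looks cryptic: it only joins the expanded jar paths (plus the current directory) with colons. A minimal sketch of that step, using made-up jar names; join_classpath is just a helper name for illustration and is not part of Mahout:

```shell
#!/bin/bash
# Join a list of paths into a colon-separated classpath,
# the same way the sed expression in SolrToMahout.sh does.
# Caveat: this breaks if any path contains a space.
join_classpath() {
  echo "$@" | sed 's/ /:/g'
}

# Hypothetical jar names, for illustration only.
join_classpath lib/a.jar lib/b.jar .
```

With real paths, the shell expands the dependency/*.jar glob before echo runs, so every jar in that directory ends up on the classpath.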
Best regards,
Bogdan
On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll wrote:
> What do the relevant pieces of your Solr setup look like and how are you
> invoking the Lucene driver?
>
> -Grant