It can be very helpful to use Luke to view the index and verify that your indexing process is actually producing what you expect.
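For example, the standalone Luke jar can be launched and pointed at the Solr index directory; the jar name/version below is only illustrative, so use whichever Luke build matches your Lucene version:

# Jar name/version is illustrative; pick the Luke build for your Lucene version.
java -jar lukeall-0.9.9.1.jar
# Then open /store/dev/inst/apache-solr-1.4.0/example/solr/data/index in the UI
# and browse the terms of the msg_body field to check for surviving stop words.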
On Jan 2, 2010, at 8:34 AM, Bogdan Vatkov <[email protected]> wrote:
I re-indexed, but I cannot find a way to use the VectorDumper with the dictionary. I am using Mahout 0.2 rather than the very latest trunk code, since the latter was not compiling and I had to fall back to the older code.
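(For reference, on more recent trunk the dump step looks roughly like the sketch below. The class is org.apache.mahout.utils.vectors.VectorDumper, but the option names --seqFile, --dictionary, --dictionaryType and --output are taken from trunk and may well not exist in the 0.2 release; the paths are the ones from the SolrToMahout.sh script further down.)

# Hypothetical sketch -- trunk-era options, may not match Mahout 0.2.
# MAHOUT_CP is the same jar classpath built in SolrToMahout.sh below.
java -classpath "$MAHOUT_CP" org.apache.mahout.utils.vectors.VectorDumper \
  --seqFile /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --dictionary /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict \
  --dictionaryType text \
  --output /tmp/solr-vectors.txt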
On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
I assume you re-indexed and then used the VectorDumper (along with the dictionary) to dump out the converted vectors and verify that no stop words remain?
On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
This is my Solr config:
<field name="msg_body" type="text" termVectors="true" indexed="true" stored="true"/>
and the text field type is configured as in the default schema:
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and
query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>
</analyzer>
</fieldType>
and I have entered quite a few stopwords in the stopwords.txt file.
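For reference, stopwords.txt is a plain-text list with one term per line (lines starting with # are treated as comments), and ignoreCase="true" on the StopFilterFactory makes the matching case-insensitive. The entries below are only an illustration, not my actual list:

# illustrative entries only
a
an
and
the
of
to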
my SolrToMahout.sh file:
#!/bin/bash
set -x
cd /store/dev/inst/mahout-0.2
java -classpath /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$(echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --field msg_body \
  --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
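A quick sanity check on the result, assuming the file written by --dictOut is plain text with one term per row: grep it for a known stop word; if index-time stop word removal worked, there should be no hit.

# Should print nothing if stop words were removed at index time.
grep -iw the /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict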
Best regards,
Bogdan
On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
What do the relevant pieces of your Solr setup look like, and how are you invoking the Lucene driver?
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
--
Bogdan Vatkov
email: [email protected]