It can be very helpful to open the index in Luke and verify that your indexing process is producing what you expect.


On Jan 2, 2010, at 8:34 AM, Bogdan Vatkov <[email protected]> wrote:

I re-indexed, but I cannot find a way to use the VectorDumper with a dictionary. I am using Mahout 0.2 rather than the latest trunk code, because trunk was not compiling and I had to fall back to older code.

On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:

I assume you re-indexed, used the VectorDumper (along with the
dictionary) to dump the converted vectors, and verified that they contain no
stop words?
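
A rough version of this check can also be run directly against the dictionary file written by --dictOut, without the VectorDumper. The following is only a sketch under stated assumptions: it assumes the dictionary is a tab-delimited text file with the term in the first column (the exact layout for Mahout 0.2 is not confirmed here), that stopwords.txt holds one word per line with # comment lines, and the function name find_stopwords_in_dict is hypothetical.

```shell
#!/bin/bash
# find_stopwords_in_dict STOPWORDS_FILE DICT_FILE
# Prints every dictionary term that exactly matches a stop word (case-insensitive).
# Assumption: DICT_FILE is tab-delimited with the term in column 1.
find_stopwords_in_dict() {
  awk -F'\t' '
    NR == FNR {                        # first file: the stop word list
      sub(/[[:space:]]+$/, "")         # trim trailing whitespace
      if ($0 != "" && $0 !~ /^#/)      # skip blank lines and comments
        stop[tolower($0)] = 1
      next
    }
    tolower($1) in stop { print $1 }   # second file: the dictionary
  ' "$1" "$2"
}
```

Usage would look like `find_stopwords_in_dict stopwords.txt /path/to/dict`; empty output means no stop word survived as a term. Because the Snowball stemmer runs after the stop filter, dictionary terms are stemmed, so this only catches stop words that appear unchanged.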

On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:

This is my Solr config:

<field name="msg_body" type="text" termVectors="true" indexed="true" stored="true"/>

The text field type is configured as in the default schema:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <!-- Case insensitive stop word removal.
        add enablePositionIncrements=true in both the index and query
        analyzers to leave a 'gap' for more accurate phrase queries.
      -->
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords.txt"
              enablePositionIncrements="true"
              />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords.txt"
              enablePositionIncrements="true"
              />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
  </fieldType>

I have also added quite a few stop words to the stopwords.txt file.

My SolrToMahout.sh file:

#!/bin/bash
set -x
cd /store/dev/inst/mahout-0.2
java -classpath /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --field msg_body \
  --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
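
The classpath in the script above is assembled by echoing the dependency jars and replacing spaces with colons, which would break if any path contained a space. As a hedged alternative, the sketch below joins the entries in a loop instead. The function name build_classpath is hypothetical and not part of Mahout.

```shell
#!/bin/bash
# build_classpath DIR [EXTRA...]
# Joins every .jar in DIR, followed by any extra entries, with ':'.
build_classpath() {
  local dir="$1"; shift
  local cp="" entry
  for entry in "$dir"/*.jar "$@"; do
    [ -e "$entry" ] || continue    # skip an unmatched glob or a missing entry
    cp="${cp:+$cp:}$entry"         # add ':' only between entries
  done
  printf '%s\n' "$cp"
}
```

It could then be used as `java -classpath "$(build_classpath /path/to/dependency .)" ...`, with the actual directory substituted.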

Best regards,
Bogdan

On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:

What do the relevant pieces of your Solr setup look like and how are you
invoking the Lucene driver?

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search




--
Bogdan Vatkov
email: [email protected]
