It can be very helpful to use Luke to view the index and verify that your indexing process is actually producing what you expect.
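For example, the standalone Luke jar can be launched and pointed at the Solr index directory; the jar name/version below is only illustrative, so use whichever Luke build matches your Lucene version:

# Jar name/version is illustrative; pick the Luke build for your Lucene version.
java -jar lukeall-0.9.9.1.jar
# Then open /store/dev/inst/apache-solr-1.4.0/example/solr/data/index in the UI
# and browse the terms of the msg_body field to check for surviving stop words.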
On Jan 2, 2010, at 8:34 AM, Bogdan Vatkov <[email protected]> wrote:
I re-indexed, but I cannot find a way to use the VectorDumper with the dictionary. I am using Mahout 0.2 rather than the very latest trunk code, since the latter was not compiling and I had to fall back to the older code.
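(For reference, on more recent trunk the dump step looks roughly like the sketch below. The class is org.apache.mahout.utils.vectors.VectorDumper, but the option names --seqFile, --dictionary, --dictionaryType and --output are taken from trunk and may well not exist in the 0.2 release; the paths are the ones from the SolrToMahout.sh script further down.)

# Hypothetical sketch -- trunk-era options, may not match Mahout 0.2.
# MAHOUT_CP is the same jar classpath built in SolrToMahout.sh below.
java -classpath "$MAHOUT_CP" org.apache.mahout.utils.vectors.VectorDumper \
  --seqFile /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --dictionary /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict \
  --dictionaryType text \
  --output /tmp/solr-vectors.txt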
On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
I assume you re-indexed and then used the VectorDumper (along with the dictionary) to dump out the converted vectors and verify that no stop words remain?
On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
This is my Solr config:
<field name="msg_body" type="text" termVectors="true" indexed="true" stored="true"/>
and the text field type is configured as in the default schema:
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and
query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>
</analyzer>
</fieldType>
and I have entered quite a few stopwords in the stopwords.txt file.
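For reference, stopwords.txt is a plain-text list with one term per line (lines starting with # are treated as comments), and ignoreCase="true" on the StopFilterFactory makes the matching case-insensitive. The entries below are only an illustration, not my actual list:

# illustrative entries only
a
an
and
the
of
to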
my SolrToMahout.sh file:
#!/bin/bash
set -x
cd /store/dev/inst/mahout-0.2
java -classpath /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$(echo /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
  --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
  --field msg_body \
  --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
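A quick sanity check on the result, assuming the file written by --dictOut is plain text with one term per row: grep it for a known stop word; if index-time stop word removal worked, there should be no hit.

# Should print nothing if stop words were removed at index time.
grep -iw the /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict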
Best regards,
Bogdan
On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
What do the relevant pieces of your Solr setup look like, and how are you invoking the Lucene driver?
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
--
Bogdan Vatkov
email: [email protected]