When I use the TermVectorComponent, the search results do not contain stopwords, which seems fine for now. But when I use the Lucene Driver, I can see the stopwords in the dictionary file itself and later in the clusters. Is there a way to print the vectors with the real terms in place, instead of just indexes?
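To illustrate the kind of output being asked for, here is a minimal Python sketch that maps term indexes back to terms using the dictionary file. It rests on assumptions that are not confirmed anywhere in this thread: that the file written by --dictOut is tab-separated with the columns term, document frequency, index, and that a vector is already available as an index-to-weight mapping. The names load_dictionary and with_terms are hypothetical helpers, not part of Mahout.

```python
def load_dictionary(lines):
    """Parse dictionary lines into an {index: term} mapping.

    Assumed layout (not confirmed for any particular Mahout version):
    tab-separated columns term, document frequency, index. Lines that do
    not fit that shape, such as a count line or a header row, are skipped.
    """
    index_to_term = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not a term row
        term, _doc_freq, idx = parts
        if not idx.isdigit():
            continue  # header row or malformed index
        index_to_term[int(idx)] = term
    return index_to_term


def with_terms(vector, index_to_term):
    """Turn {index: weight} into (term, weight) pairs, heaviest first."""
    labelled = [
        (index_to_term.get(i, f"<unknown:{i}>"), weight)
        for i, weight in vector.items()
    ]
    return sorted(labelled, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample_dict = [
        "3\n",
        "#term\tdoc freq\tidx\n",
        "cluster\t12\t0\n",
        "the\t40\t1\n",
        "vector\t7\t2\n",
    ]
    mapping = load_dictionary(sample_dict)
    for term, weight in with_terms({1: 2.0, 0: 1.5}, mapping):
        print(term, weight)
```

With a real dictionary, the lines would come from open(path, encoding="utf-8") on the --dictOut file. The resulting terms could also be compared against stopwords.txt to see which stopwords made it into the vectors.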

On Sat, Jan 2, 2010 at 6:40 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:
>
> > I re-indexed but I cannot find a way to use the VectorDumper w/ Dictionary,
> > I am using mahout v 0.2 and not the very latest trunk code since the latter
> > was not compiling and I had to use older code.
>
> Hmm, I'm using trunk and it is compiling. You have to do "mvn install"
> from the root Mahout dir, if that helps at all.
>
> If you turn on the TermVectorComponent
> (http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your
> vectors look like? Do they have stopwords?
>
> >
> > On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
> >
> >> I assume you re-indexed and you used the VectorDumper (along with the
> >> dictionary) to dump out the Vectors that were converted and verified no
> >> stop words?
> >>
> >> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
> >>
> >>> this is my Solr config:
> >>>
> >>> <field name="msg_body" type="text" termVectors="true" indexed="true"
> >>>        stored="true"/>
> >>>
> >>> and the type text is as configured by default:
> >>>
> >>> <fieldType name="text" class="solr.TextField"
> >>>            positionIncrementGap="100">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <!-- in this example, we will only use synonyms at query time
> >>>     <filter class="solr.SynonymFilterFactory"
> >>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >>>     -->
> >>>     <!-- Case insensitive stop word removal.
> >>>          add enablePositionIncrements=true in both the index and query
> >>>          analyzers to leave a 'gap' for more accurate phrase queries.
> >>>     -->
> >>>     <filter class="solr.StopFilterFactory"
> >>>             ignoreCase="true"
> >>>             words="stopwords.txt"
> >>>             enablePositionIncrements="true"
> >>>             />
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >>>             protected="protwords.txt"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>             ignoreCase="true" expand="true"/>
> >>>     <filter class="solr.StopFilterFactory"
> >>>             ignoreCase="true"
> >>>             words="stopwords.txt"
> >>>             enablePositionIncrements="true"
> >>>             />
> >>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >>>             protected="protwords.txt"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> and I have entered quite some stopwords in the stopwords.txt file
> >>>
> >>> my SolrToMahout.sh file:
> >>>
> >>> #!/bin/bash
> >>> set -x
> >>> cd /store/dev/inst/mahout-0.2
> >>> java -classpath
> >>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
> >>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g')
> >>> org.apache.mahout.utils.vectors.lucene.Driver --dir
> >>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
> >>> --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
> >>> --field msg_body --dictOut
> >>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
> >>>
> >>> Best regards,
> >>> Bogdan
> >>>
> >>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
> >>>
> >>>> What do the relevant pieces of your Solr setup look like and how are you
> >>>> invoking the Lucene driver?
> >>>>
> >>>> -Grant
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem using Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>
> >>
> >
> >
> > --
> > Bogdan Vatkov
> > email: [email protected]
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

--
Bogdan Vatkov
email: [email protected]
