On Jan 2, 2010, at 11:56 AM, Bogdan Vatkov wrote:

> If I use the TermVectorComponent the search results do not contain stopwords
> - which seems to be ok at this point in time.
> But when I use the Lucene Driver I can see the stop words in the dictionary
> file alone and later in the clusters.
> Is there a way that I can print the vectors with the real terms in place -
> instead of just some indexes?

No, but it should be easy enough to modify ClusterDumper to do this. Let me double check mine, I'm pretty sure I don't have stopwords, but I haven't checked all the way down to the actual vector. All the Lucene Driver is doing is loading the term vector, so if it isn't in the term vector, I don't see how it can be in the Mahout vector. Could be a bug, though.

>
> On Sat, Jan 2, 2010 at 6:40 PM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:
>>
>>> I re-indexed but I cannot find a way to use the VectorDumper w/ Dictionary,
>>> I am using mahout v 0.2 and not the very latest trunk code since the latter
>>> was not compiling and I had to use older code.
>>
>> Hmm, I'm using trunk and it is compiling. You have to do "mvn install"
>> from the root Mahout dir, if that helps at all.
>>
>> If you turn on the TermVectorComponent
>> (http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your
>> vectors look like? Do they have stopwords?
>>
>>>
>>> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> I assume you re-indexed and you used the VectorDumper (along with the
>>>> dictionary) to dump out the Vectors that were converted and verified no stop
>>>> words?
>>>>
>>>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>>>>
>>>>> this is my Solr config:
>>>>>
>>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
>>>>> stored="true"/>
>>>>>
>>>>> and the type text is as configured by default:
>>>>>
>>>>> <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100">
>>>>>   <analyzer type="index">
>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>     <!-- in this example, we will only use synonyms at query time
>>>>>     <filter class="solr.SynonymFilterFactory"
>>>>>     synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>>     -->
>>>>>     <!-- Case insensitive stop word removal.
>>>>>     add enablePositionIncrements=true in both the index and query
>>>>>     analyzers to leave a 'gap' for more accurate phrase queries.
>>>>>     -->
>>>>>     <filter class="solr.StopFilterFactory"
>>>>>     ignoreCase="true"
>>>>>     words="stopwords.txt"
>>>>>     enablePositionIncrements="true"
>>>>>     />
>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>     generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>>     catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>     protected="protwords.txt"/>
>>>>>   </analyzer>
>>>>>   <analyzer type="query">
>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>     ignoreCase="true" expand="true"/>
>>>>>     <filter class="solr.StopFilterFactory"
>>>>>     ignoreCase="true"
>>>>>     words="stopwords.txt"
>>>>>     enablePositionIncrements="true"
>>>>>     />
>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>     generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>     catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>     protected="protwords.txt"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>>
>>>>> and I have entered quite some stopwords in the stopwords.txt file
>>>>>
>>>>> my SolrToMahout.sh file:
>>>>>
>>>>> #!/bin/bash
>>>>> set -x
>>>>> cd /store/dev/inst/mahout-0.2
>>>>> java -classpath
>>>>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo
>>>>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g')
>>>>> org.apache.mahout.utils.vectors.lucene.Driver --dir
>>>>> /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>>>> --output /store/dev/inst/mahout-0.2/clustering-example/solr/output
>>>>> --field msg_body --dictOut
>>>>> /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>>>>
>>>>> Best regards,
>>>>> Bogdan
>>>>>
>>>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>
>>>>>> What do the relevant pieces of your Solr setup look like and how are you
>>>>>> invoking the Lucene driver?
>>>>>>
>>>>>> -Grant
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>>
>>>
>>>
>>> --
>>> Bogdan Vatkov
>>> email: [email protected]
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>
>
> --
> Bogdan Vatkov
> email: [email protected]

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
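
P.S. For anyone following the thread, the two checks discussed above (printing a vector with terms instead of indexes, and looking for stopwords that leaked into the dictionary) could be sketched roughly as below. This is plain Java and does not use Mahout's ClusterDumper, VectorDumper, or any other Mahout class. The class name, the "term<TAB>index" dictionary layout, and the sample values are all assumptions made for illustration; the real dictionary file written by --dictOut may be laid out differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical helper; not part of Mahout.
public class TermVectorPrinter {

    // Parses ASSUMED "term<TAB>index" lines into an index -> term map.
    // Lines with fewer than two tab-separated fields are skipped.
    static Map<Integer, String> parseDictionary(Iterable<String> lines) {
        Map<Integer, String> indexToTerm = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length < 2) {
                continue;
            }
            indexToTerm.put(Integer.parseInt(parts[1].trim()), parts[0]);
        }
        return indexToTerm;
    }

    // Renders a sparse vector (index -> weight) as "term:weight" pairs in
    // ascending index order. Indexes missing from the dictionary are shown
    // as "#<index>" so they stay visible.
    static String format(Map<Integer, Double> vector, Map<Integer, String> dict) {
        StringJoiner out = new StringJoiner(", ", "[", "]");
        for (Map.Entry<Integer, Double> e : new TreeMap<>(vector).entrySet()) {
            String term = dict.getOrDefault(e.getKey(), "#" + e.getKey());
            out.add(term + ":" + e.getValue());
        }
        return out.toString();
    }

    // Returns the dictionary terms that also occur in the stopword set,
    // compared case-insensitively, in dictionary order.
    static List<String> leakedStopwords(Map<Integer, String> dict, Set<String> stopwords) {
        Set<String> lowered = new HashSet<>();
        for (String s : stopwords) {
            lowered.add(s.toLowerCase());
        }
        List<String> leaked = new ArrayList<>();
        for (String term : dict.values()) {
            if (lowered.contains(term.toLowerCase())) {
                leaked.add(term);
            }
        }
        return leaked;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<Integer, String> dict =
                parseDictionary(Arrays.asList("solr\t0", "the\t1", "mahout\t2"));
        Map<Integer, Double> vector = new HashMap<>();
        vector.put(2, 1.5);
        vector.put(0, 3.0);
        System.out.println(format(vector, dict));
        System.out.println(leakedStopwords(dict, new HashSet<>(Arrays.asList("The", "a"))));
    }
}
```

If leakedStopwords comes back non-empty for the dictionary while the TermVectorComponent output looks clean, that would point at the conversion step rather than at the Solr analyzer chain.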
