Re: Stopwords work for Solr but not for Mahout

Grant Ingersoll Sat, 02 Jan 2010 09:24:53 -0800

On Jan 2, 2010, at 11:56 AM, Bogdan Vatkov wrote:

> If I use the TermVectorComponent the search results do not contain stopwords
> - which seems to be ok at this point in time.
> But when I use the Lucene Driver I can see the stop words in the dictionary
> file alone and later in the clusters.
> Is there a way that I can print the vectors with the real terms in place -
> instead of just some indexes?


No, but it should be easy enough to modify ClusterDumper to do this.  
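
Until ClusterDumper does this, the mapping can be sketched outside Mahout
with the dictionary file written by --dictOut. The function below is only an
illustration under assumptions that are not confirmed in this thread: the name
terms_for_vector is hypothetical, the dictionary is assumed to hold
tab-separated "term, doc freq, index" lines (anything else is skipped), and
the vector is assumed to be available as a plain line of "index:weight" pairs.

```shell
# Hypothetical helper: rewrite "index:weight" pairs as "term:weight".
# Assumed dictionary line format: term<TAB>docFreq<TAB>index
terms_for_vector() {
  dict_file="$1"    # dictionary file produced via --dictOut (format assumed)
  vector_line="$2"  # e.g. "0:1.5 2:0.7" (format assumed)
  awk -F'\t' -v vec="$vector_line" '
    # Keep only lines with three fields whose last field is a numeric index.
    NF == 3 && $3 ~ /^[0-9]+$/ { term[$3] = $1 }
    END {
      n = split(vec, pairs, " ")
      out = ""
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")
        # Fall back to the raw index when the dictionary has no entry for it.
        name = (kv[1] in term) ? term[kv[1]] : kv[1]
        out = out (i > 1 ? " " : "") name ":" kv[2]
      }
      print out
    }
  ' "$dict_file"
}
```

Calling it as terms_for_vector path/to/dict "0:1.5 2:0.7" prints the same
pairs with terms in place of the indexes found in the dictionary.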

Let me double-check mine. I'm pretty sure I don't have stopwords, but I haven't
checked all the way down to the actual vector. All the Lucene Driver does is
load the term vector, so if a term isn't in the term vector, I don't see how
it can end up in the Mahout vector. It could be a bug, though.
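
One quick way to narrow this down is to compare stopwords.txt against the
dictionary file directly. The sketch below is a rough check, not part of
Mahout: the function name stopwords_in_dict is hypothetical, and it assumes
the same tab-separated "term, doc freq, index" dictionary layout and a
non-empty stopwords file with one word per line and "#" comments. Because the
analyzer stems terms after stop word removal, an exact match like this can
miss stopwords that were altered by stemming.

```shell
# Hypothetical check: print dictionary terms that are also listed as stopwords.
stopwords_in_dict() {
  stopwords_file="$1"  # one stopword per line, "#" lines are comments (assumed)
  dict_file="$2"       # dictionary file produced via --dictOut (format assumed)
  awk -F'\t' '
    NR == FNR {        # first file: collect stopwords, case-insensitively
      if ($0 !~ /^#/ && $0 != "") stop[tolower($0)] = 1
      next
    }
    # Second file: report terms on well-formed dictionary lines that match.
    NF == 3 && $3 ~ /^[0-9]+$/ && (tolower($1) in stop) { print $1 }
  ' "$stopwords_file" "$dict_file"
}
```

If this prints nothing, the stopwords are at least not in the dictionary under
their original spelling; any output lists the terms worth tracing back to the
index.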

> 
> On Sat, Jan 2, 2010 at 6:40 PM, Grant Ingersoll <[email protected]> wrote:
> 
>> 
>> On Jan 2, 2010, at 11:34 AM, Bogdan Vatkov wrote:
>> 
>>> I re-indexed, but I cannot find a way to use the VectorDumper with a
>>> dictionary. I am using Mahout v0.2 rather than the latest trunk code,
>>> since the latter was not compiling and I had to use older code.
>> 
>> Hmm, I'm using trunk and it is compiling.  You have to do "mvn install"
>> from the root Mahout dir, if that helps at all.
>> 
>> If you turn on the TermVectorComponent (
>> http://wiki.apache.org/solr/TermVectorComponent) in Solr, what do your
>> vectors look like?  Do they have stopwords?
>> 
>>> 
>>> On Sat, Jan 2, 2010 at 5:54 PM, Grant Ingersoll <[email protected]> wrote:
>>> 
>>>> I assume you re-indexed, used the VectorDumper (along with the
>>>> dictionary) to dump out the converted Vectors, and verified that they
>>>> contain no stop words?
>>>> 
>>>> On Jan 2, 2010, at 9:03 AM, Bogdan Vatkov wrote:
>>>> 
>>>>> this is my Solr config:
>>>>> 
>>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
>>>>> stored="true"/>
>>>>> 
>>>>> and the type text is as configured by default:
>>>>> 
>>>>>  <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100">
>>>>>    <analyzer type="index">
>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>      <!-- in this example, we will only use synonyms at query time
>>>>>      <filter class="solr.SynonymFilterFactory"
>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>>      -->
>>>>>      <!-- Case insensitive stop word removal.
>>>>>        add enablePositionIncrements=true in both the index and query
>>>>>        analyzers to leave a 'gap' for more accurate phrase queries.
>>>>>      -->
>>>>>      <filter class="solr.StopFilterFactory"
>>>>>              ignoreCase="true"
>>>>>              words="stopwords.txt"
>>>>>              enablePositionIncrements="true"
>>>>>              />
>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>> protected="protwords.txt"/>
>>>>>    </analyzer>
>>>>>    <analyzer type="query">
>>>>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>> ignoreCase="true" expand="true"/>
>>>>>      <filter class="solr.StopFilterFactory"
>>>>>              ignoreCase="true"
>>>>>              words="stopwords.txt"
>>>>>              enablePositionIncrements="true"
>>>>>              />
>>>>>      <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>> protected="protwords.txt"/>
>>>>>    </analyzer>
>>>>>  </fieldType>
>>>>> 
>>>>> and I have entered quite some stopwords in the stopwords.txt file
>>>>> 
>>>>> my SolrToMahout.sh file:
>>>>> 
>>>>> #!/bin/bash
>>>>> set -x
>>>>> cd /store/dev/inst/mahout-0.2
>>>>> java -classpath \
>>>>> /store/dev/inst/mahout-0.2/utils/target/mahout-utils-0.2.jar:$( echo \
>>>>> /store/dev/inst/mahout-0.2/utils/target/dependency/*.jar . | sed 's/ /:/g') \
>>>>> org.apache.mahout.utils.vectors.lucene.Driver \
>>>>> --dir /store/dev/inst/apache-solr-1.4.0/example/solr/data/index \
>>>>> --output /store/dev/inst/mahout-0.2/clustering-example/solr/output \
>>>>> --field msg_body \
>>>>> --dictOut /store/dev/inst/mahout-0.2/clustering-example/solr_dict/dict
>>>>> 
>>>>> Best regards,
>>>>> Bogdan
>>>>> 
>>>>> On Sat, Jan 2, 2010 at 3:49 PM, Grant Ingersoll <[email protected]> wrote:
>>>>> 
>>>>>> What do the relevant pieces of your Solr setup look like and how are
>>>>>> you invoking the Lucene driver?
>>>>>> 
>>>>>> -Grant
>>>> 
>>> 
>>> 
>>> --
>>> Bogdan Vatkov
>>> email: [email protected]
>> 
> 
> 
> -- 
> Bogdan Vatkov
> email: [email protected]

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
