Yes, I can do that.
In the meantime I changed the Driver a little bit to apply the stopwords by
force:

in TFDFMapper:
...
private List<String> stopwords;

public TFDFMapper(IndexReader reader, Weight weight, TermInfo termInfo,
                  File stopwordsFile) {
  this.reader = reader;
  this.weight = weight;
  this.termInfo = termInfo;
  this.numDocs = reader.numDocs();
  this.stopwords = getContents(stopwordsFile);
}
...
public void map(String term, int frequency, TermVectorOffsetInfo[] offsets,
                int[] positions) {
  TermEntry entry = termInfo.getTermEntry(field, term);
  // skip any term from the stopword list so it never makes it into the vector
  if (entry != null && !stopwords.contains(term)) {
    vector.setQuick(entry.termIdx,
        weight.calculate(frequency, entry.docFreq, numTerms, numDocs));
  }
}
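
The getContents() helper is not shown above - it just reads the stopwords file
one term per line. Roughly something along these lines (treat it as a sketch;
it needs java.io.BufferedReader/FileReader and java.util.ArrayList imports):

// Sketch of the helper used above: one stopword per line, blank lines skipped.
private List<String> getContents(File stopwordsFile) {
  List<String> words = new ArrayList<String>();
  BufferedReader in = null;
  try {
    in = new BufferedReader(new FileReader(stopwordsFile));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() > 0) {
        words.add(line);
      }
    }
  } catch (IOException e) {
    throw new IllegalStateException("Could not read stopwords file: "
        + stopwordsFile, e);
  } finally {
    if (in != null) {
      try { in.close(); } catch (IOException ignored) { }
    }
  }
  return words;
}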


and in Driver:
...
String stopwordsFile = cmdLine.getValue(stopwordsOpt).toString();
VectorMapper mapper = new TFDFMapper(reader, weight, termInfo,
    new File(stopwordsFile));
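
(stopwordsOpt itself is just another commons-cli2 option added to the Driver's
existing option group - roughly something like the sketch below; the long/short
names, description, and variable names are only what I picked, nothing standard:)

// Sketch: built with the commons-cli2 DefaultOptionBuilder/ArgumentBuilder
// already used for the Driver's other options (builder variable names here
// are illustrative).
Option stopwordsOpt = obuilder.withLongName("stopwords").withRequired(false)
    .withArgument(abuilder.withName("stopwords")
        .withMinimum(1).withMaximum(1).create())
    .withDescription("File of terms to drop from the vectors, one per line")
    .withShortName("sw").create();
// ... then registered in the same Group as the other options, so
// cmdLine.getValue(stopwordsOpt) works as shown above.
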
I am currently waiting to see the result clusters.

But you are right, I will try to run a smaller set of docs so that I can
debug more easily (and share the docs).
Will come back shortly.



On Sun, Jan 3, 2010 at 5:24 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Jan 3, 2010, at 9:13 AM, Bogdan Vatkov wrote:
>
> > Unfortunately it is all classified data that I cannot share; I will try to
> > debug.
>
> Can you reproduce w/ generic documents?
>
> >
> > On Sun, Jan 3, 2010 at 4:10 PM, Grant Ingersoll <[email protected]>
> > wrote:
> >
> >> Is there any way you could zip up a small document set and your Solr home
> >> and post them somewhere?
> >>
> >> On Jan 3, 2010, at 9:08 AM, Bogdan Vatkov wrote:
> >>
> >>> Yesterday I had issues with mapping cluster results to dictionary
> >>> entries - it turned out I was using a different dictionary, so the
> >>> resulting clusters showed really strange results.
> >>> But once I fixed all the commands, input/output files, etc. I got very
> >>> good results from a clusterization point of view (I mean the clusters are
> >>> quite correct, having in mind the input documents), but unfortunately the
> >>> clusters contained mostly words which I would like to stop - words which I
> >>> placed in stopwords.txt in Solr (re-indexed, restarted Solr, etc.).
> >>>
> >>> Where do you suggest I debug the vector creation? It seems Solr respects
> >>> the stopwords but the vector creation (and then the clustering) does not.
> >>>
> >>> On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <[email protected]>
> >>> wrote:
> >>>
> >>>>
> >>>> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
> >>>>
> >>>>> I have a stopwords.txt file with 1200+ words; I did not understand the
> >>>>> point about the stemming - you mean my stopwords are somehow ignored
> >>>>> due to some stemming, or?
> >>>>
> >>>> No, stopword removal happens before stemming so it is possible that a
> >>>> word that was not stopped was then stemmed to a stopword.
> >>>>
> >>>> I thought you said yesterday you got it straightened out.
> >>>>
> >>>>>
> >>>>> On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Are you sure you have stopwords and it is not the result of stemming
> >>>>>> some other word?
> >>>>>>
> >>>>>> On Jan 3, 2010, at 7:57 AM, Bogdan Vatkov wrote:
> >>>>>>
> >>>>>>> my Solr config is like the default one:
> >>>>>>>
> >>>>>>> <field name="msg_body" type="text" termVectors="true" indexed="true"
> >>>>>>>        stored="true"/>
> >>>>>>>
> >>>>>>> <fieldType name="text" class="solr.TextField"
> >>>>>>>            positionIncrementGap="100">
> >>>>>>>   <analyzer type="index">
> >>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>     <filter class="solr.StopFilterFactory"
> >>>>>>>             ignoreCase="true"
> >>>>>>>             words="stopwords.txt"
> >>>>>>>             enablePositionIncrements="true"
> >>>>>>>             />
> >>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>             generateWordParts="1" generateNumberParts="1"
> >>>>>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
> >>>>>>>             splitOnCaseChange="1"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >>>>>>>             protected="protwords.txt"/>
> >>>>>>>   </analyzer>
> >>>>>>>   <analyzer type="query">
> >>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>>>>             ignoreCase="true" expand="true"/>
> >>>>>>>     <filter class="solr.StopFilterFactory"
> >>>>>>>             ignoreCase="true"
> >>>>>>>             words="stopwords.txt"
> >>>>>>>             enablePositionIncrements="true"
> >>>>>>>             />
> >>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>>>>             generateWordParts="1" generateNumberParts="1"
> >>>>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
> >>>>>>>             splitOnCaseChange="1"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
> >>>>>>>             protected="protwords.txt"/>
> >>>>>>>   </analyzer>
> >>>>>>> </fieldType>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Best regards,
> >>>>> Bogdan
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>> Bogdan
> >>
> >>
> >
> >
> > --
> > Best regards,
> > Bogdan
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Best regards,
Bogdan
