Re: Stopwords work for Solr but not for Mahout

Grant Ingersoll Sat, 02 Jan 2010 12:57:12 -0800

On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:

> I've managed to get k-means clustering working, but I agree it would be very
> nice to have an end-to-end example that would allow others to get up to
> speed quickly. I think the largest holes here are related to the vacuum of a
> corpus of text into the Lucene index and the presentation of a
> human-readable display of the results. It might be interesting to also
> calculate and include some metrics such as the F-measure (in cases where we
> have a reference categorization) and scatter score (in cases where we
> don't).
> 
> The existing LDA example would be a useful starting point. It slurps
> in the Reuters-21578
> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
> converts it to text, loads it into a Lucene index, extracts vectors from the
> Lucene index and runs LDA on them.
> 
> This example uses the Lucene benchmark utilities for the input-to-text
> conversion and Lucene loading. The benchmark utilities code is readable but
> complex. It would be very nice to have a simple piece of code to handle the
> creation of the Lucene index that others can easily build upon and adapt
> to their existing corpus.
>



+1.

I've also got this working for a bunch of RSS feeds using Solr's 
DataImportHandler and the following commands:

In Solr, I set up the DataImportHandler with something like:
<dataConfig>
  <dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
  <document>
    <!-- New York Times Sports feed -->
    <entity name="nytSportsFeed"
            pk="link"
            url="http://feeds1.nytimes.com/nyt/rss/Sports"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            dataSource="rss"
            transformer="RegexTransformer,DateFormatTransformer">
      <field column="source" xpath="/rss/channel/title" commonField="true" />
      <field column="source-link" xpath="/rss/channel/link" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="id" xpath="/rss/channel/item/guid" />
      <field column="link" xpath="/rss/channel/item/link" />
      <!-- Use the RegexTransformer to strip out ads -->
      <field column="description" xpath="/rss/channel/item/description"
             regex="&lt;a.*?&lt;/a&gt;" replaceWith=""/>
      <field column="category" xpath="/rss/channel/item/category" />
      <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
      <field column="pubDate" xpath="/rss/channel/item/pubDate"
             dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
    </entity>
    <entity name="nytWorld"
            pk="link"
            url="http://feeds.nytimes.com/nyt/rss/World"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            dataSource="rss"
            transformer="RegexTransformer,DateFormatTransformer">
      <field column="source" xpath="/rss/channel/title" commonField="true" />
      <field column="source-link" xpath="/rss/channel/link" commonField="true" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="id" xpath="/rss/channel/item/guid" />
      <field column="link" xpath="/rss/channel/item/link" />
      <!-- Use the RegexTransformer to strip out ads -->
      <field column="description" xpath="/rss/channel/item/description"
             regex="&lt;a.*?&lt;/a&gt;" replaceWith=""/>
      <field column="category" xpath="/rss/channel/item/category" />
      <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
      <field column="pubDate" xpath="/rss/channel/item/pubDate"
             dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
    </entity>
  </document>
</dataConfig>
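
If you want to sanity-check what the two transformers in that config should do to an item before running an import, here is a minimal Python sketch. It is not Solr code: the function names and the sample description string are made up for illustration, and the strptime pattern is my own translation of the Java pattern "EEE, dd MMM yyyy HH:mm:ss Z".

```python
import re
from datetime import datetime, timezone

# Unescaped form of the regex attribute in the config: &lt;a.*?&lt;/a&gt;
AD_LINK = re.compile(r"<a.*?</a>")


def strip_ad_links(description: str) -> str:
    """Remove every <a ...>...</a> span, mimicking regex + replaceWith=""."""
    return AD_LINK.sub("", description)


def parse_pub_date(value: str) -> datetime:
    """Parse an RSS pubDate such as 'Sun, 18 May 2008 11:23:11 +0000'."""
    # Assumed Python equivalent of the Java pattern used in dateTimeFormat.
    return datetime.strptime(value, "%a, %d %b %Y %H:%M:%S %z")


if __name__ == "__main__":
    # Hypothetical feed text, only here to exercise the two helpers.
    sample = 'Team wins final.<a href="http://example.com/ad">Sponsored</a> More inside.'
    print(strip_ad_links(sample))
    print(parse_pub_date("Sun, 18 May 2008 11:23:11 +0000").isoformat())
```

Note that the pattern is non-greedy, so each link is removed separately and the text between two links survives.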

Then in my browser: 
http://localhost:8983/solr/dataimport?command=full-import&clean=true
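
The same request can be scripted instead of pasted into a browser. A rough Python sketch, assuming the handler is registered at /dataimport on a local Solr as in the URL above; the helper names are hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base="http://localhost:8983/solr", **params):
    """Build a DataImportHandler request URL from keyword parameters."""
    return f"{base}/dataimport?{urlencode(params)}"


def run_full_import(base="http://localhost:8983/solr"):
    """Trigger a full import and return Solr's raw response body.

    Assumes a Solr instance is actually listening at `base`.
    """
    url = dataimport_url(base, command="full-import", clean="true")
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(dataimport_url(command="full-import", clean="true"))
```

Calling dataimport_url(command="status") builds the matching status request, which is handy for polling until the import finishes.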

Then on the command line in Mahout home:
> mvn dependency:copy-dependencies
> cd target/dependency
> java -cp "*" org.apache.mahout.utils.vectors.lucene.Driver \
>     --dir [path to index]/data/index/ \
>     --output ./solr-clust-n2/part-out.vec \
>     --field desc-clustering --idField id \
>     --dictOut ./solr-clust-n2/dictionary.txt --norm 2
> java -Xmx1024M -cp "*" org.apache.mahout.clustering.kmeans.KMeansDriver \
>     --input ./solr-clust-n2/part-out.vec \
>     --clusters ./solr-clust-n2/out/clusters \
>     --output ./solr-clust-n2/out/ \
>     --distance org.apache.mahout.common.distance.CosineDistanceMeasure \
>     --convergence 0.001 --overwrite --k 25
> java -Xmx1024M -cp "*" org.apache.mahout.utils.clustering.ClusterDumper \
>     --seqFileDir ./solr-clust-n2/out/clusters-2 \
>     --dictionary ./solr-clust-n2/dictionary.txt \
>     --substring 100 --pointsDir ./solr-clust-n2/out/points/
or:
> java -Xmx1024M -cp "*" org.apache.mahout.utils.vectors.lucene.ClusterLabels \
>     --dir [path to index]/data/index/ --field description --idField id \
>     --seqFileDir ./solr-clust-n2/out/clusters-2 \
>     --pointsDir ./solr-clust-n2/out/points/ \
>     --minClusterSize 5 --maxLabels 10
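
For anyone wondering what "--norm 2" and the cosine distance measure amount to numerically, here is a small Python sketch. It is not Mahout's code: representing sparse vectors as term-to-weight dicts, the function names, and returning 1.0 for an all-zero vector are my assumptions for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # nothing to scale in an all-zero vector
    return {term: w / norm for term, w in vec.items()}


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up term weights for two documents.
    sports = {"goal": 3.0, "match": 4.0}
    world = {"election": 2.0, "match": 1.0}
    print(l2_normalize(sports))
    print(cosine_distance(sports, world))
```

Cosine distance ignores vector length, so normalizing first does not change the distances; it mainly keeps long and short descriptions on the same scale.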


-Grant
