On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:
> I've managed to get k-means clustering working, but I agree it would be very
> nice to have an end-to-end example that would allow others to get up to
> speed quickly. I think the largest holes here are related to the vacuum of a
> corpus of text into the Lucene index and the presentation of a
> human-readable display of the results. It might be interesting to also
> calculate and include some metrics such as the F-measure (in cases where we
> have a reference categorization) and scatter score (in cases where we
> don't).
>
> The existing LDA example would be a useful starting point. It slurps
> in the Reuters-21578
> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
> converts it to text, loads it into a Lucene index, extracts vectors from the
> lucene index and runs LDA upon them.
>
> This example uses the lucene benchmark utilities for the input to text
> conversion and lucene loading. The benchmark utilities code is readable but
> complex. It would be very nice to have a simple piece of code to handle the
> creation of the Lucene index that others can easily build upon to respond
> to their existing corpus.
>
+1.
I've also got this working for a bunch of RSS feeds using Solr's
DataImportHandler and the commands below.
In Solr, I set up the DataImportHandler with something like:
<dataConfig>
<dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
<document>
<!-- New York Times Sports feed -->
<entity name="nytSportsFeed"
pk="link"
url="http://feeds1.nytimes.com/nyt/rss/Sports"
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/channel/item"
dataSource="rss"
transformer="RegexTransformer,DateFormatTransformer">
<field column="source" xpath="/rss/channel/title"
commonField="true" />
<field column="source-link" xpath="/rss/channel/link"
commonField="true" />
<field column="title" xpath="/rss/channel/item/title" />
<field column="id" xpath="/rss/channel/item/guid" />
<field column="link" xpath="/rss/channel/item/link" />
<!-- Use the RegexTransformer to strip out ads -->
<field column="description"
xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;"
replaceWith=""/>
<field column="category"
xpath="/rss/channel/item/category" />
<!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
<field column="pubDate" xpath="/rss/channel/item/pubDate"
dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
</entity>
<entity name="nytWorld"
pk="link"
url="http://feeds.nytimes.com/nyt/rss/World"
processor="XPathEntityProcessor"
forEach="/rss/channel | /rss/channel/item"
dataSource="rss"
transformer="RegexTransformer,DateFormatTransformer">
<field column="source" xpath="/rss/channel/title"
commonField="true" />
<field column="source-link" xpath="/rss/channel/link"
commonField="true" />
<field column="title" xpath="/rss/channel/item/title" />
<field column="id" xpath="/rss/channel/item/guid" />
<field column="link" xpath="/rss/channel/item/link" />
<!-- Use the RegexTransformer to strip out ads -->
<field column="description"
xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;"
replaceWith=""/>
<field column="category"
xpath="/rss/channel/item/category" />
<!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
<field column="pubDate" xpath="/rss/channel/item/pubDate"
dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
</entity>
</document>
</dataConfig>
Then in my browser:
http://localhost:8983/solr/dataimport?command=full-import&clean=true
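The same request can be issued from a script instead of a browser. This is only a minimal sketch in Python, assuming Solr is listening on the host and port shown above and the handler is registered at /dataimport; the helper names dih_url and run_import are made up for illustration and are not part of Solr:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: same Solr base URL as in the browser step above.
SOLR_BASE = "http://localhost:8983/solr"


def dih_url(command, **params):
    """Build a DataImportHandler request URL (hypothetical helper)."""
    query = urlencode({"command": command, **params})
    return f"{SOLR_BASE}/dataimport?{query}"


def run_import():
    """Request a full import with clean=true and return Solr's raw response."""
    with urlopen(dih_url("full-import", clean="true")) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call run_import() once Solr is actually running.
    print(dih_url("full-import", clean="true"))
```

Calling dih_url("status") builds the URL for the handler's status command, which can be polled to see whether the import has finished.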
Then on the command line in Mahout home:
> mvn dependency:copy-dependencies
> cd target/dependency
> java -cp "*" org.apache.mahout.utils.vectors.lucene.Driver --dir [path to
> index]/data/index/ --output ./solr-clust-n2/part-out.vec --field
> desc-clustering --idField id --dictOut ./solr-clust-n2/dictionary.txt --norm 2
> java -Xmx1024M -cp "*" org.apache.mahout.clustering.kmeans.KMeansDriver
> --input ./solr-clust-n2/part-out.vec --clusters ./solr-clust-n2/out/clusters
> --output ./solr-clust-n2/out/ --distance
> org.apache.mahout.common.distance.CosineDistanceMeasure --convergence 0.001
> --overwrite --k 25
> java -Xmx1024M -cp "*" org.apache.mahout.utils.clustering.ClusterDumper
> --seqFileDir ./solr-clust-n2/out/clusters-2 --dictionary
> ./solr-clust-n2/dictionary.txt --substring 100 --pointsDir
> ./solr-clust-n2/out/points/
or:
> java -Xmx1024M -cp "*" org.apache.mahout.utils.vectors.lucene.ClusterLabels
> --dir [path to index]/data/index/ --field description --idField id
> --seqFileDir ./solr-clust-n2/out/clusters-2 --pointsDir
> ./solr-clust-n2/out/points/ --minClusterSize 5 --maxLabels 10
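For anyone wondering what --norm 2 and CosineDistanceMeasure amount to numerically, here is a rough sketch in plain Python. It is not Mahout code: the function names are made up, and it assumes (as far as I understand the implementation) that the cosine distance is 1 minus the cosine of the angle between the two vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (what --norm 2 asks for)."""
    length = math.sqrt(sum(x * x for x in vec))
    # An all-zero vector has no direction, so return it unchanged.
    return [x / length for x in vec] if length else list(vec)


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = l2_normalize([3.0, 4.0])
    doc_b = l2_normalize([4.0, 3.0])
    print(doc_a, doc_b, cosine_distance(doc_a, doc_b))
```

Cosine distance ignores vector length, so two documents with the same term proportions come out at distance 0 whether or not they were normalized first; normalizing mainly keeps long and short documents on the same scale elsewhere in the pipeline.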
-Grant