I should note that I am still validating the quality of the results, and that the DIH config below is just a sample of all the feeds I'm using.
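
For the validation itself, one option is the F-measure Drew mentions below, for cases where a reference categorization exists. Here is a minimal Python sketch of one common definition: for each reference class take the best F1 over all clusters, then weight by class size. The function name and the dict-based inputs are hypothetical and not part of Mahout; it assumes you have already pulled doc id -> label and doc id -> cluster id out of your own data.

```python
from collections import Counter


def clustering_f_measure(reference, assigned):
    """Class-size-weighted best-match F-measure for a clustering.

    reference: dict mapping doc id -> reference category label
    assigned:  dict mapping doc id -> cluster id
    Only doc ids present in both dicts are scored.
    """
    ids = [doc for doc in reference if doc in assigned]
    if not ids:
        raise ValueError("no documents in common between the two mappings")

    class_sizes = Counter(reference[doc] for doc in ids)
    cluster_sizes = Counter(assigned[doc] for doc in ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter((reference[doc], assigned[doc]) for doc in ids)

    # Best F1 found so far for each reference class.
    best = {cls: 0.0 for cls in class_sizes}
    for (cls, clu), n in overlap.items():
        precision = n / cluster_sizes[clu]
        recall = n / class_sizes[cls]
        f1 = 2 * precision * recall / (precision + recall)
        best[cls] = max(best[cls], f1)

    total = len(ids)
    return sum(class_sizes[cls] / total * best[cls] for cls in class_sizes)


if __name__ == "__main__":
    # Toy data only, not real results.
    ref = {"a": "sports", "b": "sports", "c": "world", "d": "world"}
    clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
    print(clustering_f_measure(ref, clusters))
```

A score of 1.0 means every reference class is matched exactly by some cluster; lumping everything into one cluster pulls the score down because precision drops.
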

On Jan 2, 2010, at 3:56 PM, Grant Ingersoll wrote:

>
> On Jan 2, 2010, at 3:11 PM, Drew Farris wrote:
>
>> I've managed to get k-means clustering working, but I agree it would be very
>> nice to have an end-to-end example that would allow others to get up to
>> speed quickly. I think the largest holes here are related to the vacuum of a
>> corpus of text into the Lucene index and the presentation of a
>> human-readable display of the results. It might be interesting to also
>> calculate and include some metrics such as the F-measure (in cases where we
>> have a reference categorization) and scatter score (in cases where we
>> don't).
>>
>> The existing LDA example would be a useful starting point. It slurps
>> in the Reuters-21578
>> corpus <http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html>,
>> converts it to text, loads it into a Lucene index, extracts vectors from the
>> lucene index and runs LDA upon them.
>>
>> This example uses the lucene benchmark utilities for the input to text
>> conversion and lucene loading. The benchmark utilities code is readable but
>> complex. It would be very nice to have a simple piece of code to handle the
>> creation of the Lucene index that others can easily build upon to respond
>> to their existing corpus.
>>
>
>
> +1.
>
> I've also got this working for a bunch of RSS feeds using Solr's
> DataImportHandler and the following commands:
>
> In Solr, I set up the DataImportHandler with something like:
> <dataConfig>
>
>   <dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
>   <document>
>     <!-- New York Times Sports feed -->
>     <entity name="nytSportsFeed"
>             pk="link"
>             url="http://feeds1.nytimes.com/nyt/rss/Sports"
>             processor="XPathEntityProcessor"
>             forEach="/rss/channel | /rss/channel/item"
>             dataSource="rss"
>             transformer="RegexTransformer,DateFormatTransformer">
>       <field column="source" xpath="/rss/channel/title"
>              commonField="true" />
>       <field column="source-link" xpath="/rss/channel/link"
>              commonField="true" />
>       <field column="title" xpath="/rss/channel/item/title" />
>       <field column="id" xpath="/rss/channel/item/guid" />
>       <field column="link" xpath="/rss/channel/item/link" />
>       <!-- Use the RegexTransformer to strip out ads -->
>       <field column="description"
>              xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a>"
>              replaceWith=""/>
>       <field column="category"
>              xpath="/rss/channel/item/category" />
>       <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>       <field column="pubDate" xpath="/rss/channel/item/pubDate"
>              dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
>     </entity>
>     <entity name="nytWorld"
>             pk="link"
>             url="http://feeds.nytimes.com/nyt/rss/World"
>             processor="XPathEntityProcessor"
>             forEach="/rss/channel | /rss/channel/item"
>             dataSource="rss"
>             transformer="RegexTransformer,DateFormatTransformer">
>       <field column="source" xpath="/rss/channel/title"
>              commonField="true" />
>       <field column="source-link" xpath="/rss/channel/link"
>              commonField="true" />
>       <field column="title" xpath="/rss/channel/item/title" />
>       <field column="id" xpath="/rss/channel/item/guid" />
>       <field column="link" xpath="/rss/channel/item/link" />
>       <!-- Use the RegexTransformer to strip out ads -->
>       <field column="description"
>              xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a>"
>              replaceWith=""/>
>       <field column="category"
>              xpath="/rss/channel/item/category" />
>       <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>       <field column="pubDate" xpath="/rss/channel/item/pubDate"
>              dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
>     </entity>
>   </document>
>
> </dataConfig>
>
> Then in my browser:
> http://localhost:8983/solr/dataimport?command=full-import&clean=true
>
> Then on the command line in Mahout home:
>> mvn dependency:copy-dependencies
>> cd target/dependency
>> java -cp "*" org.apache.mahout.utils.vectors.lucene.Driver --dir [path to
>> index]/data/index/ --output ./solr-clust-n2/part-out.vec --field
>> desc-clustering --idField id --dictOut ./solr-clust-n2/dictionary.txt --norm
>> 2
>> java -Xmx1024M -cp "*" org.apache.mahout.clustering.kmeans.KMeansDriver
>> --input ./solr-clust-n2/part-out.vec --clusters ./solr-clust-n2/out/clusters
>> --output ./solr-clust-n2/out/ --distance
>> org.apache.mahout.common.distance.CosineDistanceMeasure --convergence 0.001
>> --overwrite --k 25
>> java -Xmx1024M -cp "*" org.apache.mahout.utils.clustering.ClusterDumper
>> --seqFileDir ./solr-clust-n2/out/clusters-2 --dictionary
>> ./solr-clust-n2/dictionary.txt --substring 100 --pointsDir
>> ./solr-clust-n2/out/points/
> or:
>> java -Xmx1024M -cp "*" org.apache.mahout.utils.vectors.lucene.ClusterLabels
>> --dir [path to index]/data/index/ --field description --idField id
>> --seqFileDir ./solr-clust-n2/out/clusters-2 --pointsDir
>> ./solr-clust-n2/out/points/ --minClusterSize 5 --maxLabels 10
>
>
> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
