Hi Grant,

You are probably right. What I want is to use my Mahout setup to extract topics from a single document. So, in popular terms, I am trying to do topic extraction via document clustering. Does it make sense to split a document into sub-documents so that I can leverage the clustering algorithm and thereby find the key topics of the document?
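To make the idea of splitting concrete, here is a rough sketch of what I have in mind. It is plain Python and does not use any Mahout API. It assumes paragraphs are separated by blank lines, and the names split_paragraphs and term_counts are made up for illustration only. Each paragraph becomes its own small "document" with a crude bag-of-words count that could later be turned into an input vector.

```python
import re
from collections import Counter


def split_paragraphs(text):
    """Split text into paragraphs, assuming blank lines mark the boundaries."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def term_counts(paragraph):
    """Lowercased word counts for one paragraph (a crude bag-of-words)."""
    return Counter(re.findall(r"[a-z0-9]+", paragraph.lower()))


# Hypothetical sample document with three paragraphs.
doc = (
    "Mahout runs clustering jobs.\nIt scales well.\n\n"
    "Topics describe a document.\n\n\n"
    "Clusters group similar vectors."
)

sub_docs = split_paragraphs(doc)
vectors = [term_counts(p) for p in sub_docs]

for i, (p, v) in enumerate(zip(sub_docs, vectors)):
    print(i, len(p.split()), "words,", len(v), "distinct terms")
```

Splitting on sentences instead would only change the regular expression in split_paragraphs, although very short sub-documents would probably give sparse vectors that are hard to cluster.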
Best regards,
Bogdan

On Fri, Apr 30, 2010 at 6:18 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> This strike me a little bit as an XY problem:
> http://people.apache.org/~hossman/#xyproblem
>
> Perhaps it would be helpful if you could back up a little and describe the
> higher level problem you are trying to solve. You certainly can split up
> your documents and then cluster them, but I'm not sure that is actually
> going to give you what you need.
>
> Cheers,
> Grant
>
> On Apr 30, 2010, at 5:29 AM, Bogdan Vatkov wrote:
>
> > Hi,
> >
> > I would like to run some clustering for a single document but then I want
> > that multiple clusters are extracted.
> > I guess I have to find a way to split the doc into multiple docs / input
> > vectors but I am wondering if there are any best practices on how to do the
> > split then
> > Should I derive vectors based on sentences or paragraphs? Is there a
> > paragraph boundary detection tool around?
> > Any recommendations will be appreciated.
> >
> > Best regards,
> > Bogdan

--
Best regards,
Bogdan