On Fri, Apr 30, 2010 at 10:40 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:

> Hi Grant,
>
> You are probably right.
> What I wanted is to use my Mahout setup to extract topics from a single
> document.
> So, maybe in popular terms, I am trying to do topic extraction via document
> clustering.
> Does it make sense to try to split a doc into sub-docs so that I leverage
> the clustering algorithm and thus find the topics that appear to be the key
> ones for the document?
>

Have you heard of LDA? (It's in Mahout.) Or are you trying to do something different for topic extraction?

>
> Best regards,
> Bogdan
>
> On Fri, Apr 30, 2010 at 6:18 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> > This strikes me a little bit as an XY problem:
> > http://people.apache.org/~hossman/#xyproblem
> >
> > Perhaps it would be helpful if you could back up a little and describe the
> > higher-level problem you are trying to solve. You certainly can split up
> > your documents and then cluster them, but I'm not sure that is actually
> > going to give you what you need.
> >
> > Cheers,
> > Grant
> >
> > On Apr 30, 2010, at 5:29 AM, Bogdan Vatkov wrote:
> >
> > > Hi,
> > >
> > > I would like to run some clustering on a single document, but I want
> > > multiple clusters to be extracted.
> > > I guess I have to find a way to split the doc into multiple docs / input
> > > vectors, but I am wondering if there are any best practices on how to do
> > > the split.
> > > Should I derive vectors based on sentences or paragraphs? Is there a
> > > paragraph boundary detection tool around?
> > > Any recommendations will be appreciated.
> > >
> > > Best regards,
> > > Bogdan
> > >
>
> --
> Best regards,
> Bogdan
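
For reference, the sentence-or-paragraph split asked about in the original message can be sketched in a few lines. This is only a rough illustration in plain Python, not anything Mahout provides: the function names `split_paragraphs` and `split_sentences` are made up here, paragraphs are assumed to be separated by blank lines, and the sentence rule is a naive punctuation heuristic that will trip over abbreviations.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs, assuming blank lines mark the boundaries."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse hard line wraps so each paragraph becomes one string.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def split_sentences(paragraph):
    """Naive split: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


if __name__ == "__main__":
    doc = "First para line one.\nStill first.\n\nSecond para! Is it?\n\n\nThird."
    for number, paragraph in enumerate(split_paragraphs(doc), start=1):
        print(number, split_sentences(paragraph))
```

Each resulting paragraph (or sentence) could then be treated as its own small document when building input vectors. Whether that granularity gives useful clusters is a separate question, as noted earlier in the thread.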