On Fri, Apr 30, 2010 at 10:40 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:

> Hi Grant,
>
> You are probably right.
> What I want is to use my Mahout setup to extract topics from a single
> document.
> So, in popular terms, I am trying to do topic extraction via document
> clustering.
> Does it make sense to split a doc into sub-docs so that I can leverage
> the clustering algorithm and thus find the topics that appear to be the
> key ones for the document?
>
Have you heard of LDA (Latent Dirichlet Allocation)? It's in Mahout. Or are
you trying to do something different for topic extraction?
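
If you do want to try the paragraph split asked about further down the
thread, it could be sketched roughly as below. This is only a minimal
illustration in plain Python, not Mahout code: it assumes paragraphs are
separated by blank lines, and the helper names split_paragraphs and
term_counts are made up for this example. Each paragraph becomes one
"sub-document" with a simple bag-of-words count that you would then turn
into whatever vector format your clustering job expects.

```python
import re
from collections import Counter


def split_paragraphs(text):
    """Split text into paragraphs on one or more blank lines (an assumption
    about the input; real documents may need smarter boundary detection)."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse hard line wraps so each paragraph is a single line of text.
    return [" ".join(p.split()) for p in parts if p.strip()]


def term_counts(paragraph):
    """Lowercased bag-of-words counts for one paragraph (one sub-document)."""
    return Counter(re.findall(r"[a-z0-9]+", paragraph.lower()))


doc = (
    "Mahout runs on Hadoop.\nIt scales out.\n\n"
    "LDA finds topics.\n\n\n"
    "Clustering groups vectors."
)
sub_docs = split_paragraphs(doc)
vectors = [term_counts(p) for p in sub_docs]
```

Splitting on sentences instead would give many more, much sparser vectors,
so paragraphs are probably the easier starting point for a single document.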

>
> Best regards,
> Bogdan
>
> On Fri, Apr 30, 2010 at 6:18 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> > This strikes me a little bit as an XY problem:
> > http://people.apache.org/~hossman/#xyproblem
> >
> > Perhaps it would be helpful if you could back up a little and describe
> > the higher level problem you are trying to solve.  You certainly can
> > split up your documents and then cluster them, but I'm not sure that is
> > actually going to give you what you need.
> >
> > Cheers,
> > Grant
> >
> > On Apr 30, 2010, at 5:29 AM, Bogdan Vatkov wrote:
> >
> > > Hi,
> > >
> > > I would like to run clustering on a single document, but I want
> > > multiple clusters to be extracted from it.
> > > I guess I have to find a way to split the doc into multiple docs /
> > > input vectors, but I am wondering whether there are any best practices
> > > on how to do the split.
> > > Should I derive vectors based on sentences or paragraphs? Is there a
> > > paragraph boundary detection tool around?
> > > Any recommendations will be appreciated.
> > >
> > > Best regards,
> > > Bogdan
> >
> >
> >
>
>
> --
> Best regards,
> Bogdan
>
