[algogeeks] Re: Web content analysis (document distances / categorization)

Abhishek Thu, 03 Jan 2008 00:50:57 -0800

Run named entity recognition (NER) on both documents and see how
similar the extracted entities are.
How well or poorly structured are your documents?
You could build a distance metric on these named entities.
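
A rough sketch of such a distance, assuming the entities have already
been extracted by some NER tool and arrive as plain lists of strings.
The function name and the choice of cosine distance over entity counts
are illustrative assumptions, not the only option; the same function
works on plain term counts if you pass in tokenized words instead.

```python
import math
from collections import Counter


def cosine_distance(entities_a, entities_b):
    """Return 1 - cosine similarity between two bags of entity strings.

    Hypothetical helper: the inputs are assumed to be lists of entity
    names already produced by an NER step that is not shown here.
    """
    # Count how often each entity occurs, ignoring case.
    counts_a = Counter(e.lower() for e in entities_a)
    counts_b = Counter(e.lower() for e in entities_b)

    # Dot product over the entities the two documents share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[e] * counts_b[e] for e in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # No entities on one side: treat the documents as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

A distance near 0 means the two documents mention the same entities in
similar proportions; 1 means they share none.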


Abhishek S


On Jan 3, 1:35 am, Arun <[EMAIL PROTECTED]> wrote:
> For the first part, search the web for "document similarity".
> For categorization, you can try the standard web directories such as the
> Yahoo! Directory or DMOZ, although many documents may not be listed there
> (I guess). I believe they have some APIs for this. Or search for
> "document categorization".
>
> On Jan 2, 2008 12:07 PM, ramzabean <[EMAIL PROTECTED]> wrote:
>
>
>
> > I am working on a project to analyze web documents; there are two
> > main components that I am researching. Ideally, I want to find out
> > how closely related one document is to another. That
> > comparison should be based on term frequency, possibly with a lexicon
> > (word pool) lookup, or something along those lines. One approach I found
> > is the Rocchio method.
>
> > For the second part, categorization, I am having a little trouble working
> > out how to do this. I want to build the categories dynamically. For example,
> > if I have 1 billion documents, I want to parse those documents and
> > come up with N categories, each with a subdivision of links,
> > let's say 1000 categories. I was thinking I could create the
> > categories based on the DESCRIPTION/KEYWORDS meta information and then
> > use some Bayesian analysis to place more links into each
> > category?
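
On the Bayesian idea in the quoted question: a minimal sketch, assuming
each category is seeded with a few word lists (for example, taken from
DESCRIPTION/KEYWORDS meta tags) and new documents are then assigned to
the most probable category. The class and method names are made up for
illustration, and this is a plain multinomial naive Bayes with add-one
smoothing, not necessarily what the poster had in mind.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes over word lists (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs seen
        self.vocab = set()                       # all words seen in training

    def train(self, category, words):
        """Add one training document (a list of words) to a category."""
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocab.update(words)

    def classify(self, words):
        """Return the most probable category, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            # Log prior: share of training documents in this category.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            denom = total_words + len(self.vocab)
            for w in words:
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log((self.word_counts[category][w] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


# Usage with made-up seed keywords:
nb = NaiveBayes()
nb.train("sports", ["football", "league", "goal"])
nb.train("finance", ["stock", "market", "bank"])
label = nb.classify(["goal", "league"])
```

This only covers assigning documents to categories that already exist;
discovering the categories themselves is a separate problem.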
