Run named entity recognition (NER) on both documents and then compare the entities they contain to see how similar the documents are. Can you tell me how well or poorly structured your documents are? You could define a distance metric over these named entities.
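As a rough illustration of such a distance, here is a minimal sketch that takes the Jaccard distance between the two documents' entity sets. The function names are made up for this example, and `extract_entities` is only a crude placeholder (it treats runs of capitalized words as entities, so it will also pick up sentence-initial words); in practice you would swap in a real NER tagger. Jaccard is just one possible choice of metric.

```python
import re

def extract_entities(text):
    """Crude placeholder for a real NER system (an assumption for this sketch):
    treat each run of consecutive capitalized words as one entity."""
    candidates = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)
    # Lowercase so that the comparison is case-insensitive.
    return {c.lower() for c in candidates}

def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|: 0.0 for identical sets, 1.0 for disjoint sets."""
    if not a and not b:
        return 0.0  # two documents without entities are treated as identical
    return 1.0 - len(a & b) / len(a | b)

def entity_distance(doc1, doc2):
    """Distance between two documents based only on their extracted entities."""
    return jaccard_distance(extract_entities(doc1), extract_entities(doc2))
```

Calling `entity_distance(text_a, text_b)` gives a value between 0 and 1, where smaller means the two documents mention more of the same entities. If the documents are poorly structured (no reliable capitalization, for example), the placeholder extractor will not be of much use and a trained tagger becomes necessary.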
Abhishek S

On Jan 3, 1:35 am, Arun <[EMAIL PROTECTED]> wrote:
> For the first part, search the web for "document similarity".
> For categorization, you can try the standard web directories such as the
> Yahoo! Directory or dmoz, although many docs may not be found there (I guess).
> I believe they have some APIs for this. Or search for "document categorization".
>
> On Jan 2, 2008 12:07 PM, ramzabean <[EMAIL PROTECTED]> wrote:
> >
> > I am working on a project to analyze web documents; there are two
> > main components that I am researching. Ideally, I want to find out
> > how related one document is to another. That comparison should be
> > based on term frequency, possibly with a lexicon word pool lookup,
> > or something along those lines. One approach I found is the
> > Rocchio method.
> >
> > For the second part, categorization, I am having a little trouble
> > working out how to do this. I want to build the categories
> > dynamically. For example, if I have 1 billion documents, I want to
> > parse those documents and come up with N categories, each with a
> > subdivision of links. Let's say 1000 categories. I was thinking I
> > could create the categories based on the DESCRIPTION/KEYWORDS meta
> > information and then use some Bayesian analysis to put more links
> > into each category?
