Hey Eelco,
We would like to organize information into a hierarchical category
system. It's all general web content (HTML from the web).
Yes, there are a number of references to varying techniques on the net
(scientific papers, theoretical, practical, mind-boggling). My problem
is determining the best method, and of course implementing it with my
limited Nutch/Java abilities. I may have to outsource most of this.
Not to mention the many formats for ontologies: OWL, RDF, DAML, and
some others I'm sure I'm missing.
We would like to be able to crawl the web and categorize the pages into
buckets. We currently have a number of separate Nutch configs, each
crawling a different subset of our web sites, with multiple indexes as a
first step toward searching separate categories. The goal is to have
one crawl that scans all of the websites, indexes the content into
these predetermined buckets, and keeps everything in one master index.
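To make "buckets" concrete, here is a rough, made-up sketch in plain
Java of the kind of keyword-rule classifier I have in mind. The
category names and keywords are only illustrative, and in Nutch this
logic would presumably live in an indexing plugin that writes a
category field into each document before it goes into the master index.

import java.util.LinkedHashMap;
import java.util.Map;

public class BucketClassifier {

    // Category name -> keywords that vote for it (illustrative only).
    private final Map<String, String[]> rules = new LinkedHashMap<String, String[]>();

    public BucketClassifier() {
        rules.put("sports",  new String[] {"football", "league", "score"});
        rules.put("finance", new String[] {"stock", "market", "earnings"});
        rules.put("tech",    new String[] {"software", "server", "code"});
    }

    // Returns the category whose keywords occur most often, or "other".
    public String classify(String pageText) {
        String text = pageText.toLowerCase();
        String best = "other";
        int bestScore = 0;
        for (Map.Entry<String, String[]> rule : rules.entrySet()) {
            int score = 0;
            for (String keyword : rule.getValue()) {
                int idx = 0;
                while ((idx = text.indexOf(keyword, idx)) != -1) {
                    score++;
                    idx += keyword.length();
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = rule.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BucketClassifier classifier = new BucketClassifier();
        // Example page text; should land in the "finance" bucket.
        System.out.println(classifier.classify("Stock market earnings were up today."));
    }
}

Something more serious (e.g. Naive Bayes or a trained hierarchical
classifier) could replace the keyword rules later without changing
where the category field gets written.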
If there are any groups out there that handle this, I would be more
than happy to discuss techniques and possible outsourcing.
Chad
Eelco Lempsink wrote:
On 5-dec-2006, at 7:01, chad savage wrote:
I'm doing some research on how to classify documents into pre-defined
categories.
On the basis of...? The technique that's most appropriate depends on
the type of documents and the type of categories. For instance, are
the documents structured (e.g. all XML using a common definition) or
unstructured data (HTML from the web)? Are you looking to place
documents in a large hierarchical category system, or is it a simple
binary decision (e.g. 'Spam' or 'No spam')?
If you know what you want and what it's called, it should be relatively
easy to find information and scientific papers about it.
--
Regards,
Eelco Lempsink