Hey Eelco,
We would like to organize information into a hierarchical category
system. It's all general web content (HTML from the web).
Yes, there are a number of references to varying techniques on the net
(scientific papers, theoretical, practical, mind-boggling). My problem
is determining the best method, and of course implementing it with my
limited Nutch/Java abilities. I may have to outsource most of this.
Not to mention the many formats for ontologies: OWL, RDF, DAML, and
some others I'm sure I'm missing.
We would like to be able to crawl the web and categorize the pages into
buckets. We currently have a number of separate Nutch configs, each
crawling a different subset of our web sites, with multiple indexes as a
first step toward searching separate categories. The goal is to have
one crawl that scans all of the websites, indexes the content into
these predetermined buckets, and keeps everything in one master index.
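To make "buckets" concrete, here is a rough, made-up sketch in plain
Java of the kind of keyword-rule classifier I have in mind. The
category names and keywords are only illustrative, and in Nutch this
logic would presumably live in an indexing plugin that writes a
category field into each document before it goes into the master index.

import java.util.LinkedHashMap;
import java.util.Map;

public class BucketClassifier {

    // Category name -> keywords that vote for it (illustrative only).
    private final Map<String, String[]> rules = new LinkedHashMap<String, String[]>();

    public BucketClassifier() {
        rules.put("sports",  new String[] {"football", "league", "score"});
        rules.put("finance", new String[] {"stock", "market", "earnings"});
        rules.put("tech",    new String[] {"software", "server", "code"});
    }

    // Returns the category whose keywords occur most often, or "other".
    public String classify(String pageText) {
        String text = pageText.toLowerCase();
        String best = "other";
        int bestScore = 0;
        for (Map.Entry<String, String[]> rule : rules.entrySet()) {
            int score = 0;
            for (String keyword : rule.getValue()) {
                int idx = 0;
                while ((idx = text.indexOf(keyword, idx)) != -1) {
                    score++;
                    idx += keyword.length();
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = rule.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BucketClassifier classifier = new BucketClassifier();
        // Example page text; should land in the "finance" bucket.
        System.out.println(classifier.classify("Stock market earnings were up today."));
    }
}

Something more serious (e.g. Naive Bayes or a trained hierarchical
classifier) could replace the keyword rules later without changing
where the category field gets written.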
If there are any groups out there that handle this, I would be more
than happy to discuss techniques and possible outsourcing.
Chad
Eelco Lempsink wrote:
On 5-dec-2006, at 7:01, chad savage wrote:
I'm doing some research on how to classify documents into pre-defined
categories.
On the basis of...? The technique that's most appropriate depends on
the type of documents and the type of categories. For instance, are
the documents structured (e.g. all XML using a common definition) or
unstructured data (HTML from the web)? Are you looking to place
documents in a large hierarchical category system, or is it a simple
binary decision (e.g. 'Spam' or 'No spam')?
If you know what you want and what it's called, it should be relatively
easy to find information and scientific papers about it.
--
Regards,
Eelco Lempsink