Hi,
It would be nice to use the features of Nutch instead of my own hacky
stuff. How tightly bound is Nutch to the J2EE container? Would it be a
big job to make it run with an alternative GUI? Or is the container
used for more than the GUI? I.e. do all services (crawler, etc.) run
within the container? Do they have to?
The crawler and the related tools are command-line tools.
The servlet container is only used to deliver search results, and even
there you can use the servlet that just provides XML.
You can also use the NutchBean API to integrate search into a custom
application without any servlet container.
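
A minimal sketch of what consuming the XML output from a custom, non-GUI application could look like. It uses only the JDK's DOM parser and assumes the servlet returns an RSS-style document with `item`, `title` and `link` elements; those element names, the class name `SearchXmlSketch` and the sample data are assumptions for illustration, so check them against the response of your Nutch version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: turns an RSS-style search
// result document into a list of "title -> link" strings.
public class SearchXmlSketch {

    public static List<String> parseHits(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Assumption: every hit is an <item> with <title> and <link> children.
            NodeList items = doc.getElementsByTagName("item");
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                hits.add(childText(item, "title") + " -> " + childText(item, "link"));
            }
            return hits;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse search result XML", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a real servlet response.
        String sample =
            "<rss version=\"2.0\"><channel>"
            + "<item><title>Example page</title>"
            + "<link>http://example.com/</link></item>"
            + "</channel></rss>";
        for (String hit : parseHits(sample)) {
            System.out.println(hit);
        }
    }
}
```

In a real application the XML string would come from an HTTP request to the search servlet instead of the hard-coded sample.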
It would be nice to automatically detect the content "frame" by
analyzing the DOM tree of the pages on a site. Is there such a feature
in Nutch, has one been contributed, or is one publicly available in
some other project?
I'm not sure I clearly understand your question here.
Nutch has an HTML parser plugin that extracts only the content from an
HTML page.
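
To illustrate the general idea of pulling only the text content out of a page's DOM tree, here is a rough sketch. It is not the code of the Nutch parser plugin: it uses the JDK's XML parser, so it assumes well-formed XHTML input, and the class name `ContentTextSketch` is made up for the example. It simply walks the tree, collects text nodes and skips `script` and `style` elements; it does not try to detect a content "frame".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the visible text of a
// well-formed XHTML document by walking its DOM tree.
public class ContentTextSketch {

    public static String extractText(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            // Collapse runs of whitespace left over from markup and line breaks.
            return out.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk: append text nodes, skip script and style subtrees.
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase();
            if (name.equals("script") || name.equals("style")) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><head><style>p { color: red; }</style></head>"
            + "<body><p>Hello <b>world</b></p><script>track();</script></body></html>";
        System.out.println(extractText(page));
    }
}
```

Real-world HTML is rarely well-formed, so a lenient HTML parser would be needed in front of this; detecting which part of the tree is the main content would need additional heuristics on top.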
Stefan