mos wrote:
The problem at www.gildemeister.com is the use of JavaScript for link
generation.
That's the reason why nutch can't find the other pages (the links are
invisible).
Two ideas:
- You need something like a sitemap that links the other main pages.
  If it's not available right now, you should try to generate it
  (e.g. from the Apache log file).
- Enhance the Nutch HTML parser so that it can interpret the JavaScript links.
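
For the first idea, here is a minimal sketch of building a plain link page from an Apache access log, so the crawler gets static links it can follow. It assumes the log is in Common or Combined Log Format; the function names and the example.com base URL are made up for illustration and are not part of Nutch.

```python
import re
from html import escape

# Matches the request and status fields of a Common/Combined Log Format line,
# e.g.: "GET /en/products.htm HTTP/1.1" 200 2326
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3}) ')

def paths_from_log(lines):
    """Return the sorted, de-duplicated paths that were served with status 200."""
    paths = set()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group(2) == "200":
            paths.add(m.group(1))
    return sorted(paths)

def sitemap_html(base_url, paths):
    """Build a plain HTML page with one static <a href> per path."""
    items = "\n".join(
        f'<li><a href="{escape(base_url + p, quote=True)}">{escape(p)}</a></li>'
        for p in paths
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```

You would feed it the lines of the access log, write the result to a file on the site, and add that page's URL to the crawl's seed list. In practice you probably also want to filter out images, stylesheets and similar requests.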

You can try activating parse-js - it can extract JavaScript snippets embedded in HTML actions, and figure out the links. It works reasonably well, at least most of the time... ;-)
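
As far as I know, activating it means adding parse-js to the plugin.includes property in your Nutch configuration, but check the defaults shipped with your version. To give a rough idea of what this kind of link extraction involves, here is a small sketch that pulls URL-like string literals out of a JavaScript snippet. It only illustrates the general heuristic and is not the plugin's actual code; the function name and both patterns are assumptions made for the example.

```python
import re
from urllib.parse import urljoin

# Quoted string literals inside a JavaScript snippet (single or double quotes).
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")
# Heuristic: an absolute http(s) URL, or a path ending in a common page extension.
LOOKS_LIKE_URL = re.compile(r"^(https?://\S+|[\w./-]+\.(html?|php|jsp|asp))$", re.I)

def links_from_js(snippet, base_url):
    """Return absolute URLs for string literals in `snippet` that look like links."""
    links = []
    for _quote, literal in STRING_LITERAL.findall(snippet):
        if LOOKS_LIKE_URL.match(literal):
            # Resolve relative paths against the page the snippet came from.
            links.append(urljoin(base_url, literal))
    return links
```

A heuristic like this will both miss links that are assembled at runtime and occasionally pick up strings that are not links at all, which is why such extraction only works most of the time.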

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

