Andrzej, thanks.
A related question: some of the sites I crawl use https: or redirect to
https:. Nutch's default settings do not recognize https: as a valid URL
scheme. Is there a way to crawl URLs starting with "https:"?
-AJ
Andrzej Bialecki wrote:
AJ Chen wrote:
Hi Andrzej,
Thanks for the suggestion. I'm using the PDF plugin that
comes with Nutch from svn. Where can I get the unreleased
PDFBox version 0.7.2 that works for you?
http://www.pdfbox.com/dist
If you are not too familiar with the classpath setting in plugin.xml,
then it's better to just replace the old JAR with the new one, while
keeping the same name as the old JAR.
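
For illustration only, the JAR swap described above could be scripted roughly as follows. This is a minimal sketch, not part of Nutch: the function name replace_jar and the paths in the usage comment are hypothetical, and the real file names depend on your Nutch checkout and the PDFBox download.

```python
import shutil
from pathlib import Path


def replace_jar(old_jar: Path, new_jar: Path) -> Path:
    """Overwrite old_jar with the contents of new_jar, keeping old_jar's name.

    A backup copy of the original is written next to it with a ".bak"
    suffix, and the backup path is returned.
    """
    if not old_jar.is_file():
        raise FileNotFoundError(old_jar)
    if not new_jar.is_file():
        raise FileNotFoundError(new_jar)

    # Keep the original so the change can be undone.
    backup = old_jar.with_name(old_jar.name + ".bak")
    shutil.copy2(old_jar, backup)

    # Copy the new JAR over the old one. The file name stays the same,
    # so the classpath entry in plugin.xml does not need to change.
    shutil.copy2(new_jar, old_jar)
    return backup


# Hypothetical usage; substitute the real paths from your setup:
# replace_jar(Path("path/to/plugin/old-pdfbox.jar"),
#             Path("path/to/download/new-pdfbox.jar"))
```

Because the file name does not change, plugin.xml can be left untouched; to roll back, copy the ".bak" file over the JAR again.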