We are on Nutch 2.3.1 and use it to crawl our websites.
One of our main goals is to get all the PDFs on our websites crawled. Links on
the different websites look like this: https://assets0.mysite.com/asset /DB_product.pdf
I tried different things:
In the configuration, I removed every occurrence of pdf from regex-urlfilter.txt
and added the download URL, added parse-tika to the plugins in nutch-site.xml,
added application/pdf to http-accept in default-site.xml, added pdf to
But still no PDF link is being fetched.
<plugin id="parse-tika" />
<description>Value of the "Accept" request header field.
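As a rough sketch, the nutch-site.xml overrides described above might look like the following. The plugin list and the Accept value are assumptions for illustration only, not the actual files from this setup. `plugin.includes` and `http.accept` are the standard Nutch property names, and overrides normally go in nutch-site.xml rather than in the default file. The regex-urlfilter.txt change is separate and not shown here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed plugin list: the relevant part is that parse-tika is
       matched by the regex so that PDF content can be parsed. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>

  <!-- Assumed Accept header value with application/pdf added explicitly. -->
  <property>
    <name>http.accept</name>
    <value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf,*/*;q=0.8</value>
    <description>Value of the "Accept" request header field.</description>
  </property>
</configuration>
```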
Is there anything else I have to configure?