sebastian-nagel commented on pull request #564: URL: https://github.com/apache/nutch/pull/564#issuecomment-767435797
Hi @lewismc,

the template rules of urlfilter-suffix already include suffixes that typically indicate image, video, package and binary formats. But the suffix set dates back to 2006 (34067f7), so this PR adds suffixes for more recent formats: e.g., if .png and .jpg are excluded, .webp should be excluded as well.

Whether or not a crawl should include multimedia content depends on the use case. By default, urlfilter-suffix is not included in plugin.includes. Even though Tika can extract text from many multimedia and package formats, there is a tradeoff between bandwidth and outcome: while HTML pages are usually small (even smaller with compression during HTTP transfer), crawling multimedia files or software packages can consume a substantial part of your network bandwidth.

But again: it is up to the users to enable the suffix filter and to adapt the rules to their needs.
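To illustrate the idea of suffix-based exclusion, here is a minimal sketch in Python. It is not the urlfilter-suffix implementation and does not use its rule file format; the names `EXCLUDED_SUFFIXES` and `is_excluded` are hypothetical, and the three suffixes are only the ones mentioned in the comment. It assumes that matching is done case-insensitively on the URL path only, so query strings are ignored.

```python
from urllib.parse import urlsplit

# Hypothetical suffix set for illustration only; these are not the
# actual template rules shipped with the plugin.
EXCLUDED_SUFFIXES = (".png", ".jpg", ".webp")


def is_excluded(url: str, suffixes=EXCLUDED_SUFFIXES) -> bool:
    """Return True if the URL path ends with one of the given suffixes.

    Assumption: matching is case-insensitive and looks at the path
    component only, not at the query string or fragment.
    """
    path = urlsplit(url).path.lower()
    return path.endswith(tuple(suffixes))
```

With this sketch, `is_excluded("https://example.org/img/logo.WEBP")` is true, while a URL such as `https://example.org/page?file=a.png` is kept because the suffix appears only in the query string.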

