[ 
https://issues.apache.org/jira/browse/NUTCH-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272016#comment-17272016
 ] 

ASF GitHub Bot commented on NUTCH-2845:
---------------------------------------

sebastian-nagel commented on pull request #564:
URL: https://github.com/apache/nutch/pull/564#issuecomment-767435797


   Hi @lewismc, the template rules of urlfilter-suffix already include suffixes 
typically indicating image, video, package and binary formats. But the suffix 
set dates back to 2006 (34067f7), so this PR adds suffixes for recent formats, 
e.g. if .png and .jpg are excluded, .webp should be excluded as well.
   
   Whether or not a crawl shall include multimedia content depends on the use 
case. By default, urlfilter-suffix is not included in plugin.includes. Even though 
Tika can extract text from many multimedia and package formats, there's a 
tradeoff between bandwidth and outcome. While HTML pages are usually small 
(even smaller with compression during HTTP transfer), crawling multimedia files 
or software packages can consume a substantial part of your network bandwidth.
   
   But again: it's up to the users to enable the suffix filter and to adapt the 
rules to their needs. 
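
   To illustrate the idea of suffix-based exclusion, here is a minimal sketch 
in Python. It is not the urlfilter-suffix implementation and does not use its 
rule file syntax; the names `EXCLUDED_SUFFIXES` and `is_excluded` are 
hypothetical, and matching case-insensitively against the URL path only is an 
assumption made for the example. The suffixes are the ones mentioned in this 
comment and in the issue description.

```python
from urllib.parse import urlsplit

# Hypothetical suffix set for illustration only; the actual rules live in
# the plugin's own rule file, which has its own syntax and options.
EXCLUDED_SUFFIXES = (
    ".png", ".jpg", ".webp", ".tif", ".icns",   # images
    ".apk", ".bz2", ".xz",                      # archives and packages
    ".mp4", ".webm", ".m4v", ".qt",             # videos
)


def is_excluded(url: str) -> bool:
    """Return True if the URL path ends with one of the excluded suffixes.

    Assumption: only the path is checked (query string and fragment are
    ignored) and the comparison is case-insensitive.
    """
    path = urlsplit(url).path.lower()
    # str.endswith accepts a tuple and checks each suffix in turn.
    return path.endswith(EXCLUDED_SUFFIXES)
```

   With such a set, adding a newer format like .webp next to .png and .jpg is 
a one-line change, which is essentially what the PR does for the template 
rules.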


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> Update urlfilter-suffix rules
> -----------------------------
>
>                 Key: NUTCH-2845
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2845
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, urlfilter
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.19
>
>
> The rules of urlfilter-suffix should be updated to include recent file formats 
> of
>  - images
>    .icns (Apple Icon Image Format)
>    .tif (TIFF, alternate pattern)
>    .webp (WebP)
>  - archive and software package formats
>    .apk
>    .bz2
>    .xz
>  - videos
>    .mp4
>    .webm
>    .m4v
>    .qt (QuickTime)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
