[ https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894544#comment-17894544 ]
Hiran Chaudhuri edited comment on NUTCH-3087 at 10/31/24 12:00 PM:
-------------------------------------------------------------------

urlnormalizer? Maybe that is the case. How do I find out whether/which one is active?

Aha! This is on my log when running the parse step:
{code:java}
2024-10-31 12:55:44,761 INFO org.apache.nutch.plugin.PluginManifestParser [main] Plugins: looking in: /home/hiran/NetBeansProjects/nutch/runtime/local/plugins
2024-10-31 12:55:44,833 INFO org.apache.nutch.plugin.PluginRepository [main] Plugin Auto-activation mode: [true]
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Plugins:
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Regex URL Filter (urlfilter-regex)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Html Parse Plug-in (parse-html)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] the nutch core extension points (nutch-extensionpoints)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Basic Indexing Filter (index-basic)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Anchor Indexing Filter (index-anchor)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Tika Parser Plug-in (parse-tika)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Index Static (index-static)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Top Level Domain Plugin (tld)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Regex URL Filter Framework (lib-regex-filter)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Language Identification Parser/Filter (language-identifier)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Subcollection indexing and query filter (subcollection)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] URL Meta Indexing Filter (urlmeta)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] SMB Protocol based on https://github.com/hierynomus/smbj (protocol-smb)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] More Indexing Filter (index-more)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] SolrIndexWriter (indexer-solr)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Creative Commons Plugins (creativecommons)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Replace Indexer (index-replace)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Extension-Points:
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Content Parser)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (HTML Parse Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Scoring)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Normalizer)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Publisher)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Exchange)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Protocol)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Ignore Exemption Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Index Writer)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Segment Merge Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Indexing Filter)
2024-10-31 12:55:44,897 WARN org.apache.hadoop.util.NativeCodeLoader [main] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-10-31 12:55:44,949 INFO org.apache.nutch.parse.ParseSegment [main] ParseSegment: starting
2024-10-31 12:55:44,949 INFO org.apache.nutch.parse.ParseSegment [main] ParseSegment: segment: crawl/segments/20241031125529
{code}
So it is confirmed: There are several normalizer plugins active:
 * urlnormalizer-basic
 * urlnormalizer-regex
 * urlnormalizer-pass


was (Author: hiranchaudhuri):
urlnormalizer? Maybe that is the case. How do I find out whether/which one is active?

> Nutch crawling inconsistent on URLs with userinfo
> -------------------------------------------------
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server.
> (The smb plugin pulls credentials from another configuration file, but this
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer.
> So far so good.
> But the outlinks that are detected are of different quality:
> some have the userinfo preserved, some are missing that information.
> Dumping the segment I can see the below data. Note that some of the outlinks
> start with smb://hi...@nas.fritz.box, while others start with
> smb://nas.fritz.box. The impact is that on the next fetch run authentication
> information is missing and the URLs cannot be fetched further.
>
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> <html><head><title>Index of /Documents/Hiran/Monitoring/</title></head><body><h1>Index of /Documents/Hiran/Monitoring/</h1><pre><a href=".svn/">.svn/ Tue Oct 24 13:32:32 CEST 2017</a>
> <a href="architektur.dia">architektur.dia Mon Feb 22 21:30:33 CET 2010</a>
> <a href="architektur.dia%7E">architektur.dia~ Mon Feb 22 21:20:42 CET 2010</a>
> <a href="architektur.png">architektur.png Mon Feb 22 21:34:27 CET 2010</a>
> <a href="deployment.dia">deployment.dia Mon Feb 22 22:56:15 CET 2010</a>
> <a href="deployment.dia%7E">deployment.dia~ Mon Feb 22 22:51:21 CET 2010</a>
> <a href="deployment.png">deployment.png Mon Feb 22 23:00:34 CET 2010</a>
> <a href="Monitoring+strategy.odt">Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014</a>
> </pre></body></html>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
> outlink: toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor: architektur.dia Mon Feb 22 21:30:33 CET 2010
> outlink: toUrl: smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor: architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> outlink: toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor: deployment.dia Mon Feb 22 22:56:15 CET 2010
> outlink: toUrl: smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor: deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> outlink: toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
> nutch.segment.name = 20241029225708
> nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
> _fst_ = 33
> nutch.crawl.score = 1.0
> Parse Metadata:
> CharEncodingForConversion = windows-1252
> OriginalCharEncoding = windows-1252
> language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> _pst_=success(1), lastModified=0
> Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2010
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> deployment.png Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> {code}
>
> Addendum: It is ok to have only 5 outlinks from a document with 8 anchors.
> The .svn and the two .png links are ignored. My regex-urlfilter.txt looks
> like this:
> {code:java}
> # skip file: ftp: and mailto: urls
> -^(?:file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # skip version control internal files
> -(?i)\.(?:git|svn|cvs)$
> # skip recycle bin URLs
> -(?i)/%23recycle/$
> -/\.svn/
> -/\.git/
> # accept anything else
> +.
> {code}
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
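
The addendum's count (5 outlinks from 8 anchors) can be checked with a small sketch. It assumes the filter tries the rules top to bottom, lets the first rule whose pattern is found anywhere in the URL decide (`-` rejects, `+` accepts), and rejects a URL that no rule matches. It also assumes Python's `re` module treats these particular patterns the same way Java's regex engine does. The helper names `parse_rules` and `accepts` are made up for this illustration and are not part of Nutch. The URLs are written without userinfo because it is elided in the report.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted in the issue.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(?:file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# skip version control internal files
-(?i)\.(?:git|svn|cvs)$
# skip recycle bin URLs
-(?i)/%23recycle/$
-/\.svn/
-/\.git/
# accept anything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is rejected


# Host and path as in the dump; userinfo left out (it is elided in the report).
BASE = "smb://nas.fritz.box/Documents/Hiran/Monitoring/"

# The 8 link targets from the directory listing; the two %7E hrefs are
# written with "~", as they appear in the reported outlinks.
TARGETS = [
    ".svn/",
    "architektur.dia",
    "architektur.dia~",
    "architektur.png",
    "deployment.dia",
    "deployment.dia~",
    "deployment.png",
    "Monitoring+strategy.odt",
]

rules = parse_rules(RULES_TEXT)
kept = [t for t in TARGETS if accepts(BASE + t, rules)]
dropped = [t for t in TARGETS if t not in kept]

print(len(kept), "kept:", kept)
print(len(dropped), "dropped:", dropped)
```

Under these assumptions the sketch keeps the same five targets that the dump lists as outlinks and drops `.svn/` plus the two `.png` files, so the filter explains the missing anchors but not the missing userinfo.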