[ https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894476#comment-17894476 ]
Sebastian Nagel commented on NUTCH-3087:
----------------------------------------

Which of the URL normalizers are active? For example, urlnormalizer-basic
removes the userinfo part for https, http and ftp URLs. There might be a bug
which does it as well for other schemes, in cases where other parts of the
URL are normalized. This looks like a pattern:
{{.../architektur.dia%7E}} -> {{.../architektur.dia~}}

> Nutch crawling inconsistent on URLs with userinfo
> -------------------------------------------------
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server.
> (The smb plugin pulls credentials from another configuration file, but this
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer.
> So far so good. But the outlinks that are detected are of different quality:
> some have the userinfo preserved, some are missing that information.
> Dumping the segment I can see the below data. Note that some of the outlinks
> start with smb://hi...@nas.fritz.box, while others start with
> smb://nas.fritz.box. The impact is that on the next fetch run authentication
> information is missing and the URLs cannot be fetched further.
>
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> <html><head><title>Index of
> /Documents/Hiran/Monitoring/</title></head><body><h1>Index of
> /Documents/Hiran/Monitoring/</h1><pre><a href=".svn/">.svn/ Tue Oct 24
> 13:32:32 CEST 2017</a>
> <a href="architektur.dia">architektur.dia Mon Feb 22 21:30:33 CET 2010</a>
> <a href="architektur.dia%7E">architektur.dia~ Mon Feb 22 21:20:42 CET
> 2010</a>
> <a href="architektur.png">architektur.png Mon Feb 22 21:34:27 CET 2010</a>
> <a href="deployment.dia">deployment.dia Mon Feb 22 22:56:15 CET 2010</a>
> <a href="deployment.dia%7E">deployment.dia~ Mon Feb 22 22:51:21 CET
> 2010</a>
> <a href="deployment.png">deployment.png Mon Feb 22 23:00:34 CET 2010</a>
> <a href="Monitoring+strategy.odt">Monitoring strategy.odt Fri Aug 01
> 13:38:04 CEST 2014</a>
> </pre></body></html>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor:
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor:
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor:
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor:
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
> nutch.segment.name = 20241029225708
> nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
> _fst_ = 33
> nutch.crawl.score = 1.0
> Parse Metadata:
> CharEncodingForConversion = windows-1252
> OriginalCharEncoding = windows-1252
> language = en
>
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> _pst_=success(1), lastModified=0
> Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2010
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> deployment.png Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> {code}
>
> Addendum: It is ok to have only 5 outlinks from a document with 8 anchors.
> The .svn and the two .png links are ignored. My regex-urlfilter.txt looks
> like this:
> {code:java}
> # skip file: ftp: and mailto: urls
> -^(?:file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # skip version control internal files
> -(?i)\.(?:git|svn|cvs)$
> # skip recycle bin URLs
> -(?i)/%23recycle/$
> -/\.svn/
> -/\.git/
> # accept anything else
> +.
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
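
The pattern suspected in the comment (userinfo survives when the URL is left untouched, but is lost when some other part is rewritten) can be illustrated with a small sketch. This is not the code of urlnormalizer-basic and does not claim to reproduce it. The class UserinfoSketch, both methods, the user "user" and the host nas.example are hypothetical, and the only normalization modeled is %7E -> ~. It only shows how rebuilding a URL from scheme, host and path would produce the mix of outlinks seen in the segment dump, and how rebuilding from the full authority would avoid it.

```java
import java.net.URI;

class UserinfoSketch {

    /**
     * Hypothetical lossy normalizer (an assumption, not Nutch's code):
     * returns the input unchanged when nothing needs normalizing, but
     * rebuilds the URL from scheme + host + path when the path changes,
     * which silently drops the userinfo.
     */
    static String lossyNormalize(String url) {
        URI u = URI.create(url);
        String path = u.getRawPath();
        String normalizedPath = path.replace("%7E", "~");
        if (normalizedPath.equals(path)) {
            return url; // untouched, so the userinfo is still there
        }
        // getHost() excludes the userinfo, so it is lost here
        return u.getScheme() + "://" + u.getHost() + normalizedPath;
    }

    /**
     * Same normalization, but rebuilt from the raw authority, which
     * includes the userinfo (and port) when present.
     */
    static String preservingNormalize(String url) {
        URI u = URI.create(url);
        String normalizedPath = u.getRawPath().replace("%7E", "~");
        return u.getScheme() + "://" + u.getRawAuthority() + normalizedPath;
    }

    public static void main(String[] args) {
        String plain = "smb://user@nas.example/docs/architektur.dia";
        String tilde = "smb://user@nas.example/docs/architektur.dia%7E";
        System.out.println(lossyNormalize(plain));
        System.out.println(lossyNormalize(tilde));
        System.out.println(preservingNormalize(tilde));
    }
}
```

With these inputs, lossyNormalize keeps user@ for the plain URL and drops it for the %7E URL, while preservingNormalize keeps it in both cases. Whether the real normalizer behaves this way for smb: URLs would still need to be confirmed against the active normalizer configuration.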