[ https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894022#comment-17894022 ]
Hiran Chaudhuri commented on NUTCH-3087:
----------------------------------------

I changed src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java by adding these lines (check line 239):
{code:java}
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in " + content.getUrl());
      for (Outlink outlink: outlinks) {
        LOG.trace(" -> {}", outlink);
      }
    }
{code}
With that I am getting the following output. Note that all eight outlinks correctly carry the userinfo.
{code:java}
2024-10-29 22:57:24,963 TRACE org.apache.nutch.util.EncodingDetector [parse-0] smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.: Choosing encoding: windows-1252 (default)
2024-10-29 22:57:24,964 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Parsing...
2024-10-29 22:57:24,997 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Meta tags for smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:
2024-10-29 22:57:24,997 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Getting text...
2024-10-29 22:57:24,998 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Getting title...
2024-10-29 22:57:24,998 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Getting links...
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] found 8 outlinks in smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.svn/ anchor: .svn/ Tue Oct 24 13:32:32 CEST 2017
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor: architektur.dia Mon Feb 22 21:30:33 CET 2010
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia%7E anchor: architektur.dia~ Mon Feb 22 21:20:42 CET 2010
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.png anchor: architektur.png Mon Feb 22 21:34:27 CET 2010
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor: deployment.dia Mon Feb 22 22:56:15 CET 2010
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia%7E anchor: deployment.dia~ Mon Feb 22 22:51:21 CET 2010
2024-10-29 22:57:24,999 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.png anchor: deployment.png Mon Feb 22 23:00:34 CET 2010
2024-10-29 22:57:25,000 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
{code}

> Nutch crawling inconsistent on URLs with userinfo
> -------------------------------------------------
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server.
> (The smb plugin pulls credentials from another configuration file, but this
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer.
> So far so good. But the outlinks that are detected are of different quality:
> some have the userinfo preserved, some are missing that information.
> Dumping the segment I can see the below data. Note that some of the outlinks
> start with smb://hi...@nas.fritz.box, while others start with
> smb://nas.fritz.box. The impact is that on the next fetch run authentication
> information is missing and the URLs cannot be fetched further.
>
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> <html><head><title>Index of
> /Documents/Hiran/Monitoring/</title></head><body><h1>Index of
> /Documents/Hiran/Monitoring/</h1><pre><a href=".svn/">.svn/ Tue Oct 24
> 13:32:32 CEST 2017</a>
> <a href="architektur.dia">architektur.dia Mon Feb 22 21:30:33 CET 2010</a>
> <a href="architektur.dia%7E">architektur.dia~ Mon Feb 22 21:20:42 CET
> 2010</a>
> <a href="architektur.png">architektur.png Mon Feb 22 21:34:27 CET 2010</a>
> <a href="deployment.dia">deployment.dia Mon Feb 22 22:56:15 CET 2010</a>
> <a href="deployment.dia%7E">deployment.dia~ Mon Feb 22 22:51:21 CET
> 2010</a>
> <a href="deployment.png">deployment.png Mon Feb 22 23:00:34 CET 2010</a>
> <a href="Monitoring+strategy.odt">Monitoring strategy.odt Fri Aug 01
> 13:38:04 CEST 2014</a>
> </pre></body></html>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor:
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor:
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor:
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor:
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
> nutch.segment.name = 20241029225708
> nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
> _fst_ = 33
> nutch.crawl.score = 1.0
> Parse Metadata:
> CharEncodingForConversion = windows-1252
> OriginalCharEncoding = windows-1252
> language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> _pst_=success(1), lastModified=0
> Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2010
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> deployment.png Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> {code}
>
> Addendum: My regex-urlfilter.txt looks like this:
> {code:java}
> # skip file: ftp: and mailto: urls
> -^(?:file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # skip version control internal files
> -(?i)\.(?:git|svn|cvs)$
> # skip recycle bin URLs
> -(?i)/%23recycle/$
> -/\.svn/
> -/\.git/
> # accept anything else
> +.
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
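
In the segment dump quoted above, the two outlinks that lost the userinfo are also the two whose `%7E` was decoded to `~`. That pattern would be consistent with a URL being rebuilt from its components at some step after parsing. The following is a minimal Java sketch of that general effect using only `java.net.URI`. It is not Nutch code and makes no claim about where the behaviour in NUTCH-3087 actually originates. The class name `UserinfoSketch`, both helper methods, and the host `nas.example` are hypothetical and chosen for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UserinfoSketch {

    // Hypothetical helper: rebuilds a URI from scheme, host, port and decoded path.
    // The userinfo argument is null, so that component is not carried over.
    static URI rebuildWithoutUserinfo(URI u) throws URISyntaxException {
        return new URI(u.getScheme(), null, u.getHost(), u.getPort(),
                u.getPath(), null, null);
    }

    // Hypothetical helper: the same rebuild, but passing the userinfo along explicitly.
    static URI rebuildWithUserinfo(URI u) throws URISyntaxException {
        return new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
                u.getPath(), null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Assumed example base URL with a userinfo component.
        URI base = URI.create("smb://user@nas.example/Documents/");

        // Resolving a relative link keeps the authority of the base, userinfo included.
        URI link = base.resolve("architektur.dia%7E");
        System.out.println("resolved:         " + link);

        // getPath() returns the decoded path, so %7E becomes ~ in both rebuilt URIs.
        System.out.println("without userinfo: " + rebuildWithoutUserinfo(link));
        System.out.println("with userinfo:    " + rebuildWithUserinfo(link));
    }
}
```

Running the sketch shows that plain relative resolution keeps the userinfo, while a component-wise rebuild keeps it only if the userinfo is passed on explicitly.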