[ https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894547#comment-17894547 ]
Hiran Chaudhuri commented on NUTCH-3087:
----------------------------------------

With logging turned on to the highest I am now getting this. Not very talkative about the impact though...
{code:java}
2024-10-31 13:05:41,713 DEBUG org.apache.nutch.parse.ParseUtil [LocalJobRunner Map Task Executor #0] Parsing [smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/] with [org.apache.nutch.parse.html.HtmlParser@2422ff89]
2024-10-31 13:05:41,729 TRACE org.apache.nutch.util.EncodingDetector [parse-0] smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.: Choosing encoding: windows-1252 (default)
2024-10-31 13:05:41,729 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Parsing...
2024-10-31 13:05:41,762 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Meta tags for smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:
2024-10-31 13:05:41,762 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Getting text...
2024-10-31 13:05:41,763 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Getting title...
2024-10-31 13:05:41,763 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] Getting links...
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] found 8 outlinks in smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.svn/ anchor: .svn/ Tue Oct 24 13:32:32 CEST 2017
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor: architektur.dia Mon Feb 22 21:30:33 CET 2010
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia%7E anchor: architektur.dia~ Mon Feb 22 21:20:42 CET 2010
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.png anchor: architektur.png Mon Feb 22 21:34:27 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor: deployment.dia Mon Feb 22 22:56:15 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia%7E anchor: deployment.dia~ Mon Feb 22 22:51:21 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.png anchor: deployment.png Mon Feb 22 23:00:34 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
2024-10-31 13:05:41,772 INFO org.apache.nutch.crawl.SignatureFactory [LocalJobRunner Map Task Executor #0] Using Signature impl: org.apache.nutch.crawl.MD5Signature
2024-10-31 13:05:41,773 INFO org.apache.nutch.parse.ParseSegment [LocalJobRunner Map Task Executor #0] Parsed (479ms): smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
2024-10-31 13:05:41,797 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl [pool-6-thread-1] JobTracker metrics system already initialized!
2024-10-31 13:05:41,829 INFO org.apache.nutch.urlfilter.regex.RegexURLFilter [pool-6-thread-1] Reading urlfilter-regex rules file: regex-urlfilter.txt
2024-10-31 13:05:41,830 INFO org.apache.nutch.urlfilter.api.RegexURLFilterBase [pool-6-thread-1] Read 9 regex rules (org.apache.nutch.urlfilter.regex.RegexURLFilter)
2024-10-31 13:05:41,830 DEBUG org.apache.nutch.util.ObjectCache [pool-6-thread-1] No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-rbf-default.xml, hdfs-site.xml, hdfs-rbf-site.xml, file:/tmp/hadoop-hiran/mapred/local/localRunner/hiran/job_local261962044_0001/job_local261962044_0001.xml, instantiating a new object cache
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 33 (!) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 35 (#) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 36 ($) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 37 (%) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 38 (&) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 39 (') not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 40 (() not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 41 ()) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 42 (*) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 43 (+) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 44 (,) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 47 (/) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 58 (:) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 59 (;) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 61 (=) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 63 (?) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 64 (@) not handled as escaped or unescaped
2024-10-31 13:05:41,833 DEBUG org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] Character 92 (\) not handled as escaped or unescaped
2024-10-31 13:05:41,867 INFO org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer [pool-6-thread-1] can't find rules for scope 'outlink', using default
2024-10-31 13:05:42,215 INFO org.apache.nutch.parse.ParseSegment [main] ParseSegment: finished, elapsed: 1286 ms
{code}

> Nutch crawling inconsistent on URLs with userinfo
> -------------------------------------------------
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server.
> (The smb plugin pulls credentials from another configuration file, but this
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer.
> So far so good. But the outlinks that are detected are of different quality:
> some have the userinfo preserved, some are missing that information.
> Dumping the segment I can see the below data. Note that some of the outlinks
> start with smb://hi...@nas.fritz.box, while others start with
> smb://nas.fritz.box.
> The impact is that on the next fetch run authentication
> information is missing and the URLs cannot be fetched further.
>
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> <html><head><title>Index of
> /Documents/Hiran/Monitoring/</title></head><body><h1>Index of
> /Documents/Hiran/Monitoring/</h1><pre><a href=".svn/">.svn/ Tue Oct 24
> 13:32:32 CEST 2017</a>
> <a href="architektur.dia">architektur.dia Mon Feb 22 21:30:33 CET 2010</a>
> <a href="architektur.dia%7E">architektur.dia~ Mon Feb 22 21:20:42 CET
> 2010</a>
> <a href="architektur.png">architektur.png Mon Feb 22 21:34:27 CET 2010</a>
> <a href="deployment.dia">deployment.dia Mon Feb 22 22:56:15 CET 2010</a>
> <a href="deployment.dia%7E">deployment.dia~ Mon Feb 22 22:51:21 CET
> 2010</a>
> <a href="deployment.png">deployment.png Mon Feb 22 23:00:34 CET 2010</a>
> <a href="Monitoring+strategy.odt">Monitoring strategy.odt Fri Aug 01
> 13:38:04 CEST 2014</a>
> </pre></body></html>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor:
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor:
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor:
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor:
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> outlink: toUrl:
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
> nutch.segment.name = 20241029225708
> nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
> _fst_ = 33
> nutch.crawl.score = 1.0
> Parse Metadata:
> CharEncodingForConversion = windows-1252
> OriginalCharEncoding = windows-1252
> language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> _pst_=success(1), lastModified=0
> Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2010
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> deployment.png Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> {code}
>
> Addendum: It is ok to have only 5 outlinks from a document with 8 anchors.
> The .svn and the two .png links are ignored.
> My regex-urlfilter.txt looks
> like this:
> {code:java}
> # skip file: ftp: and mailto: urls
> -^(?:file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # skip version control internal files
> -(?i)\.(?:git|svn|cvs)$
> # skip recycle bin URLs
> -(?i)/%23recycle/$
> -/\.svn/
> -/\.git/
> # accept anything else
> +.
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
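
The addendum's count of 5 out of 8 can be checked mechanically against the quoted regex-urlfilter.txt. What follows is only a minimal sketch, not Nutch code. It assumes the rules are tried top to bottom, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is rejected. The class name `OutlinkFilterSketch` and the placeholder userinfo `user` are made up for illustration; the eight hrefs are the ones listed in the TRACE output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: first-match-wins evaluation of the rules quoted in the issue. */
class OutlinkFilterSketch {

    // Rules copied from the regex-urlfilter.txt in the report (comment lines dropped).
    // The first character is the sign: '-' rejects, '+' accepts.
    static final String[] RULES = {
        "-^(?:file|ftp|mailto):",
        "-(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$",
        "-[?*!=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "-(?i)\\.(?:git|svn|cvs)$",
        "-(?i)/%23recycle/$",
        "-/\\.svn/",
        "-/\\.git/",
        "+."
    };

    // Assumption: "user" is a placeholder for the real userinfo, which is elided in the mail.
    static final String BASE = "smb://user@nas.fritz.box/Documents/Hiran/Monitoring/";

    // The eight hrefs from the directory listing.
    static final String[] HREFS = {
        ".svn/", "architektur.dia", "architektur.dia%7E", "architektur.png",
        "deployment.dia", "deployment.dia%7E", "deployment.png", "Monitoring+strategy.odt"
    };

    /** True if the first rule whose pattern is found in the URL is a '+' rule. */
    static boolean accepts(String url) {
        for (String rule : RULES) {
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // assumption: a URL that matches no rule is rejected
    }

    /** Resolves each href against BASE by plain concatenation and keeps the accepted URLs. */
    static List<String> accepted() {
        List<String> kept = new ArrayList<>();
        for (String href : HREFS) {
            String url = BASE + href;
            if (accepts(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (String url : accepted()) {
            System.out.println(url);
        }
    }
}
```

Under those assumptions the `.svn/` link is rejected by `-/\.svn/` and the two `.png` links by the suffix rule, which leaves the five outlinks seen in the segment dump.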