[ 
https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894547#comment-17894547
 ] 

Hiran Chaudhuri commented on NUTCH-3087:
----------------------------------------

With logging turned up to the most verbose level (TRACE), I now get the output 
below. It does not say much about the impact, though...
{code:java}
2024-10-31 13:05:41,713 DEBUG org.apache.nutch.parse.ParseUtil [LocalJobRunner 
Map Task Executor #0] Parsing 
[smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/] with 
[org.apache.nutch.parse.html.HtmlParser@2422ff89]
2024-10-31 13:05:41,729 TRACE org.apache.nutch.util.EncodingDetector [parse-0] 
smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.: Choosing encoding: 
windows-1252 (default)
2024-10-31 13:05:41,729 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] 
Parsing...
2024-10-31 13:05:41,762 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] 
Meta tags for smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.: 
base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, 
refreshHref=null
 * general tags:
 * http-equiv tags:
2024-10-31 13:05:41,762 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] 
Getting text...
2024-10-31 13:05:41,763 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] 
Getting title...
2024-10-31 13:05:41,763 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] 
Getting links...
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0] 
found 8 outlinks in smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.svn/ anchor: 
.svn/ Tue Oct 24 13:32:32 CEST 2017
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia 
anchor: architektur.dia Mon Feb 22 21:30:33 CET 2010
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: 
smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia%7E anchor: 
architektur.dia~ Mon Feb 22 21:20:42 CET 2010
2024-10-31 13:05:41,764 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.png 
anchor: architektur.png Mon Feb 22 21:34:27 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia 
anchor: deployment.dia Mon Feb 22 22:56:15 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: 
smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia%7E anchor: 
deployment.dia~ Mon Feb 22 22:51:21 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.png 
anchor: deployment.png Mon Feb 22 23:00:34 CET 2010
2024-10-31 13:05:41,765 TRACE org.apache.nutch.parse.html.HtmlParser [parse-0]  
 -> toUrl: 
smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt 
anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
2024-10-31 13:05:41,772 INFO org.apache.nutch.crawl.SignatureFactory 
[LocalJobRunner Map Task Executor #0] Using Signature impl: 
org.apache.nutch.crawl.MD5Signature
2024-10-31 13:05:41,773 INFO org.apache.nutch.parse.ParseSegment 
[LocalJobRunner Map Task Executor #0] Parsed (479ms): 
smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
2024-10-31 13:05:41,797 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl 
[pool-6-thread-1] JobTracker metrics system already initialized!
2024-10-31 13:05:41,829 INFO org.apache.nutch.urlfilter.regex.RegexURLFilter 
[pool-6-thread-1] Reading urlfilter-regex rules file: regex-urlfilter.txt
2024-10-31 13:05:41,830 INFO org.apache.nutch.urlfilter.api.RegexURLFilterBase 
[pool-6-thread-1] Read 9 regex rules 
(org.apache.nutch.urlfilter.regex.RegexURLFilter)
2024-10-31 13:05:41,830 DEBUG org.apache.nutch.util.ObjectCache 
[pool-6-thread-1] No object cache found for conf=Configuration: 
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, 
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-rbf-default.xml, 
hdfs-site.xml, hdfs-rbf-site.xml, 
file:/tmp/hadoop-hiran/mapred/local/localRunner/hiran/job_local261962044_0001/job_local261962044_0001.xml,
 instantiating a new object cache
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 33 (!) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 35 (#) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 36 ($) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 37 (%) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 38 (&) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 39 (') not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 40 (() not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 41 ()) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 42 (*) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 43 (+) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 44 (,) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 47 (/) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 58 (:) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 59 (;) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 61 (=) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 63 (?) not handled as escaped or unescaped
2024-10-31 13:05:41,832 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 64 (@) not handled as escaped or unescaped
2024-10-31 13:05:41,833 DEBUG 
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer [pool-6-thread-1] 
Character 92 (\) not handled as escaped or unescaped
2024-10-31 13:05:41,867 INFO 
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer [pool-6-thread-1] 
can't find rules for scope 'outlink', using default
2024-10-31 13:05:42,215 INFO org.apache.nutch.parse.ParseSegment [main] 
ParseSegment: finished, elapsed: 1286 ms
 {code}
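
In the TRACE output above, all 8 outlinks still carry the userinfo when 
HtmlParser reports them, including the two ending in %7E. In the segment dump 
quoted below, those two show up with a literal ~ and without userinfo. That 
is only an observation from the two listings, not a confirmed cause. To spot 
such cases mechanically, a minimal sketch along these lines could help. The 
class UserinfoCheck, its method, and the host and user names are hypothetical 
placeholders; only java.net.URI from the JDK is used, and nothing here is 
Nutch code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Diagnostic sketch (hypothetical helper, not part of Nutch). */
class UserinfoCheck {

    /** Returns the outlinks whose userinfo differs from that of the base URL. */
    static List<String> outlinksWithChangedUserinfo(String base, List<String> outlinks) {
        // getUserInfo() returns null when the authority has no userinfo part.
        String expected = URI.create(base).getUserInfo();
        List<String> changed = new ArrayList<>();
        for (String link : outlinks) {
            String actual = URI.create(link).getUserInfo();
            if (!Objects.equals(expected, actual)) {
                changed.add(link);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Placeholder user and host, NOT the values from the report.
        String base = "smb://user@nas.example/Documents/Monitoring/";
        List<String> outlinks = List.of(
            "smb://user@nas.example/Documents/Monitoring/architektur.dia",
            "smb://nas.example/Documents/Monitoring/architektur.dia~");
        for (String link : outlinksWithChangedUserinfo(base, outlinks)) {
            System.out.println("userinfo changed: " + link);
        }
    }
}
```

Feeding it the base URL and the outlinks taken from a segment dump would list 
exactly those outlinks that can no longer authenticate on the next fetch.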

> Nutch crawling inconsistent on URLs with userinfo
> -------------------------------------------------
>
>                 Key: NUTCH-3087
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3087
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I am trying to scan the URL
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server. 
> (The smb plugin pulls credentials from another configuration file, but this 
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer. 
> So far so good. But the detected outlinks are inconsistent: some preserve 
> the userinfo, others have lost it.
> Dumping the segment shows the data below. Note that some of the outlinks 
> start with smb://hi...@nas.fritz.box, while others start with 
> smb://nas.fritz.box. The impact is that on the next fetch run the 
> authentication information is missing and those URLs cannot be fetched.
>  
> {code:java}
> Recno:: 0
> URL:: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata: 
>      _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> <html><head><title>Index of 
> /Documents/Hiran/Monitoring/</title></head><body><h1>Index of 
> /Documents/Hiran/Monitoring/</h1><pre><a href=".svn/">.svn/    Tue Oct 24 
> 13:32:32 CEST 2017</a>
> <a href="architektur.dia">architektur.dia    Mon Feb 22 21:30:33 CET 2010</a>
> <a href="architektur.dia%7E">architektur.dia~    Mon Feb 22 21:20:42 CET 
> 2010</a>
> <a href="architektur.png">architektur.png    Mon Feb 22 21:34:27 CET 2010</a>
> <a href="deployment.dia">deployment.dia    Mon Feb 22 22:56:15 CET 2010</a>
> <a href="deployment.dia%7E">deployment.dia~    Mon Feb 22 22:51:21 CET 
> 2010</a>
> <a href="deployment.png">deployment.png    Mon Feb 22 23:00:34 CET 2010</a>
> <a href="Monitoring+strategy.odt">Monitoring strategy.odt    Fri Aug 01 
> 13:38:04 CEST 2014</a>
> </pre></body></html>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor: 
> architektur.dia Mon Feb 22 21:30:33 CET 2010
>   outlink: toUrl: 
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor: 
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor: 
> deployment.dia Mon Feb 22 22:56:15 CET 2010
>   outlink: toUrl: 
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor: 
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
>   outlink: toUrl: 
> smb://hi...@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt 
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
>   nutch.segment.name = 20241029225708
>   nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
>   _fst_ = 33
>   nutch.crawl.score = 1.0
> Parse Metadata:
>   CharEncodingForConversion = windows-1252
>   OriginalCharEncoding = windows-1252
>   language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata: 
>  
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata: 
>      _ngt_=1730239026566
>     _pst_=success(1), lastModified=0
>     Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2010
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> deployment.png Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
>  {code}
>  
> Addendum: It is ok to have only 5 outlinks from a document with 8 anchors. 
> The .svn and the two .png links are ignored. My regex-urlfilter.txt looks 
> like this:
> {code:java}
> # skip file: ftp: and mailto: urls
> -^(?:file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # skip version control internal files
> -(?i)\.(?:git|svn|cvs)$
> # skip recycle bin URLs
> -(?i)/%23recycle/$
> -/\.svn/
> -/\.git/
> # accept anything else
> +.
>  {code}
>  
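
The count of 5 outlinks from 8 anchors in the addendum above can be checked 
with a small sketch. It assumes that the rules are tried top to bottom, that 
the first rule whose pattern is found anywhere in the URL decides (+ accepts, 
- rejects), and that a URL matching no rule is rejected. The class 
RegexRuleSketch and the host and user names are hypothetical placeholders; 
this only imitates the rule file and does not call Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of first-match-wins evaluation of +/- regex rules (not Nutch code). */
class RegexRuleSketch {

    /** The nine rules from the regex-urlfilter.txt above, comment lines omitted. */
    static final List<String> RULES = List.of(
        "-^(?:file|ftp|mailto):",
        "-(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$",
        "-[?*!=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "-(?i)\\.(?:git|svn|cvs)$",
        "-(?i)/%23recycle/$",
        "-/\\.svn/",
        "-/\\.git/",
        "+.");

    /** Assumption: the first rule whose pattern is found in the URL decides. */
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return accept;
            }
        }
        return false; // no rule matched
    }

    /** Returns the URLs that pass the rules, in input order. */
    static List<String> accepted(List<String> urls) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            if (accepts(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder user and host, NOT the values from the report.
        String base = "smb://user@nas.example/Documents/Monitoring/";
        List<String> names = List.of(".svn/", "architektur.dia", "architektur.dia%7E",
            "architektur.png", "deployment.dia", "deployment.dia%7E",
            "deployment.png", "Monitoring+strategy.odt");
        List<String> urls = new ArrayList<>();
        for (String name : names) {
            urls.add(base + name);
        }
        accepted(urls).forEach(System.out::println);
    }
}
```

Under these assumptions the .svn/ link is rejected by the /\.svn/ rule and the 
two .png links by the suffix rule, which leaves the five outlinks listed in 
the ParseData section.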



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
