[
https://issues.apache.org/jira/browse/NUTCH-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894544#comment-17894544
]
Hiran Chaudhuri edited comment on NUTCH-3087 at 10/31/24 12:00 PM:
-------------------------------------------------------------------
urlnormalizer? Maybe that is the case. How do I find out whether/which one is
active?
Aha! This is in my log when running the parse step:
{code:java}
2024-10-31 12:55:44,761 INFO org.apache.nutch.plugin.PluginManifestParser
[main] Plugins: looking in:
/home/hiran/NetBeansProjects/nutch/runtime/local/plugins
2024-10-31 12:55:44,833 INFO org.apache.nutch.plugin.PluginRepository [main]
Plugin Auto-activation mode: [true]
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Registered Plugins:
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Regex URL Filter (urlfilter-regex)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Html Parse Plug-in (parse-html)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
the nutch core extension points (nutch-extensionpoints)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Basic Indexing Filter (index-basic)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Anchor Indexing Filter (index-anchor)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Tika Parser Plug-in (parse-tika)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Basic URL Normalizer (urlnormalizer-basic)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Index Static (index-static)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Top Level Domain Plugin (tld)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Regex URL Filter Framework (lib-regex-filter)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Language Identification Parser/Filter (language-identifier)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Regex URL Normalizer (urlnormalizer-regex)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
CyberNeko HTML Parser (lib-nekohtml)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Subcollection indexing and query filter (subcollection)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
URL Meta Indexing Filter (urlmeta)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
OPIC Scoring Plug-in (scoring-opic)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Pass-through URL Normalizer (urlnormalizer-pass)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
SMB Protocol based on https://github.com/hierynomus/smbj (protocol-smb)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
More Indexing Filter (index-more)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
SolrIndexWriter (indexer-solr)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Creative Commons Plugins (creativecommons)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Replace Indexer (index-replace)
2024-10-31 12:55:44,834 INFO org.apache.nutch.plugin.PluginRepository [main]
Registered Extension-Points:
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Content Parser)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch URL Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(HTML Parse Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Scoring)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch URL Normalizer)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Publisher)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Exchange)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Protocol)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch URL Ignore Exemption Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Index Writer)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Segment Merge Filter)
2024-10-31 12:55:44,835 INFO org.apache.nutch.plugin.PluginRepository [main]
(Nutch Indexing Filter)
2024-10-31 12:55:44,897 WARN org.apache.hadoop.util.NativeCodeLoader [main]
Unable to load native-hadoop library for your platform... using builtin-java
classes where applicable
2024-10-31 12:55:44,949 INFO org.apache.nutch.parse.ParseSegment [main]
ParseSegment: starting
2024-10-31 12:55:44,949 INFO org.apache.nutch.parse.ParseSegment [main]
ParseSegment: segment: crawl/segments/20241031125529
{code}
So it is confirmed: There are several normalizer plugins active:
* urlnormalizer-basic
* urlnormalizer-regex
* urlnormalizer-pass
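
Since several normalizers are active, one plausible way for the userinfo to get lost is a normalizer that takes the URL apart and reassembles it from its host rather than from its full authority. The sketch below is a hypothetical illustration using only java.net.URI. It is not code from Nutch, and the class and method names are made up. The URL is written out from the userinfo 'hiran' and the host nas.fritz.box mentioned in the issue description.

```java
import java.net.URI;

// Hypothetical illustration only; this is NOT code taken from Nutch.
final class UserinfoSketch {

    // Rebuilds a URL from scheme, host and (decoded) path.
    // URI.getHost() never contains the userinfo, so "hiran@" is dropped.
    static String rebuildFromHost(URI u) {
        return u.getScheme() + "://" + u.getHost() + u.getPath();
    }

    // Rebuilds from the full authority (userinfo@host[:port]),
    // which keeps the userinfo.
    static String rebuildFromAuthority(URI u) {
        return u.getScheme() + "://" + u.getRawAuthority() + u.getPath();
    }

    public static void main(String[] args) {
        URI u = URI.create(
            "smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia%7E");
        System.out.println(u.getUserInfo());          // userinfo component
        System.out.println(rebuildFromHost(u));       // userinfo lost
        System.out.println(rebuildFromAuthority(u));  // userinfo kept
    }
}
```

With this input, the host-based rebuild yields a URL starting with smb://nas.fritz.box, which has the same shape as the outlinks that lost their userinfo. Whether any of the three active normalizers actually does this would need to be checked in their source.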
> Nutch crawling inconsistent on URLs with userinfo
> -------------------------------------------------
>
> Key: NUTCH-3087
> URL: https://issues.apache.org/jira/browse/NUTCH-3087
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I am trying to scan the URL
> smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/
> Note the userinfo 'hiran', which is used for authentication on the server.
> (The smb plugin pulls credentials from another configuration file, but this
> is irrelevant here).
> The URL is fetched, parsed, updated in the crawldb and sent to the indexer.
> So far so good. But the outlinks that are detected are of different quality:
> some have the userinfo preserved, some are missing that information.
> Dumping the segment, I see the data below. Note that some of the outlinks
> start with smb://hiran@nas.fritz.box, while others start with
> smb://nas.fritz.box. As a result, the authentication information is missing
> on the next fetch run and those URLs cannot be fetched.
>
> {code:java}
> Recno:: 0
> URL:: smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Oct 29 22:56:58 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> Content::
> Version: -1
> url: smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/
> base: smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/.
> contentType: text/html
> metadata: nutch.segment.name=20241029225708 _fst_=33 nutch.crawl.score=1.0
> Content:
> <html><head><title>Index of
> /Documents/Hiran/Monitoring/</title></head><body><h1>Index of
> /Documents/Hiran/Monitoring/</h1><pre><a href=".svn/">.svn/ Tue Oct 24
> 13:32:32 CEST 2017</a>
> <a href="architektur.dia">architektur.dia Mon Feb 22 21:30:33 CET 2010</a>
> <a href="architektur.dia%7E">architektur.dia~ Mon Feb 22 21:20:42 CET
> 2010</a>
> <a href="architektur.png">architektur.png Mon Feb 22 21:34:27 CET 2010</a>
> <a href="deployment.dia">deployment.dia Mon Feb 22 22:56:15 CET 2010</a>
> <a href="deployment.dia%7E">deployment.dia~ Mon Feb 22 22:51:21 CET
> 2010</a>
> <a href="deployment.png">deployment.png Mon Feb 22 23:00:34 CET 2010</a>
> <a href="Monitoring+strategy.odt">Monitoring strategy.odt Fri Aug 01
> 13:38:04 CEST 2014</a>
> </pre></body></html>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /Documents/Hiran/Monitoring/
> Outlinks: 5
> outlink: toUrl:
> smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia anchor:
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/architektur.dia~ anchor:
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> outlink: toUrl:
> smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia anchor:
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> outlink: toUrl:
> smb://nas.fritz.box/Documents/Hiran/Monitoring/deployment.dia~ anchor:
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> outlink: toUrl:
> smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/Monitoring+strategy.odt
> anchor: Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> Content Metadata:
> nutch.segment.name = 20241029225708
> nutch.content.digest = a794c6675cb2f9e460e7771060ed2dfc
> _fst_ = 33
> nutch.crawl.score = 1.0
> Parse Metadata:
> CharEncodingForConversion = windows-1252
> OriginalCharEncoding = windows-1252
> language = en
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Oct 29 22:57:25 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 0.0
> Signature: a794c6675cb2f9e460e7771060ed2dfc
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Oct 29 22:57:17 CET 2024
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 1.0
> Signature: null
> Metadata:
> _ngt_=1730239026566
> _pst_=success(1), lastModified=0
> Content-Type=text/html
> ParseText::
> Index of /Documents/Hiran/Monitoring/
> Index of /Documents/Hiran/Monitoring/
> .svn/ Tue Oct 24 13:32:32 CEST 2017
> architektur.dia Mon Feb 22 21:30:33 CET 2010
> architektur.dia~ Mon Feb 22 21:20:42 CET 2010
> architektur.png Mon Feb 22 21:34:27 CET 2010
> deployment.dia Mon Feb 22 22:56:15 CET 2010
> deployment.dia~ Mon Feb 22 22:51:21 CET 2010
> deployment.png Mon Feb 22 23:00:34 CET 2010
> Monitoring strategy.odt Fri Aug 01 13:38:04 CEST 2014
> {code}
>
> Addendum: It is ok to have only 5 outlinks from a document with 8 anchors.
> The .svn and the two .png links are ignored. My regex-urlfilter.txt looks
> like this:
> {code:java}
> # skip file: ftp: and mailto: urls
> -^(?:file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # skip version control internal files
> -(?i)\.(?:git|svn|cvs)$
> # skip recycle bin URLs
> -(?i)/%23recycle/$
> -/\.svn/
> -/\.git/
> # accept anything else
> +.
> {code}
>
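
The count of 5 outlinks can be reproduced by hand. The sketch below rests on an assumption about how the rules are applied: the first matching rule wins, patterns are searched anywhere in the URL, and a URL matching no rule is rejected. It is not the urlfilter-regex implementation, and RegexFilterSketch is a hypothetical name. The hrefs are taken from the Content section of the dump and are simply appended to the base URL.

```java
import java.util.regex.Pattern;

// Hypothetical re-implementation for illustration; NOT the Nutch plugin code.
final class RegexFilterSketch {

    // Rules copied from the regex-urlfilter.txt above: '+' accepts, '-' rejects.
    static final String[] RULES = {
        "-^(?:file|ftp|mailto):",
        "-(?i)\\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$",
        "-[?*!=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "-(?i)\\.(?:git|svn|cvs)$",
        "-(?i)/%23recycle/$",
        "-/\\.svn/",
        "-/\\.git/",
        "+."
    };

    // Assumed semantics: first rule whose pattern is found in the URL decides.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean accept = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return accept;
            }
        }
        return false; // no rule matched
    }

    // Counts how many hrefs, appended to the base URL, pass the filter.
    static int countAccepted(String base, String[] hrefs) {
        int n = 0;
        for (String href : hrefs) {
            if (accepts(base + href)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String base = "smb://hiran@nas.fritz.box/Documents/Hiran/Monitoring/";
        String[] hrefs = {
            ".svn/", "architektur.dia", "architektur.dia%7E", "architektur.png",
            "deployment.dia", "deployment.dia%7E", "deployment.png",
            "Monitoring+strategy.odt"
        };
        System.out.println(countAccepted(base, hrefs)); // prints 5
    }
}
```

Under these assumptions the .svn/ link is rejected by the `-/\.svn/` rule and the two .png links by the suffix rule, leaving the five outlinks seen in ParseData.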