[ https://issues.apache.org/jira/browse/NUTCH-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046777#comment-17046777 ]
Sebastian Nagel commented on NUTCH-2769:
----------------------------------------

Actually, the fragment from the original HTML document
{code:html}
<p><a href="congress_avenue_lighting_improvements.asp" title="Congress Avenue Lighting Improvements" class="bodylinksBold"><img src="images/icon-north-pbc-sm.png" alt="Northern Palm Beach County" width="40" height="40" hspace="5" border="0" align="middle" /><strong>Congress Avenue Lighting Improvements</strong></a></p>
{code}
is "seen" by parse-html as
{code:html}
<P/>
<A class="bodylinksBold" href="congress_avenue_lighting_improvements.asp" title="Congress Avenue Lighting Improvements"/>
<IMG align="middle" alt="Northern Palm Beach County" border="0" height="40" hspace="5" src="images/icon-north-pbc-sm.png" width="40"/>
<STRONG>Congress Avenue Lighting Improvements</STRONG>
<P/>
{code}
while parse-tika normalizes the fragment to
{code:html}
<P>
<A href="congress_avenue_lighting_improvements.asp" shape="rect"><IMG alt="Northern Palm Beach County" height="40" src="images/icon-north-pbc-sm.png" width="40">Congress Avenue Lighting Improvements</A>
</P>
{code}
(see NUTCH-2772 for an easy way to dump the serialized DOM tree)

The class [DOMContentUtils|https://github.com/apache/nutch/blob/ebc215222dc044bd273812cfc9050973e479abbc/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L311] includes heuristics to throw away empty links (links without anchor text). There surely is, or was, a reason for this, so it is not clear how we should fix this.

The HTML parsing libraries (neko or tagsoup) that the parse-html plugin is based on have been unmaintained for many years.
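
To illustrate why the link goes missing, here is a minimal Python sketch. It is not Nutch code: {{collect_outlinks}} is a hypothetical helper, and its "skip anchors without text" check is only a simplified stand-in for the heuristics in DOMContentUtils. The two trees are the serialized fragments shown above, wrapped in an assumed {{<ROOT>}} element (and with {{<IMG>}} self-closed in the Tika variant) so that they parse as well-formed XML.

```python
import xml.etree.ElementTree as ET

# DOM as "seen" by parse-html (copied from the comment above).
# <ROOT> is an assumed wrapper so the fragment is well-formed XML.
PARSE_HTML_TREE = """<ROOT>
<P/>
<A class="bodylinksBold" href="congress_avenue_lighting_improvements.asp" title="Congress Avenue Lighting Improvements"/>
<IMG align="middle" alt="Northern Palm Beach County" border="0" height="40" hspace="5" src="images/icon-north-pbc-sm.png" width="40"/>
<STRONG>Congress Avenue Lighting Improvements</STRONG>
<P/>
</ROOT>"""

# DOM as normalized by parse-tika; <IMG> is self-closed here (assumption)
# only to satisfy the XML parser.
PARSE_TIKA_TREE = """<ROOT>
<P>
<A href="congress_avenue_lighting_improvements.asp" shape="rect"><IMG alt="Northern Palm Beach County" height="40" src="images/icon-north-pbc-sm.png" width="40"/>Congress Avenue Lighting Improvements</A>
</P>
</ROOT>"""


def collect_outlinks(xml_text):
    """Return (href, anchor_text) pairs, dropping anchors without text.

    Hypothetical helper: a simplified stand-in for an "empty link"
    heuristic, not the actual DOMContentUtils logic.
    """
    root = ET.fromstring(xml_text)
    links = []
    for a in root.iter("A"):
        href = a.get("href")
        # Text of all descendants of <A>, including text following child tags.
        anchor = "".join(a.itertext()).strip()
        if href and anchor:
            links.append((href, anchor))
    return links


print(collect_outlinks(PARSE_HTML_TREE))
print(collect_outlinks(PARSE_TIKA_TREE))
```

With the parse-html tree the {{<A>}} element is empty, so the sketch returns no links. With the parse-tika tree the anchor text sits inside {{<A>}} and the link is kept.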

> Nutch 1.15 unable to parse certain outlinks
> --------------------------------------------
>
> Key: NUTCH-2769
> URL: https://issues.apache.org/jira/browse/NUTCH-2769
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15, 1.16
> Reporter: Prajeeth Emanuel
> Priority: Major
>
> Nutch is unable to parse certain outlinks in pages.
> For example:
> Crawling [http://d4fdot.com/pbfdot/PBC-North_index.asp] does not parse the
> outlinks:
> [congress_avenue_lighting_improvements.asp|http://www.d4fdot.com/pbfdot/congress_avenue_lighting_improvements.asp]
> [blue_heron_boulevard_bridge_fender_replacement.asp|http://www.d4fdot.com/pbfdot/blue_heron_boulevard_bridge_fender_replacement.asp]
> [indiantown_road_intersection_improvements.asp|http://www.d4fdot.com/pbfdot/indiantown_road_intersection_improvements.asp]
>
> Crawling [http://www.d4fdot.com/pbfdot/index.asp] however, parses
> [congress_avenue_lighting_improvements.asp|http://www.d4fdot.com/pbfdot/congress_avenue_lighting_improvements.asp]
> correctly even though the Anchor element is structured similarly.
>
> URL filters and normalizers have been modified to barely operate and no URLs
> or outlinks are being ignored in the current config and the error still
> occurs.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)