[jira] [Commented] (NUTCH-2769) Nutch 1.15 unable to parse certain outlinks

Sebastian Nagel (Jira) Thu, 27 Feb 2020 08:22:18 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046777#comment-17046777
 ]


Sebastian Nagel commented on NUTCH-2769:
----------------------------------------

Actually, the fragment from the original HTML document
{code:html}
<p><a href="congress_avenue_lighting_improvements.asp" title="Congress Avenue 
Lighting Improvements" class="bodylinksBold"><img 
src="images/icon-north-pbc-sm.png" alt="Northern Palm Beach County" width="40" 
height="40" hspace="5" border="0" align="middle" /><strong>Congress Avenue 
Lighting Improvements</strong></a></p>
{code}
is "seen" by parse-html as
{code:html}
<P/>
<A class="bodylinksBold" href="congress_avenue_lighting_improvements.asp" 
title="Congress Avenue Lighting Improvements"/>
<IMG align="middle" alt="Northern Palm Beach County" border="0" height="40" 
hspace="5" src="images/icon-north-pbc-sm.png" width="40"/>
<STRONG>Congress Avenue Lighting Improvements</STRONG>
<P/>
{code}
while parse-tika normalizes the fragment to
{code:html}
<P>
<A href="congress_avenue_lighting_improvements.asp" shape="rect"><IMG 
alt="Northern Palm Beach County" height="40" src="images/icon-north-pbc-sm.png" 
width="40">Congress Avenue Lighting Improvements</A>
</P>
{code}
(see NUTCH-2772 to easily dump the serialized DOM tree)

The class 
[DOMContentUtils|https://github.com/apache/nutch/blob/ebc215222dc044bd273812cfc9050973e479abbc/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L311]
 includes heuristics to throw away empty links (no anchor text), which would 
explain why the link, seen by parse-html as an empty <A/> element, is dropped. 
Surely there is, or was, a reason for these heuristics, so it's not clear how 
we should fix this. The HTML parsing libraries (neko or tagsoup) on which the 
parse-html plugin is based have been unmaintained for many years.

> Nutch 1.15 unable to parse certain outlinks 
> --------------------------------------------
>
>                 Key: NUTCH-2769
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2769
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15, 1.16
>            Reporter: Prajeeth Emanuel
>            Priority: Major
>
> Nutch is unable to parse certain outlinks in pages. 
> For example:
> Crawling [http://d4fdot.com/pbfdot/PBC-North_index.asp] does not parse the 
> outlinks: 
> [congress_avenue_lighting_improvements.asp|http://www.d4fdot.com/pbfdot/congress_avenue_lighting_improvements.asp]
> [blue_heron_boulevard_bridge_fender_replacement.asp|http://www.d4fdot.com/pbfdot/blue_heron_boulevard_bridge_fender_replacement.asp]
> [indiantown_road_intersection_improvements.asp|http://www.d4fdot.com/pbfdot/indiantown_road_intersection_improvements.asp]
>  
> Crawling [http://www.d4fdot.com/pbfdot/index.asp] however, parses 
> [congress_avenue_lighting_improvements.asp|http://www.d4fdot.com/pbfdot/congress_avenue_lighting_improvements.asp]
>  correctly even though the Anchor element is structured similarly. 
>  
> URL filters and normalizers have been reduced to a minimum, so no URLs or 
> outlinks are being ignored in the current config, yet the error still 
> occurs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
