Felix Zett created NUTCH-2318:
---------------------------------

             Summary: Text extraction in HtmlParser adds too much whitespace.
                 Key: NUTCH-2318
                 URL: https://issues.apache.org/jira/browse/NUTCH-2318
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.3.1
            Reporter: Felix Zett


In parse-html, org.apache.nutch.parse.html.HtmlParser will call 
DOMContentUtils.getText() to extract the text content. For every text node 
encountered in the document, the getTextHelper() function will first add a 
space character to the already extracted text and then the text content itself 
(stripped of excess whitespace). This means that parsing HTML such as

{{<p>behavi<em>ou</em>r</p>}}

will lead to this extracted text:

{{behavi ou r}}
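
To make the effect easy to reproduce outside Nutch, here is a simplified model of the behaviour described above. It is not the actual DOMContentUtils source: the class name {{NaiveTextSketch}} and its methods are hypothetical, and it uses the JDK's XML parser on a well-formed fragment instead of Nutch's HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical class for illustration only; it is not part of Nutch.
public class NaiveTextSketch {

  // Visits every node; each non-empty text node is trimmed and
  // separated from the text collected so far by a single space.
  public static void collect(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' '); // added even between inline fragments of one word
        }
        sb.append(text);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), sb);
    }
  }

  // Assumption: the input is a well-formed XML fragment with one root element.
  public static String extract(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
      StringBuilder sb = new StringBuilder();
      collect(doc.getDocumentElement(), sb);
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse: " + xhtml, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(extract("<p>behavi<em>ou</em>r</p>")); // prints "behavi ou r"
  }
}
```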

I would have expected a parser not to add whitespace to content that visually 
(and actually) does not contain any in the first place. This applies to all 
similar semantic tags as well as {{<span>}}.

My naive approach would be to remove the lines {{text = text.trim()}} and 
{{sb.append(' ')}}, but I'm aware that this will lead to bad parsing of stuff 
like {{<p>foo</p><p>bar</p>}}.
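
One possible middle ground, sketched below under stated assumptions, is to insert a separator only where the source actually had whitespace or where a block-level element begins or ends. This is not a patch against DOMContentUtils: the class {{BlockAwareTextSketch}}, its {{BLOCK}} tag list (deliberately incomplete), and the use of the JDK's XML parser on a well-formed fragment are all illustrative choices.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical class for illustration only; it is not part of Nutch.
public class BlockAwareTextSketch {

  // Assumption: a small, incomplete set of tags treated as block-level.
  private static final Set<String> BLOCK = new HashSet<>(Arrays.asList(
      "p", "div", "br", "li", "ul", "ol", "table", "tr", "td", "th",
      "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"));

  private final StringBuilder sb = new StringBuilder();
  private boolean pendingSpace = false; // a separator is owed before the next text

  private void walk(Node node) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().replaceAll("\\s+", " ");
      String trimmed = text.trim();
      if (trimmed.isEmpty()) {
        // Whitespace-only node: remember it, but emit nothing yet.
        if (!text.isEmpty()) {
          pendingSpace = true;
        }
        return;
      }
      if (text.startsWith(" ")) {
        pendingSpace = true;
      }
      if (pendingSpace && sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(trimmed);
      pendingSpace = text.endsWith(" ");
      return;
    }
    boolean block = node.getNodeType() == Node.ELEMENT_NODE
        && BLOCK.contains(node.getNodeName().toLowerCase(Locale.ROOT));
    if (block) {
      pendingSpace = true; // block start separates from preceding text
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i));
    }
    if (block) {
      pendingSpace = true; // block end separates from following text
    }
  }

  // Assumption: the input is a well-formed XML fragment with one root element.
  public static String extract(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
      BlockAwareTextSketch sketch = new BlockAwareTextSketch();
      sketch.walk(doc.getDocumentElement());
      return sketch.sb.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse: " + xhtml, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(extract("<p>behavi<em>ou</em>r</p>"));       // prints "behaviour"
    System.out.println(extract("<div><p>foo</p><p>bar</p></div>")); // prints "foo bar"
  }
}
```

The two paragraphs are wrapped in a {{<div>}} only because the XML parser used here requires a single root element.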

This is not an issue in parse-tika, since tika removes all "unimportant" tags 
beforehand. However, I'd like to keep using parse-html because I need to keep 
the document reasonably intact for parse filters applied later.

I know I could write a parse filter that re-extracts the text content, but 
this feels like a bug (or at least a shortcoming) in parse-html.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
