[
https://issues.apache.org/jira/browse/NUTCH-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524034#comment-16524034
]
ASF GitHub Bot commented on NUTCH-2611:
---------------------------------------
sebastian-nagel commented on issue #354: NUTCH-2611: Add line-breaks when
parsing HTML block-level elements
URL: https://github.com/apache/nutch/pull/354#issuecomment-400400990
+1 lgtm. The plain-text layout is indeed more readable now, with line breaks
after headings, <p>, etc. I will commit soon if there are no objections.
Thanks, @YossiTamari!
> Add line-breaks when parsing HTML block-level elements
> ------------------------------------------------------
>
> Key: NUTCH-2611
> URL: https://issues.apache.org/jira/browse/NUTCH-2611
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Reporter: Yossi Tamari
> Priority: Major
>
> Currently, the HTML and Tika parsers add newlines only after text nodes
> that contain only whitespace (e.g. </span> <span>), not based on what the
> tags are, so, for example, </div><div> will not add a newline.
> While some applications do not differentiate between a space and a newline,
> many others see the semantic difference (two adjacent words in the same
> sentence are "near" each other, but words in separate sentences are not).
> I believe adding newlines after block-level HTML elements, while not a
> panacea, will be an improvement on the current behavior.
> NUTCH-2318 is related to this.
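
The behavior proposed in the description can be illustrated with a small
standalone sketch. This is not the Nutch or Tika parser code: the class name
BlockAwareTextExtractor, the BLOCK_TAGS set, and the use of the JDK's XML DOM
parser on well-formed input are all assumptions made for illustration. It
walks a DOM tree, copies text nodes, and appends a newline after each
element whose tag is treated as block-level.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not the actual Nutch implementation.
public class BlockAwareTextExtractor {

    // Illustrative, incomplete set of tags treated as block-level (assumption).
    private static final Set<String> BLOCK_TAGS = Set.of(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "li", "ul", "ol", "table", "tr", "blockquote", "pre");

    // Parses well-formed XHTML and returns its text with a newline
    // appended after every block-level element.
    public static String extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder sb = new StringBuilder();
        walk(doc.getDocumentElement(), sb);
        return sb.toString();
    }

    // Depth-first traversal: emit text nodes as-is, recurse into children,
    // then add a line break if the element just closed is block-level.
    private static void walk(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, sb);
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && BLOCK_TAGS.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            sb.append('\n');
        }
    }

    public static void main(String[] args) throws Exception {
        // Adjacent <div> elements are separated by a newline;
        // adjacent inline <span> elements are not.
        System.out.print(extract(
                "<body><div>first</div><div>second</div>"
                + "<span>a</span><span>b</span></body>"));
    }
}
```

With this approach, text from </div><div> boundaries ends up on separate
lines, while text from adjacent inline elements stays on the same line. A
real parser would also need to handle malformed HTML and whitespace
normalization, which this sketch leaves out.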