[ 
https://issues.apache.org/jira/browse/NUTCH-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522390#comment-16522390
 ] 

ASF GitHub Bot commented on NUTCH-2611:
---------------------------------------

YossiTamari opened a new pull request #354: NUTCH-2611: Add line-breaks when 
parsing HTML block-level elements
URL: https://github.com/apache/nutch/pull/354
 
 
   When the configuration property parser.html.line.separators contains a list 
of tags, a newline is added before and after the text content of each listed tag.
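
To illustrate the behavior described above, here is a minimal sketch in plain Java. It is not the code from the pull request: the class name LineSeparatorSketch, the helpers parseTags, collectText and extract, and the comma-separated format of the property value are all assumptions made for illustration, and the JDK's XML parser stands in for Nutch's HTML parsing (so the input must be well-formed XHTML).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

public class LineSeparatorSketch {

    // Hypothetical helper: assumes the property value is a comma-separated
    // tag list such as "div,p,li" (the real format may differ).
    static Set<String> parseTags(String value) {
        Set<String> tags = new HashSet<>();
        for (String t : value.split(",")) {
            String name = t.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                tags.add(name);
            }
        }
        return tags;
    }

    // Walks the DOM depth-first, appending text nodes as-is and wrapping the
    // text content of every listed element in newlines.
    static void collectText(Node node, Set<String> blockTags, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        boolean block = node.getNodeType() == Node.ELEMENT_NODE
                && blockTags.contains(node.getNodeName().toLowerCase(Locale.ROOT));
        if (block) {
            out.append('\n');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, blockTags, out);
        }
        if (block) {
            out.append('\n');
        }
    }

    // Parses a well-formed XHTML fragment and returns its extracted text.
    static String extract(String xhtml, String tagList) throws Exception {
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        StringBuilder out = new StringBuilder();
        collectText(root, parseTags(tagList), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<body><div>first</div><div>second</div></body>";
        // Empty tag list: adjacent blocks run together.
        System.out.println(extract(html, "").replace("\n", "\\n"));
        // With "div" listed: each div's text is surrounded by newlines.
        System.out.println(extract(html, "div").replace("\n", "\\n"));
    }
}
```

With an empty tag list the two div elements collapse into "firstsecond"; with "div" listed, the words end up on separate lines, which is the distinction the issue asks for.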



> Add line-breaks when parsing HTML block-level elements
> ------------------------------------------------------
>
>                 Key: NUTCH-2611
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2611
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Yossi Tamari
>            Priority: Major
>
> Currently, the HTML and Tika parsers add a newline only after text nodes 
> that consist solely of whitespace (e.g. </span> <span>), without considering 
> which tags are involved, so for example </div><div> does not add a newline.
> While some applications do not differentiate between a space and a newline, 
> many others rely on the semantic difference (two consecutive words in the same 
> sentence are "near" each other, but words in separate sentences are not).
> I believe adding newlines after block-level HTML elements, while not a 
> panacea, will be an improvement on the current behavior.
> NUTCH-2318 is related to this.



