Olivier Tavard created CONNECTORS-1660:
------------------------------------------

             Summary: Patch for MCF HTML extractor connector
                 Key: CONNECTORS-1660
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
             Project: ManifoldCF
          Issue Type: Improvement
          Components: HTML extractor
            Reporter: Olivier Tavard
         Attachments: patch_html_extractor_connector_02_12_2020.txt

Hello,

Here is a patch for the HTML extractor connector. It changes how text is
extracted, both with and without HTML stripping:
[^patch_html_extractor_connector_02_12_2020.txt]
 * Extraction of HTML code: I added a whitelist through the Jsoup cleaner to
define which HTML elements are allowed, in order to improve security. In the
code, I set it to “relaxed”:

This whitelist allows a full range of text and structural body HTML: a, b, 
blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, 
h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, 
sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

(more details here:
[https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed()])

A future improvement would be to add a new parameter to the interface to
select which whitelist to use.

 
 * Extraction of text with HTML stripping activated: only text nodes are kept,
so all HTML is stripped (same behavior as before). The change is that the Jsoup
pretty print option is now set to false, in order to keep line breaks.
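
For illustration only, the two extraction modes described above can be
sketched with the Java standard library alone. This is not the code from the
attached patch and it does not use Jsoup: the class AllowlistSketch and the
methods keepAllowedTags and stripAllTags are hypothetical names, and the
regular expression is a simplification that assumes reasonably well-formed
tags. The tag names are the ones from the "relaxed" list quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of ManifoldCF or Jsoup.
final class AllowlistSketch {

    // Tag names copied from the "relaxed" list quoted in this message.
    static final Set<String> RELAXED = new HashSet<>(Arrays.asList(
        "a", "b", "blockquote", "br", "caption", "cite", "code", "col",
        "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4",
        "h5", "h6", "i", "img", "li", "ol", "p", "pre", "q", "small", "span",
        "strike", "strong", "sub", "sup", "table", "tbody", "td", "tfoot",
        "th", "thead", "tr", "u", "ul"));

    // Matches an opening or closing tag and captures the tag name.
    static final Pattern TAG =
        Pattern.compile("</?\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Mode 1: keep allowed tags (attributes dropped), remove every other tag.
    static String keepAllowedTags(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            String kept = "";
            if (RELAXED.contains(name)) {
                kept = (m.group().startsWith("</") ? "</" : "<") + name + ">";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(kept));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Mode 2: remove every tag, leaving the text and its line breaks untouched.
    static String stripAllTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html =
            "<p onclick=\"x()\">Hi <script>bad()</script><b>there</b></p>";
        // prints: <p>Hi bad()<b>there</b></p>
        System.out.println(keepAllowedTags(html));
        // prints "line one" and "line two" on separate lines
        System.out.println(stripAllTags("<p>line one\nline two</p>"));
    }
}
```

Unlike a real HTML cleaner, this sketch leaves the text inside removed
elements such as script in place and drops all attributes, so it only shows
the allowlist principle and should not be used as a sanitizer.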

 

Best regards



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
