[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML

Julien Massiera (Jira) Wed, 21 Oct 2020 08:45:10 -0700


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218358#comment-17218358
 ]


Julien Massiera commented on CONNECTORS-1656:
---------------------------------------------

Hi [[email protected]],

The document produced identifies itself as XHTML. But even if it were HTML, the 
default HTML parser of Tika uses SAX to parse documents. 
Here is the configuration of the Tika HTML parser (default configuration): 

HtmlParser

Class: org.apache.tika.parser.html.HtmlParser

Mime Types:

text/html
application/vnd.wap.xhtml+xml
application/x-asp
application/xhtml+xml

So, since it handles both HTML and XHTML, the processed files have to be valid 
XML anyway.
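
To illustrate the kind of failure described in the issue, here is a minimal sketch of what a strict XML parser does with an unclosed img tag. It uses only the JDK's built-in SAX parser, not Tika, so it is not a reproduction of Tika's behaviour; the class name WellFormedCheck and the helper isWellFormed are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only: checks XML well-formedness
// with the JDK SAX parser (this is NOT Tika's HtmlParser).
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors such as mismatched tags.
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (Exception e) {
            // I/O or parser configuration problems are unexpected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but the img element is never closed.
        String html = "<html><body><img src=\"a.png\"></body></html>";
        // Same content with a self-closed img element.
        String xhtml = "<html><body><img src=\"a.png\"/></body></html>";

        System.out.println("unclosed img:    " + isWellFormed(html));  // false
        System.out.println("self-closed img: " + isWellFormed(xhtml)); // true
    }
}
```

The only difference between the two inputs is the closing of the img element, which is the case the issue description mentions.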

> HTML extractor produces invalid XML
> -----------------------------------
>
>                 Key: CONNECTORS-1656
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: HTML extractor
>    Affects Versions: ManifoldCF 2.17
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Major
>
> The HTML extractor connector produces a valid HTML document (when the 
> 'Strip HTML' option is disabled) but invalid XML (some tags, like img, do not 
> have a closing tag), and in some cases this is problematic. For example, when 
> Tika is used downstream, it processes the document as an XML document and, 
> most of the time, a parse exception is raised.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
