[jira] [Commented] (TIKA-3027) Consider using html parser instead of xml parser for epub contents

Tim Allison (Jira) Fri, 17 Jan 2020 12:56:11 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018310#comment-17018310
 ]


Tim Allison commented on TIKA-3027:
-----------------------------------

FBReader has no problem with at least one of these files.

> Consider using html parser instead of xml parser for epub contents
> ------------------------------------------------------------------
>
>                 Key: TIKA-3027
>                 URL: https://issues.apache.org/jira/browse/TIKA-3027
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: testEPUB_html.epub
>
>
> We have a good number of files in our regression set whose content "xhtml" 
> files cause problems for the XML parser.  Should we switch to the HTMLParser?
>  
> To name a few:
> {noformat}
> commoncrawl3/6H/6HAGP5DFUKFYPUAUBPZ6NX54LUT6H5YO
> commoncrawl3/LR/LR53ZVY5VR4BILUK27LGKROTBMVQ4YMV
> commoncrawl3/Q4/Q4F2HATL7V5A6AYDJKZYNXV4AU6NXRMX
> commoncrawl3/7I/7I6CKCIX75V22UNG7YPUVL6O2F3WVUTF
> commoncrawl3/PF/PFYKV55F57N46PQJXAPZDEXCGJ54W26N
> commoncrawl3/QK/QKVFV2QCCPXCQT27ZKRTOTTA5PHLFLIE
> commoncrawl3/XB/XBUNGEOTNUBZ4EDHIEXRR5NW2PWF4WNN
> commoncrawl3/72/72CJJQCXYVNIBX6O2M2AEJOHUZJUK625 {noformat}
> I'm attaching a 6HA... renamed.
>  
> The few that I've tried to open in iBooks cause errors in iBooks and don't 
> open at all.  Will try a few other readers.
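
As a rough illustration of the trade-off described in the issue, the sketch below feeds the same not-quite-XHTML markup to a strict XML parser and to a lenient HTML parser. It is written in Python with only the standard library and is not Tika code. The sample markup and the names BROKEN_XHTML, TextCollector, parse_as_xml and parse_as_html are hypothetical; the markup is not taken from the regression files listed above, and the real files may fail for other reasons.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: markup that presents itself as XHTML but is not
# well-formed XML (unclosed <br> and <p>, plus the HTML-only entity &nbsp;).
BROKEN_XHTML = (
    "<html><body>"
    "<p>Chapter&nbsp;1<br>"
    "<p>It was a dark and stormy night."
    "</body></html>"
)


class TextCollector(HTMLParser):
    """Collects character data and does not care whether tags balance."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_xml(markup):
    """Strict parse: return the text, or None if the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return "".join(root.itertext())


def parse_as_html(markup):
    """Lenient parse: return whatever text the HTML parser can recover."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print("xml :", parse_as_xml(BROKEN_XHTML))
    print("html:", parse_as_html(BROKEN_XHTML))
```

On this sample the strict parser gives up and returns None, while the lenient parser still recovers the text. Whether that leniency is acceptable for epub contents is the open question in this issue.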



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
