Tim Allison created TIKA-3027:
---------------------------------
Summary: Consider using html parser instead of xml parser for epub contents
Key: TIKA-3027
URL: https://issues.apache.org/jira/browse/TIKA-3027
Project: Tika
Issue Type: Task
Reporter: Tim Allison
Attachments: testEPUB_html.epub
We have a good number of EPUB files in our regression set whose "xhtml"
content files cause problems for the XML parser. Should we switch to the
HTMLParser?
To name a few:
{noformat}
commoncrawl3/6H/6HAGP5DFUKFYPUAUBPZ6NX54LUT6H5YO
commoncrawl3/LR/LR53ZVY5VR4BILUK27LGKROTBMVQ4YMV
commoncrawl3/Q4/Q4F2HATL7V5A6AYDJKZYNXV4AU6NXRMX
commoncrawl3/7I/7I6CKCIX75V22UNG7YPUVL6O2F3WVUTF
commoncrawl3/PF/PFYKV55F57N46PQJXAPZDEXCGJ54W26N
commoncrawl3/QK/QKVFV2QCCPXCQT27ZKRTOTTA5PHLFLIE
commoncrawl3/XB/XBUNGEOTNUBZ4EDHIEXRR5NW2PWF4WNN
commoncrawl3/72/72CJJQCXYVNIBX6O2M2AEJOHUZJUK625
{noformat}
I'm attaching 6HA..., renamed as testEPUB_html.epub.
The few that I've tried to open in iBooks cause errors and don't open at all.
I will try a few other readers.
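To make the trade-off concrete, here is a rough, JDK-only sketch of "try strict XML first, fall back to a lenient HTML parse". It assumes the failures are well-formedness errors (for example an unclosed <br>), which I have not confirmed for the files listed above. The class and method names are hypothetical, and the Swing ParserDelegator is only a stand-in for a lenient HTML parser; it is not Tika's HTMLParser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Tika.
public class EpubContentFallback {

    /** Returns true if the content is well-formed XML, false otherwise. */
    static boolean parsesAsXml(String content) {
        try {
            // DefaultHandler rethrows fatal (well-formedness) errors as SAXParseException.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(content)), new DefaultHandler());
            return true;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return false;
        }
    }

    /** Collects text nodes with the JDK's lenient HTML parser (stand-in for an HTML parser). */
    static String extractTextLeniently(String content) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        // ignoreCharSet=true: we already have decoded characters, so skip charset switching.
        new ParserDelegator().parse(new StringReader(content), callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumed example of "xhtml" that is not well-formed XML: <br> is never closed.
        String broken = "<html><body><p>one<br>two</p></body></html>";
        if (parsesAsXml(broken)) {
            System.out.println("XML parser accepted the content");
        } else {
            System.out.println("XML parser rejected the content; lenient text: "
                    + extractTextLeniently(broken));
        }
    }
}
```

Whether we would want a fallback like this or simply always use the HTML parser for epub contents is the open question in this issue.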