[ 
https://issues.apache.org/jira/browse/TIKA-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767125#comment-15767125
 ] 

Tim Allison commented on TIKA-2211:
-----------------------------------

That does make sense.  Unless we find the epub xhtml/html is as nasty as real 
html, I'd prefer to leave that out.

After some kicking of tires, the solution appears to be simpler.  The 
EPubContentParser was wrapping the handler in a new XHTMLContentHandler for 
each chapter.  I _think_ this prevented the BodyContentHandler from working 
properly.  The BodyContentHandler is a filter that only passes on content from 
within <body></body> elements, which keeps the <style> and <script> types of 
things that show up in <head> from entering the "content" section.

Once I removed the xhtml content handler from the EPubContentParser, all seems 
to work, and only body elements are being added to the overall output.
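
To illustrate the body-only filtering idea described above, here is a minimal 
sketch that uses only the JDK's SAX API.  It is not Tika's implementation of 
BodyContentHandler; the class BodyOnlyText and the method extractBodyText are 
hypothetical names made up for this example, and it assumes the chapter is 
well-formed XHTML with no DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: not part of Tika.
public class BodyOnlyText {

    /** Collects character data only while the parser is inside <body>. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so qName holds the tag name.
            if ("body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text in <head> (for example <style> rules) arrives here too, but is dropped.
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Returns the text found inside <body>, ignoring everything in <head>. */
    public static String extractBodyText(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            BodyTextHandler handler = new BodyTextHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse chapter", e);
        }
    }

    public static void main(String[] args) {
        String chapter = "<html><head><style>p.sgc-1 {text-align: justify;}</style></head>"
                + "<body><p>Chapter one.</p></body></html>";
        System.out.println(extractBodyText(chapter));
    }
}
```

In this sketch the style rules from <head> never reach the output because 
characters() only appends while the parser is inside <body>; this mirrors the 
kind of filtering the comment attributes to the BodyContentHandler.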

What I can't figure out is why no one has complained that the ToXML option 
didn't appear to work...at least on our one test file.  That now does work.

Also, I turned on some tests for the iBooksParser.  There's a comment in the 
test code that it didn't use to work, but it seems to be working now even 
before I made the change...not sure what was going on there.

> ePub formatting instructions appear in plain text output
> --------------------------------------------------------
>
>                 Key: TIKA-2211
>                 URL: https://issues.apache.org/jira/browse/TIKA-2211
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>         Environment: I tested this on Mac OSX 10.11.6 with Oracle JDK 
> 1.8.0_112.  The Tika stand-alone application was launched as follows:
> {code}
> java -jar tika-app-1.14.jar
> {code}
>            Reporter: Adam Carroll
>
> For some ePub files, format information appears in the plain text output 
> produced by Apache Tika.  For example the Tika stand-alone application shows 
> the following text for the file “Don Quijote de la Mancha - Miguel de 
> Cervantes.epub” (downloaded 
> [here|http://www.literanda.com/don-quijote-de-la-mancha--miguel-de-cervantes--epub]):
> {code}
> /**/
>   p.sgc-2 {font-style: italic; text-align: right}
>   p.sgc-1 {text-align: justify;}
>   h3.sgc-3 {text-align: center;}
>   /**/
> Al duque de Béjar
> Marqués de Gibraleón, conde de Benalcázar y Bañares, vizconde de La Puebla de 
> Alcocer, señor de las villas de Capilla, Curiel y Burguillos
> En fe del buen acogimiento y honra que hace Vuestra Excelencia a toda suerte 
> de libros, como príncipe tan inclinado a favorecer las buenas artes, 
> mayormente las que por su nobleza no se abaten al servicio y granjerías del 
> vulgo, he determinado de sacar a luz El ingenioso hidalgo don Quijote de la 
> Mancha, al abrigo del clarísimo nombre de Vuestra Excelencia, a quien, con el 
> acatamiento que debo a tanta grandeza, suplico le reciba agradablemente en su 
> protección, para que a su sombra, aunque desnudo de aquel precioso ornamento 
> de elegancia y erudición de que suelen andar vestidas las obras que se 
> componen en las casas de los hombres que saben, ose parecer seguramente en el 
> juicio de algunos que, conteniéndose en los límites de su ignorancia, suelen 
> condenar con más rigor y menos justicia los trabajos ajenos; que, poniendo 
> los ojos la prudencia de Vuestra Excelencia en mi buen deseo, fío que no 
> desdeñará la cortedad de tan humilde servicio.
> {code}
> To reproduce this problem run the stand-alone version of Tika and open an 
> affected ePub file such as the one mentioned above.  Then go to View -> Plain 
> Text.  You should see the problem there.
> By the way, thanks for making Apache Tika a really useful library.  Keep up 
> the good work!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
