[ https://issues.apache.org/jira/browse/TIKA-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-609.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.0
Assignee: Jukka Zitting
Fixed in revision 1079837 by catching and simply ignoring problems with
embedded XMP metadata. In this case, as noted in the error message, the
embedded XMP stream could not be parsed due to an invalid character reference.
I also changed how the JPEG stream is consumed. The parser reads it twice, and
that was the source of the different behavior you encountered when accessing
the file locally or over HTTP.
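
As a rough illustration of the two changes described above (this is not the
code from revision 1079837), here is a minimal sketch that uses only JDK
classes. The class name TolerantXmpSketch and the helpers buffer and
tryParseXmp are hypothetical. The JDK DOM parser stands in for jempbox, and an
in-memory byte array stands in for whatever buffering the actual fix uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical sketch, not Tika code: shows (1) buffering a stream so it can
// be read twice and (2) ignoring an unparseable XMP packet.
final class TolerantXmpSketch {

    // Reads the whole stream into memory. A one-shot stream (for example an
    // HTTP response body) cannot be rewound, so each later pass wraps the
    // returned bytes in its own ByteArrayInputStream.
    static byte[] buffer(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Stand-in for the XMP step: returns the root element name, or null when
    // the packet cannot be parsed. The failure is swallowed on purpose so that
    // a broken XMP packet does not abort the rest of the parse.
    static String tryParseXmp(byte[] xmp) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp));
            return doc.getDocumentElement().getNodeName();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return null; // ignore problems with embedded XMP metadata
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample input: &#x1; is not a legal XML 1.0 character reference.
        byte[] source = "<x>&#x1;</x>".getBytes(StandardCharsets.UTF_8);
        byte[] data = buffer(new ByteArrayInputStream(source));

        // First pass over the buffered bytes (stand-in for the EXIF step).
        System.out.println("bytes buffered: " + data.length);
        // Second pass over the same bytes (stand-in for the XMP step).
        System.out.println("XMP root: " + tryParseXmp(data));
    }
}
```

Buffering in memory keeps the sketch short. For large images, spooling to a
temporary file would be the more careful choice.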
> IOException from jempbox
> ------------------------
>
> Key: TIKA-609
> URL: https://issues.apache.org/jira/browse/TIKA-609
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Erik Hetzner
> Assignee: Jukka Zitting
> Fix For: 1.0
>
>
> {noformat}
> $ java -jar tika-app-0.9.jar ChateauFrontenacQC.jpg
> [Fatal Error] :47:39: Character reference "" is an invalid XML character.
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.jpeg.JpegParser@17353249
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: java.io.IOException: Character reference "" is an invalid XML character.
> at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:100)
> at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:538)
> at org.apache.tika.parser.image.xmp.JempboxExtractor.parse(JempboxExtractor.java:59)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:69)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> ... 5 more
> {noformat}
> Interestingly, accessing via HTTP gives a different error:
> {noformat}
> $ java -jar tika-app-0.9.jar http://www.aace.org/conf/cities/quebecCity/ChateauFrontenacQC.jpg
> Exception in thread "main" org.apache.tika.exception.TikaException: Can't read JPEG metadata
> at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:92)
> at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:66)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: com.drew.imaging.jpeg.JpegProcessingException: segment size would extend beyond file stream length
> at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
> at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
> at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
> at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:87)
> ... 7 more
> {noformat}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira