[ https://issues.apache.org/jira/browse/TIKA-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-609.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

Fixed in revision 1079837 by catching and simply ignoring problems with 
embedded XMP metadata. In this case, as noted in the error message, the 
embedded XMP stream could not be parsed due to an invalid character reference.
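
To illustrate the "catch and ignore" approach described above, here is a minimal sketch that uses only the JDK's XML parser. It is not the code from revision 1079837 and it does not use the jempbox API; the class and method names (LenientXmpSketch, parseXmpLeniently) are hypothetical. It assumes that a failure to parse the XMP packet should simply yield "no XMP metadata" rather than abort the whole parse.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; names are illustrative and not taken from Tika or jempbox.
class LenientXmpSketch {

    // Parses an XMP packet; returns Optional.empty() if the packet is unreadable
    // instead of letting the exception propagate to the caller.
    static Optional<Document> parseXmpLeniently(InputStream xmp) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return Optional.of(builder.parse(xmp));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Assumption: broken XMP is not worth failing for; skip it.
            return Optional.empty();
        }
    }

    private static InputStream utf8(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "&#x5;" refers to a character that XML 1.0 does not allow.
        String broken = "<xmp><title>bad &#x5; reference</title></xmp>";
        String valid = "<xmp><title>fine</title></xmp>";
        System.out.println("broken parsed: " + parseXmpLeniently(utf8(broken)).isPresent());
        System.out.println("valid parsed: " + parseXmpLeniently(utf8(valid)).isPresent());
    }
}
```

With this shape, a caller can go on to extract the remaining image metadata even when the embedded XMP packet is malformed.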

I also changed how the JPEG stream is consumed twice, which was the source 
of the different behavior you encountered when accessing the file locally 
versus over HTTP.
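
As background on why reading a stream twice is fragile: a plain InputStream, such as an HTTP response body, can only be read once, so a second pass over the same stream sees no data. One common remedy is to spool the stream to a temporary file and open a fresh stream for each pass. The sketch below shows that idea using only the JDK; it is an assumption about the general technique, not the change actually made in Tika, and every name in it (TwoPassStreamSketch, countBytes, readTwiceNaive, readTwiceSpooled) is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch; names are illustrative and not taken from Tika.
class TwoPassStreamSketch {

    // Stand-in for a metadata extractor: consumes the stream and counts its bytes.
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    // Naive approach: both passes share one stream, so the second finds it exhausted.
    static int[] readTwiceNaive(InputStream source) {
        try {
            return new int[] { countBytes(source), countBytes(source) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Spooled approach: copy the stream to a temporary file, then give each pass
    // its own fresh stream positioned at the start of the data.
    static int[] readTwiceSpooled(InputStream source) {
        try {
            Path spool = Files.createTempFile("jpeg-spool-", ".tmp");
            try {
                Files.copy(source, spool, StandardCopyOption.REPLACE_EXISTING);
                int first;
                int second;
                try (InputStream pass1 = Files.newInputStream(spool)) {
                    first = countBytes(pass1);
                }
                try (InputStream pass2 = Files.newInputStream(spool)) {
                    second = countBytes(pass2);
                }
                return new int[] { first, second };
            } finally {
                Files.deleteIfExists(spool);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { 1, 2, 3, 4, 5 };
        int[] naive = readTwiceNaive(new ByteArrayInputStream(data));
        int[] spooled = readTwiceSpooled(new ByteArrayInputStream(data));
        System.out.println("naive: " + naive[0] + ", " + naive[1]);
        System.out.println("spooled: " + spooled[0] + ", " + spooled[1]);
    }
}
```

In the naive variant the second pass reads zero bytes, while the spooled variant gives both passes the full five bytes regardless of where the data originally came from.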

> IOException from jempbox
> ------------------------
>
>                 Key: TIKA-609
>                 URL: https://issues.apache.org/jira/browse/TIKA-609
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Erik Hetzner
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> {noformat}
> $ java -jar tika-app-0.9.jar ChateauFrontenacQC.jpg 
> [Fatal Error] :47:39: Character reference "&#x5" is an invalid XML character.
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.jpeg.JpegParser@17353249
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: java.io.IOException: Character reference "&#x5" is an invalid XML character.
>       at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:100)
>       at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:538)
>       at org.apache.tika.parser.image.xmp.JempboxExtractor.parse(JempboxExtractor.java:59)
>       at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:69)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       ... 5 more
> {noformat}
> Interestingly, accessing via HTTP gives a different error:
> {noformat}
> $ java -jar tika-app-0.9.jar http://www.aace.org/conf/cities/quebecCity/ChateauFrontenacQC.jpg
> Exception in thread "main" org.apache.tika.exception.TikaException: Can't read JPEG metadata
>       at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:92)
>       at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:66)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
> Caused by: com.drew.imaging.jpeg.JpegProcessingException: segment size would extend beyond file stream length
>       at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
>       at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
>       at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
>       at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:87)
>       ... 7 more
> {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
