Hi Rupert,

Thanks for your hints. The problem was my incorrect usage of curl: instead of --data @file.jpg I have to use --data-binary @file.jpg.
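
For the record, here is a rough sketch of why this matters. The curl man page documents that --data @file strips carriage returns and newlines from the file, while --data-binary sends it untouched. The Python below only emulates that stripping on a toy JPEG-like byte string; `strip_like_curl_data` and `first_segment_fits` are hypothetical helpers written for this illustration, not part of curl, Tika or Stanbol, and the length check is a simplified stand-in for what a JPEG segment reader does.

```python
import struct


def strip_like_curl_data(payload: bytes) -> bytes:
    """Hypothetical helper: emulate the CR/LF stripping documented for `curl --data @file`."""
    return payload.replace(b"\r", b"").replace(b"\n", b"")


def first_segment_fits(jpeg: bytes) -> bool:
    """Simplified check: does the first segment after the SOI marker fit in the stream?

    Layout assumed here: bytes 0-1 SOI (FF D8), bytes 2-3 segment marker,
    bytes 4-5 big-endian segment length (the length counts its own two bytes).
    """
    if len(jpeg) < 6:
        return False
    (length,) = struct.unpack(">H", jpeg[4:6])
    return 4 + length <= len(jpeg)


# Toy data, not a real image: SOI + APP1 segment + EOI.
# The segment body deliberately contains CR and LF bytes.
body = b"Exif\x00\x00line1\r\nline2\n"
original = b"\xff\xd8\xff\xe1" + struct.pack(">H", len(body) + 2) + body + b"\xff\xd9"

mangled = strip_like_curl_data(original)

print(len(original), first_segment_fits(original))
print(len(mangled), first_segment_fits(mangled))
```

After the stripping, the declared segment length no longer matches the number of bytes actually present, which is consistent with the "segment size would extend beyond file stream length" message in the trace below.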
Reto

On Sun, Oct 7, 2012 at 10:38 AM, Rupert Westenthaler <[email protected]> wrote:

> Hi Reto,
>
> Normally it is not a problem if a parsed content does not contain any
> plain text. There is even a unit test for the TikaEngine that tests
> EXIF metadata extraction for JPEG images (see
> TikaEngineTest#testExifMetadata).
>
> Because of that I assume that the library used by Tika does have some
> problem with your image. In fact TIKA-609 mentions a similar exception
> and the first comment suggests an illegal char encoding as the cause (which
> might make sense, because this could cause a different number of bytes
> to be read from the stream).
>
> I would suggest testing your image directly with Tika 1.2 to see if
> you can reproduce the error.
>
> best
> Rupert
>
> On Sat, Oct 6, 2012 at 2:48 PM, Reto Bachmann-Gmür <[email protected]> wrote:
> > Hello
> >
> > I thought that adding an engine that extracts XMP metadata and converts EXIF
> > data to XMP would be pretty straightforward (especially since Clerezza
> > provides a bundle with such utilities).
> >
> > However, I've noticed that the Tika engine already processes JPEGs, but for
> > the JPEG I've been testing it with I get:
> >
> > <h3>Caused by:</h3><pre>org.apache.stanbol.enhancer.servicesapi.EngineException: Unable to convert ContentItem <urn:content-item-sha1-13b7a6ca2636d1e1e8d36b4bc69d623947a6acb7> with mimeType 'image/jpeg' to plain text!
> >     at org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:222)
> >     at org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.processEvent(EnhancementJobHandler.java:259)
> >     at org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.handleEvent(EnhancementJobHandler.java:181)
> >     at org.apache.felix.eventadmin.impl.tasks.HandlerTaskImpl.execute(HandlerTaskImpl.java:88)
> >     at org.apache.felix.eventadmin.impl.tasks.SyncDeliverTasks.execute(SyncDeliverTasks.java:221)
> >     at org.apache.felix.eventadmin.impl.tasks.AsyncDeliverTasks$TaskExecuter.run(AsyncDeliverTasks.java:110)
> >     at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
> >     at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.tika.exception.TikaException: Can't read JPEG metadata
> >     at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
> >     at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >     at org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:220)
> >     ... 7 more
> > Caused by: com.drew.imaging.jpeg.JpegProcessingException: segment size would extend beyond file stream length
> >     at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
> >     at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
> >     at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:94)
> >     ... 13 more
> > </pre>
> > <h3>Caused by:</h3><pre>org.apache.tika.exception.TikaException: Can't read JPEG metadata
> >     at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
> >     at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >
> > Now it's not surprising that a JPEG cannot be converted to plain text, but
> > why does Tika attempt it in the first place, and why can't the JPEG metadata
> > be read?
> >
> > Any ideas?
> >
> > Cheers,
> > Reto
> >
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
>
