[
https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368275#comment-15368275
]
Tim Allison commented on TIKA-1164:
-----------------------------------
For anyone stumbling across this issue: it is expected that detection will
read bytes from the underlying stream. If the underlying stream is not
resettable, then when you check available() on it after detection, it will be
missing bytes. The key is to reuse the buffered stream/TikaInputStream, not
the underlying stream.
Not great:
{noformat}
Detector detector = TikaConfig.getDefaultConfig().getDetector();
File file = new File("testPDFVarious.pdf");
try (FileInputStream is = new FileInputStream(file)) {
    try (InputStream tis = TikaInputStream.get(is)) {
        System.out.println("length: " + file.length());
        System.out.println("avail before: " + tis.available());
        System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
        System.out.println("avail after tis: " + tis.available());
        System.out.println("avail after is: " + is.available());
    }
}
{noformat}
length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
avail after is: 139955
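The gap between the last two numbers (205491 - 139955 = 65536 bytes) is
consistent with the wrapper having pulled a block of bytes from the underlying
stream into its own buffer. The same effect can be reproduced without Tika. The
sketch below is only an illustration: BufferedInputStream stands in for
TikaInputStream, a ByteArrayInputStream stands in for the FileInputStream, and
the class and method names (StreamPeekDemo, availableAfterPeek) are made up
for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamPeekDemo {

    /**
     * Peeks at the first peekBytes bytes through a buffering wrapper, rewinds
     * the wrapper, and reports what each stream says is still available.
     * Returns {wrapper.available(), underlying.available()}.
     */
    static int[] availableAfterPeek(int totalBytes, int peekBytes) throws IOException {
        // Stand-in for the FileInputStream in the example above.
        InputStream underlying = new ByteArrayInputStream(new byte[totalBytes]);
        // Stand-in for TikaInputStream: buffers reads and supports mark/reset.
        InputStream wrapper = new BufferedInputStream(underlying);

        wrapper.mark(peekBytes);              // remember the current position
        byte[] head = new byte[peekBytes];
        wrapper.read(head, 0, peekBytes);     // "detection" looks at the head
        wrapper.reset();                      // rewinds the wrapper only

        return new int[] { wrapper.available(), underlying.available() };
    }

    public static void main(String[] args) throws IOException {
        int[] avail = availableAfterPeek(100_000, 1024);
        System.out.println("avail after wrapper:    " + avail[0]);
        System.out.println("avail after underlying: " + avail[1]);
    }
}
```

The wrapper still reports every byte because reset() rewinds into its buffer.
The underlying stream reports fewer because the buffered bytes have already
been taken from it and it has no way of getting them back.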
Better: after detection, keep reading from tis (the TikaInputStream) and
leave is alone.
Even better, call TikaInputStream.get() directly on a file (if you're
processing files).
{noformat}
Detector detector = TikaConfig.getDefaultConfig().getDetector();
File file = new File("testPDFVarious.pdf");
try (InputStream tis = TikaInputStream.get(file)) {
    System.out.println("length: " + file.length());
    System.out.println("avail before: " + tis.available());
    System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
    System.out.println("avail after tis: " + tis.available());
}
{noformat}
length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
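For the upload case in the original report, where there is no file to hand to
TikaInputStream.get(), the same rule applies: wrap the stream once and use that
one wrapper for both detection and everything that follows. A rough sketch of
the pattern, assuming Java 9 or later. looksLikePdf() is a hypothetical
stand-in for detector.detect() that only checks for the "%PDF" magic bytes,
and ReuseWrapperDemo and detectThenReadAll() are names invented for this
example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReuseWrapperDemo {

    // Hypothetical stand-in for detector.detect(): peeks at the first bytes
    // with mark/reset and leaves the stream positioned where it found it.
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        in.mark(magic.length);
        try {
            int n = in.readNBytes(head, 0, head.length);
            return n == magic.length && Arrays.equals(head, magic);
        } finally {
            in.reset();
        }
    }

    // Wraps the upload once, runs detection on the wrapper, then reads the
    // full content from the SAME wrapper, never from the raw upload stream.
    static byte[] detectThenReadAll(InputStream upload) throws IOException {
        InputStream buffered = new BufferedInputStream(upload);
        looksLikePdf(buffered);
        return buffered.readAllBytes();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 example payload".getBytes(StandardCharsets.US_ASCII);
        byte[] copy = detectThenReadAll(new ByteArrayInputStream(data));
        System.out.println("original: " + data.length + ", after detection: " + copy.length);
    }
}
```

In the reporter's snippet the BufferedInputStream is created inline in the call
to detect(), so no reference to it survives and later code can only read from
the raw stream. Holding the wrapper in a variable avoids that.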
> InputStream get modified by content type detection
> --------------------------------------------------
>
> Key: TIKA-1164
> URL: https://issues.apache.org/jira/browse/TIKA-1164
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.4
> Environment: Windows 7 / Eclipse Kepler / Tomcat 7 / JavaSE 7
> Reporter: Joël Royer
> Priority: Blocker
>
> I'm using Tika for content type detection after file upload.
> After tika detection, file content is modified (not the same size compared to
> original uploaded file).
> Here is my code:
> {code}
> AutoDetectParser parser = new AutoDetectParser();
> Detector detector = parser.getDetector();
> Metadata md = new Metadata();
> md.add(Metadata.RESOURCE_NAME_KEY, uploadedFilename);
> md.add(Metadata.CONTENT_TYPE, uploadedFileContentType);
> MediaType type = detector.detect(new BufferedInputStream(is), md);
> {code}
> Before detection, file size is correct.
> After detection, file size is lower than original.