[jira] [Commented] (TIKA-1164) InputStream get modified by content type detection

Tim Allison (JIRA) Fri, 08 Jul 2016 12:33:24 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368275#comment-15368275
 ]


Tim Allison commented on TIKA-1164:
-----------------------------------

For anyone stumbling across this issue: it is expected that detection reads 
bytes from the underlying stream.  If the underlying stream does not support 
mark/reset, it will be missing those bytes after detection, which is what you 
see when you call available() on it.  The key is to keep using the buffered 
stream/TikaInputStream after detection, not the underlying stream.  
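
The same effect can be reproduced with plain java.io, without Tika.  The 
sketch below is only an illustration: peekAndReset() is a hypothetical 
stand-in for what a detector does (mark, read some leading bytes, reset), not 
Tika's actual code, and the class and method names are made up for this 
example.  It assumes a BufferedInputStream wrapped around a raw stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class MarkResetDemo {

    /** Hypothetical stand-in for a detector: mark, read up to peek bytes, reset. */
    static int peekAndReset(InputStream in, int peek) throws IOException {
        in.mark(peek);
        try {
            byte[] header = new byte[peek];
            int n = 0;
            while (n < peek) {
                int r = in.read(header, n, peek - n);
                if (r < 0) {
                    break; // end of stream before peek bytes were read
                }
                n += r;
            }
            return n;
        } finally {
            // For a BufferedInputStream this rewinds the wrapper's own buffer
            // position; the stream it wraps is not rewound.
            in.reset();
        }
    }

    /** Returns { raw.available(), buffered.available() } after one peek. */
    static int[] availableAfterPeek(int total, int peek) {
        InputStream raw = new ByteArrayInputStream(new byte[total]);
        try (InputStream buffered = new BufferedInputStream(raw)) {
            peekAndReset(buffered, peek);
            return new int[] { raw.available(), buffered.available() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] avail = availableAfterPeek(100_000, 1_000);
        System.out.println("avail after raw:      " + avail[0]);
        System.out.println("avail after buffered: " + avail[1]);
    }
}
```

The wrapper still reports every byte as available because the bytes it pulled 
from the raw stream sit in its buffer; the raw stream has already handed them 
over and reports fewer.  How many fewer depends on the wrapper's buffer size.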

Not great:
{noformat}
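        import java.io.File;
        import java.io.FileInputStream;
        import java.io.InputStream;
        import org.apache.tika.config.TikaConfig;
        import org.apache.tika.detect.Detector;
        import org.apache.tika.io.TikaInputStream;
        import org.apache.tika.metadata.Metadata;

        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        File file = new File("testPDFVarious.pdf");
        try (FileInputStream is = new FileInputStream(file)) {
            try (InputStream tis = TikaInputStream.get(is)) {
                System.out.println("length: " + file.length());
                System.out.println("avail before: " + tis.available());
                System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
                // tis was reset after detection; the wrapped stream (is) was not
                System.out.println("avail after tis: " + tis.available());
                System.out.println("avail after is: " + is.available());
            }
        }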
{noformat}
length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
avail after is: 139955

Better: call TikaInputStream.get() directly on the file (if you're processing 
files), so your code never holds a separate underlying stream.
{noformat}
        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        File file = new File("testPDFVarious.pdf");
        try (InputStream tis = TikaInputStream.get(file)) {
            System.out.println("length: " + file.length());
            System.out.println("avail before: " + tis.available());
            System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
            System.out.println("avail after tis: " + tis.available());
        }
{noformat}
length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491

> InputStream get modified by content type detection
> --------------------------------------------------
>
>                 Key: TIKA-1164
>                 URL: https://issues.apache.org/jira/browse/TIKA-1164
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.4
>         Environment: Windows 7 / Eclipse Kepler / Tomcat 7 / JavaSE 7
>            Reporter: Joël Royer
>            Priority: Blocker
>
> I'm using Tika for content type detection after file upload.
> After tika detection, file content is modified (not the same size compared to 
> original uploaded file).
> Here is my code:
> {code}
> AutoDetectParser parser = new AutoDetectParser();
> Detector detector = parser.getDetector();
> Metadata md = new Metadata();
> md.add(Metadata.RESOURCE_NAME_KEY, uploadedFilename);
> md.add(Metadata.CONTENT_TYPE, uploadedFileContentType);
> MediaType type = detector.detect(new BufferedInputStream(is), md);
> {code}
> Before detection, file size is correct.
> After detection, file size is lower than original.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
