[jira] Updated: (TIKA-584) Tika parse of some PDF files removes all spaces between words

Ajay Vohra (JIRA) Sun, 16 Jan 2011 19:46:10 -0800

     [ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ajay Vohra updated TIKA-584:
----------------------------

    Attachment: JavaEE6Tutorial.pdf

Please try parsing the attached PDF file with the example code below; the
resulting extracted text contains no spaces:

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.io.Reader;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class TestTikaExtractor {

        // Usage: java TestTikaExtractor <input.pdf> <output.txt>
        public static void main(String[] args) {
                Tika tika = new Tika();
                Metadata metadata = new Metadata();

                // Closing the writer flushes the extracted text to the output file.
                try (Reader reader = tika.parse(new FileInputStream(args[0]), metadata);
                     PrintWriter pw = new PrintWriter(new File(args[1]))) {

                        // Print every metadata name with its multi-valued flag and values.
                        for (String name : metadata.names()) {
                                System.out.print("Metadata: host:" + name + ":"
                                                + metadata.isMultiValued(name) + "[");
                                for (String value : metadata.getValues(name)) {
                                        System.out.print(value);
                                }
                                System.out.println("]");
                        }

                        // Copy the extracted text to the output file.
                        char[] cbuf = new char[1024];
                        int nread;
                        while ((nread = reader.read(cbuf, 0, cbuf.length)) > 0) {
                                pw.write(cbuf, 0, nread);
                        }
                } catch (Exception e) {
                        e.printStackTrace();
                }
        }
}

> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
>                 Key: TIKA-584
>                 URL: https://issues.apache.org/jira/browse/TIKA-584
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Windows XP 3, OpenSuse 11.2
>            Reporter: Ajay Vohra
>         Attachments: JavaEE6Tutorial.pdf
>
>
> For some PDF files (not all), the content read from the Reader returned by
> Tika.parse(InputStream) has all spaces between words removed. One example is
> JavaEE6Tutorial.pdf (available from Oracle); there are many such files where
> this bug can be seen. I have also tried the Tika 0.9 snapshot and the bug
> remains.
> When PDFTextStripper is used directly, the extracted content is correct, with
> the spaces between words retained.
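
Since the report says many files are affected, a quick way to flag them might
be to measure how much of the extracted text is whitespace. The following is
only a sketch using the JDK alone: the class name WhitespaceCheck and the
method whitespaceRatio are hypothetical and not part of Tika or PDFBox. It
takes any Reader, so the Reader returned by Tika.parse(...) could be passed in
the same way as the StringReader used here. A ratio at or near zero for a
prose document would match the symptom described above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Tika: reports the whitespace share of a text stream.
public class WhitespaceCheck {

    // Returns the fraction of characters that are whitespace, or 0.0 for empty input.
    public static double whitespaceRatio(Reader reader) throws IOException {
        long total = 0;
        long whitespace = 0;
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                if (Character.isWhitespace(buf[i])) {
                    whitespace++;
                }
            }
            total += n;
        }
        return total == 0 ? 0.0 : (double) whitespace / total;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample strings standing in for good and bad extraction output.
        double normal = whitespaceRatio(new StringReader("The Java EE 6 Tutorial"));
        double broken = whitespaceRatio(new StringReader("TheJavaEE6Tutorial"));
        System.out.println("normal=" + normal + " broken=" + broken);
    }
}
```

Any cutoff used to decide that a file is affected would be a judgment call;
none is suggested by the issue itself.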

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
