[ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ajay Vohra updated TIKA-584:
----------------------------
Attachment: JavaEE6Tutorial.pdf
Please try parsing the attached PDF file with the example code below; the
resulting extracted text contains no spaces:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.Reader;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class TestTikaExtractor {
    // Usage: java TestTikaExtractor <input.pdf> <output.txt>
    public static void main(String[] args) {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        // try-with-resources closes the streams and flushes the writer;
        // without the close, the output file can end up empty or truncated.
        try (InputStream in = new FileInputStream(args[0]);
             Reader reader = tika.parse(in, metadata);
             PrintWriter pw = new PrintWriter(new File(args[1]))) {
            // Print every metadata field with all of its values.
            for (String name : metadata.names()) {
                System.out.print("Metadata: host:" + name + ":"
                        + metadata.isMultiValued(name) + "[");
                for (String value : metadata.getValues(name)) {
                    System.out.print(value);
                }
                System.out.println("]");
            }
            // Copy the extracted text to the output file.
            char[] cbuf = new char[1024];
            int nread;
            while ((nread = reader.read(cbuf, 0, cbuf.length)) != -1) {
                pw.write(cbuf, 0, nread);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
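Since the issue reports that many PDF files are affected, it may help to flag suspicious output files automatically rather than inspecting each one by eye. The following is only a rough sketch using the Java standard library: the class name SpaceCheck, its method names, the 200-character minimum, and the 5% threshold are all assumptions made for illustration and are not part of Tika or PDFBox. It counts plain space characters in already-extracted text and flags files where they are almost absent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of Tika): flags extracted text that
// contains almost no space characters.
final class SpaceCheck {

    /** Fraction of characters that are plain spaces; 0.0 for empty text. */
    public static double spaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long spaces = text.chars().filter(c -> c == ' ').count();
        return (double) spaces / text.length();
    }

    /**
     * True when the text is at least minLength characters long and its
     * space ratio is below threshold. Both limits are assumptions that
     * the caller should tune for the documents at hand.
     */
    public static boolean looksSpaceless(String text, int minLength, double threshold) {
        return text.length() >= minLength && spaceRatio(text) < threshold;
    }

    // Usage: java SpaceCheck out1.txt out2.txt ...
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            String text = new String(Files.readAllBytes(Paths.get(path)),
                    StandardCharsets.UTF_8);
            String verdict = looksSpaceless(text, 200, 0.05) ? "SUSPECT" : "ok";
            System.out.println(path + ": " + verdict);
        }
    }
}
```

Running it over the files written by TestTikaExtractor gives a quick list of candidates to compare against direct PDFTextStripper output. Only the space character is counted, because text affected by this bug may still contain line breaks.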
> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
> Key: TIKA-584
> URL: https://issues.apache.org/jira/browse/TIKA-584
> Project: Tika
> Issue Type: Bug
> Affects Versions: 0.8
> Environment: Windows XP 3, OpenSuse 11.2
> Reporter: Ajay Vohra
> Attachments: JavaEE6Tutorial.pdf
>
>
> For some PDF files (not all), when the Tika.parse(InputStream) method is
> used, the content extracted from the returned reader has all spaces
> removed. One example is JavaEE6Tutorial.pdf (available from Oracle); many
> other files show the same bug. I have also tried the Tika 0.9 snapshot, and
> the bug remains.
> When PDFTextStripper is used directly, the extracted content is correct, with
> the spaces between words retained.