[
https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919787#comment-13919787
]
John Hewson edited comment on PDFBOX-1956 at 3/4/14 7:15 PM:
-------------------------------------------------------------
{quote}
Do you know how I can check for a problem like this in a PDF? Is it possible
to check for it with PDFBox?
{quote}
We see PDFs like this fairly often. The problem is that the text embedded in
the PDF is perfectly valid; it's just that the font's encoding is meaningless
to a human. The embedded font maps the character to a glyph which is
obviously the letter "P", but we have no way to know this, as the glyph claims
to be .
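As a rough illustration of one way this can happen, the sketch below assumes an extractor that resolves a character code to a glyph name and then to Unicode. The class name, the lookup table, and the glyph name "numbersign" are all hypothetical and are not taken from the attached PDFs or from PDFBox itself.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphNameDemo {
    // Hypothetical glyph-name-to-Unicode table (illustration only).
    static final Map<String, Character> GLYPH_TO_UNICODE = new HashMap<>();
    static {
        GLYPH_TO_UNICODE.put("P", 'P');
        GLYPH_TO_UNICODE.put("numbersign", '#');
    }

    // Resolve a character code the way a text extractor might:
    // code -> glyph name (from the font's encoding) -> Unicode.
    // The glyph outline (what a human sees) is never consulted.
    static char extract(Map<Integer, String> fontEncoding, int code) {
        String glyphName = fontEncoding.get(code);
        Character c = GLYPH_TO_UNICODE.get(glyphName);
        return c != null ? c : '\uFFFD'; // replacement character if unknown
    }

    public static void main(String[] args) {
        // A well-behaved font: the glyph that draws "P" is named "P".
        Map<Integer, String> wellBehaved = new HashMap<>();
        wellBehaved.put(1, "P");

        // A mislabeled font: the outline still draws "P" on the page,
        // but the glyph claims to be something else (assumed name).
        Map<Integer, String> mislabeled = new HashMap<>();
        mislabeled.put(1, "numbersign");

        System.out.println(extract(wellBehaved, 1)); // prints "P"
        System.out.println(extract(mislabeled, 1));  // prints "#"
    }
}
```

Both fonts render identically on screen; only the extracted text differs, which is why the file itself cannot be called invalid.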
To detect PDFs with this problem, you could try running the extracted text
through https://code.google.com/p/language-detection/ and see if the language
identified is what you were expecting. Let me know if you try this and it works.
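The idea can be sketched without any external dependency. The example below is not the language-detection library mentioned above; it is a much cruder stand-in that counts common English stopwords in the extracted text. The class name, the word list, and the 0.15 threshold are assumptions chosen for illustration, and it only makes sense if you expect English text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ExtractedTextCheck {
    // Small, assumed list of very common English words.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "in", "is", "that", "it", "for", "with", "a", "on"));

    // Fraction of word tokens that are common English stopwords.
    static double stopwordRatio(String text) {
        // Split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int words = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            words++;
            if (STOPWORDS.contains(token)) {
                hits++;
            }
        }
        return words == 0 ? 0.0 : (double) hits / words;
    }

    // Assumed threshold: natural English text usually scores well above it.
    static boolean looksLikeEnglish(String text) {
        return stopwordRatio(text) >= 0.15;
    }

    public static void main(String[] args) {
        String readable = "The text of the document is in English and it is easy to read.";
        String garbled = "Wkh whaw ri wkh grfxphqw lv lq Hqjolvk";
        System.out.println(looksLikeEnglish(readable)); // prints "true"
        System.out.println(looksLikeEnglish(garbled));  // prints "false"
    }
}
```

In practice you would pass in the String returned by PDFTextStripper.getText and flag the PDF for manual review when the check fails. A real language detector should be far more robust than this heuristic, especially for short or multilingual documents.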
> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
> Key: PDFBOX-1956
> URL: https://issues.apache.org/jira/browse/PDFBOX-1956
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.4
> Environment: Windows
> Reporter: Vicente
> Priority: Minor
> Labels: parser
> Attachments: example b.pdf, itext_pdfabc-sample.pdf
>
>
> I am trying to convert PDFs to TXT, and for some PDFs the resulting String
> contains wrong characters. Could this be a Unicode problem? Can somebody help me?
> I observed the problem when converting a PDF created by PDFCreator to
> text. The characters are wrong. Any suggestions?
> The code:
> import java.io.File;
> import java.io.FileInputStream;
>
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDDocumentInformation;
> import org.apache.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser {
>
>     PDFParser parser;
>     String parsedText;
>     PDFTextStripper pdfStripper;
>     PDDocument pdDoc;
>     COSDocument cosDoc;
>     PDDocumentInformation pdDocInfo;
>
>     // PDFTextParser Constructor
>     public PDFTextParser() {
>     }
>
>     // Extract text from PDF Document; returns null on failure
>     public String pdftoText(String fileName) {
>
>         System.out.println("Parsing text from PDF file " + fileName + "....");
>         File f = new File(fileName);
>
>         if (!f.isFile()) {
>             System.out.println("File " + fileName + " does not exist.");
>             return null;
>         }
>
>         // Forget documents left over from a previous call
>         pdDoc = null;
>         cosDoc = null;
>
>         try {
>             parser = new PDFParser(new FileInputStream(f));
>         } catch (Exception e) {
>             System.out.println("Unable to open PDF Parser.");
>             return null;
>         }
>
>         try {
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdfStripper = new PDFTextStripper();
>             pdDoc = new PDDocument(cosDoc);
>             parsedText = pdfStripper.getText(pdDoc);
>         } catch (Exception e) {
>             System.out.println("An exception occurred in parsing the PDF Document.");
>             e.printStackTrace();
>             return null;
>         } finally {
>             // Close the document on success as well as on failure
>             try {
>                 if (pdDoc != null) pdDoc.close();
>                 else if (cosDoc != null) cosDoc.close();
>             } catch (Exception e1) {
>                 e1.printStackTrace();
>             }
>         }
>         System.out.println("Done.");
>         return parsedText;
>     }
> }
>