[ 
https://issues.apache.org/jira/browse/PDFBOX-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015572#comment-18015572
 ] 

Tilman Hausherr commented on PDFBOX-6054:
-----------------------------------------

PDFBox logs a message when no Unicode mapping is available for a glyph. That 
message does occur with this file, but only for one glyph. The fonts in this 
file do provide Unicode mappings, but the mappings are wrong.

Regarding your question: you could do what Tika does, which is to override 
showGlyph(). This is for 3.0:
{code:java}
    // The counters and flags used below are fields of the enclosing class
    // (declarations not shown).
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code,
                             Vector displacement) throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, displacement);
        // Count glyphs for which the font yields no Unicode text.
        String unicode = font.toUnicode(code);
        if (unicode == null || unicode.isEmpty()) {
            unmappedUnicodeCharsPerPage++;
            totalUnmappedUnicodeCharacters++;
        }
        totalCharsPerPage++;
        totalCharacters++;

        // Remember whether any font was damaged or not embedded.
        if (font.isDamaged()) {
            containsDamagedFont = true;
        }
        if (!font.isEmbedded()) {
            containsNonEmbeddedFont = true;
        }
    }
{code}
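
The counters in the snippet above could be turned into a simple "fraction of 
unmapped glyphs" check. The following is only a minimal sketch in plain Java: 
UnmappedGlyphStats, its methods and the threshold are hypothetical names made 
up for illustration and are not part of PDFBox or Tika. Note that it can only 
flag glyphs with no mapping at all; it would not catch the attached file, where 
the mappings exist but are wrong.

```java
/**
 * Hypothetical helper (not part of PDFBox or Tika): accumulates the
 * per-glyph results that a showGlyph() override would feed it.
 */
final class UnmappedGlyphStats {
    private long totalGlyphs;
    private long unmappedGlyphs;

    /** Record one glyph; the argument is meant to be font.toUnicode(code). */
    void record(String unicode) {
        totalGlyphs++;
        // Same test as in the override: null or empty means "no mapping".
        if (unicode == null || unicode.isEmpty()) {
            unmappedGlyphs++;
        }
    }

    long total() {
        return totalGlyphs;
    }

    long unmapped() {
        return unmappedGlyphs;
    }

    /** Fraction of glyphs without a mapping, or 0.0 if nothing was recorded. */
    double unmappedRatio() {
        return totalGlyphs == 0 ? 0.0 : (double) unmappedGlyphs / totalGlyphs;
    }

    /** True if the unmapped fraction is strictly above a caller-chosen threshold. */
    boolean exceeds(double threshold) {
        return unmappedRatio() > threshold;
    }
}
```

In a showGlyph() override this would presumably be fed with 
stats.record(font.toUnicode(code)), and the caller could inspect 
stats.unmappedRatio() after text extraction. What threshold counts as "too 
many" is an assumption left to the application.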


> Enable API support to check when text is scrambled and/or if some of the 
> unicode mapping warnings happen
> --------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6054
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6054
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>         Environment: Linux / JDK 21 / Docker
> Windows / JDK 21
>            Reporter: Marc Reichman
>            Priority: Minor
>         Attachments: 7E32D4EAD8382000E24D9967C1913F6E.pdf
>
>
> With the attached PDF, there is plenty of gibberish in the text extraction. I 
> have seen other issues mention this, but in this particular case it displays 
> perfectly fine in Edge or Chrome. I have opened it in the pdf debugger but 
> it's hard to figure out what I'm looking at.
>  
> The pdftotext tool from xpdf generates the same. Interestingly, the pdffonts 
> tool does not show any fonts as "problem".
>  
> I understand this will happen and that it's due to PDF generation bugs, such 
> as not including proper unicode translators, but I am curious: could we check 
> a property or get a specific exception when unicode mapping is not available? 
> I'm not sure if that's overcorrecting, i.e. whether unicode mapping failures 
> are just a normal fact of life.


