[ 
https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631841#comment-17631841
 ] 

Tilman Hausherr commented on PDFBOX-5540:
-----------------------------------------

Proposed change. It speculates that when the font has an encoding dictionary 
with a Differences entry, the identity CMap workaround shouldn't be used:
{code:java}
if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding))
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}
{code}
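
To make the intended behaviour easier to check, here is a minimal standalone sketch of the same decision, written without PDFBox so it can be run on its own. The class name IdentityFallbackSketch, both method names, and the use of a plain Map in place of COSDictionary are hypothetical and are not PDFBox API. The null guards on cmapName and ordering are also an assumption that the snippet above does not make.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the check in the proposed change; not PDFBox API.
final class IdentityFallbackSketch
{
    private IdentityFallbackSketch()
    {
    }

    // Mirrors the outer condition: anything that names an identity mapping.
    // Null guards are an assumption added for this sketch.
    static boolean looksLikeIdentity(String cmapName, String ordering, String encodingName)
    {
        return (cmapName != null && cmapName.contains("Identity"))
                || (ordering != null && ordering.contains("Identity"))
                || "Identity-H".equals(encodingName)
                || "Identity-V".equals(encodingName);
    }

    // Mirrors the inner condition: skip the workaround when the encoding
    // dictionary (modelled here as a plain Map) has a Differences entry.
    static boolean shouldUseIdentityFallback(String cmapName, String ordering,
            String encodingName, Map<String, ?> encodingDict)
    {
        if (!looksLikeIdentity(cmapName, ordering, encodingName))
        {
            return false;
        }
        return encodingDict == null || !encodingDict.containsKey("Differences");
    }

    public static void main(String[] args)
    {
        Map<String, ?> withDifferences = Collections.singletonMap("Differences", "placeholder");

        // No encoding dictionary: the workaround would still apply.
        System.out.println(shouldUseIdentityFallback("Identity-H", "", null, null));

        // Encoding dictionary with Differences: the workaround would be skipped.
        System.out.println(shouldUseIdentityFallback("Identity-H", "", null, withDifferences));
    }
}
```

The only case that changes relative to the current behaviour is the second one: an identity-looking ToUnicode CMap combined with an encoding dictionary that carries Differences no longer triggers the fallback.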

> export:text creates jibberish / malformed output
> ------------------------------------------------
>
>                 Key: PDFBOX-5540
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5540
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>         Environment: Same on Windows, Linux and macOS
>            Reporter: Alfons
>            Priority: Minor
>         Attachments: test.pdf, test.txt
>
>
> I am using PDFBox as part of Tika and some PDFs produce unreadable output. 
> Copying text from Adobe / macOS Preview / browsers works as expected.
> I have also tried "re-encoding" the PDF by editing and saving it with 
> Acrobat, thinking it could be an issue with the original PDF creator, and 
> tried using PDFBox with different encodings, but the output mostly remained 
> unchanged.
> I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf          
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> {code}


