[ https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-371.
----------------------------------
Resolution: Duplicate
I believe that this has been solved by PDFBOX-1713 and by PDFBOX-1357, which
uses a similar solution. If it still doesn't work for you, please reopen this
issue, attach a PDF, and explain what goes wrong.
> Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested
> fix)
> ---------------------------------------------------------------------------------
>
> Key: PDFBOX-371
> URL: https://issues.apache.org/jira/browse/PDFBOX-371
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.7.3
> Environment: Java 1.5, OSX 10.5
> Reporter: Robert Baruch
> Priority: Minor
>
> When running text extraction on a PDF file that contains the soft hyphen
> character in the WinAnsiEncoding (that is, 0255), the text extractor
> incorrectly maps this as a space, when it should be a hyphen. As the PDF
> Reference 1.7 says in note 5 of table D.1:
> 'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning
> of this duplicate code is "soft hyphen," but it is typographically the same
> as hyphen.'
> The reason that a soft hyphen is typographically the same as hyphen is that a
> soft hyphen indicates that a hyphen MAY be placed here if necessary (i.e.
> breaking a word across lines). Since the soft hyphen should only be put, by
> the PDF producer, at the end of a line to break a word, it stands to reason
> that the option to place a hyphen must be taken.
> I think I've traced the reason for the substitution to Encoding.getName,
> where because there is no mapping in the codeToName mapping for this code in
> WinAnsiEncoding, by default it returns "space".
> The fix is not as simple as adding an addCharacterEncoding( 0255,
> COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that will set both
> the codeToName mapping AND the nameToCode mapping, which will overwrite the
> 055 nameToCode mapping.
> Adding this line:
> codeToName.put( new Integer(0255), COSName.getPDFName("hyphen"));
> to the end of the WinAnsiEncoding constructor seems to fix the issue.
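
The difference between the two-way and the decode-only registration described
in the report can be sketched as follows. This is a standalone illustration,
not PDFBox source: the class SoftHyphenMappingSketch and the method
addDecodeOnly are hypothetical, plain strings stand in for COSName, and the
"space" fallback simply mirrors the behaviour the reporter describes for
Encoding.getName.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoding table; names are illustrative only.
class SoftHyphenMappingSketch {
    final Map<Integer, String> codeToName = new HashMap<>();
    final Map<String, Integer> nameToCode = new HashMap<>();

    // Two-way registration, as the report describes addCharacterEncoding:
    // a later call with the same name overwrites the nameToCode entry.
    void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code);
    }

    // Decode-only registration: nameToCode is left untouched.
    void addDecodeOnly(int code, String name) {
        codeToName.put(code, name);
    }

    // Assumed fallback: unmapped codes come back as "space".
    String getName(int code) {
        String name = codeToName.get(code);
        return name != null ? name : "space";
    }

    public static void main(String[] args) {
        SoftHyphenMappingSketch enc = new SoftHyphenMappingSketch();
        enc.addCharacterEncoding(055, "hyphen");   // regular hyphen, octal 055

        System.out.println(enc.getName(0255));     // soft hyphen not mapped yet

        enc.addDecodeOnly(0255, "hyphen");         // octal 0255 is 0xAD
        System.out.println(enc.getName(0255));
        // Encoding "hyphen" still yields the original code, printed in octal.
        System.out.println(Integer.toOctalString(enc.nameToCode.get("hyphen")));
    }
}
```

Calling addCharacterEncoding(0255, "hyphen") instead of addDecodeOnly would
make nameToCode.get("hyphen") return 0255, which is the overwrite the report
warns about.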