[ https://issues.apache.org/jira/browse/PDFBOX-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017397#comment-13017397 ]
Ramesh commented on PDFBOX-371:
-------------------------------
Hi Navendu,
I saw your solution for "soft hyphen in pdf". We have this issue at our place
and I would like to implement your solution, but I am not a programmer. Would
you please let me know how to implement it? I have Acrobat versions 6, 7, 8,
and 9 on both PC and Macintosh platforms.
Thanks in advance,
Ramesh
> Soft Hyphen character not mapped to hyphen in WinAnsiEncoding (and suggested
> fix)
> ---------------------------------------------------------------------------------
>
> Key: PDFBOX-371
> URL: https://issues.apache.org/jira/browse/PDFBOX-371
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.7.3
> Environment: Java 1.5, OSX 10.5
> Reporter: Robert Baruch
> Priority: Minor
>
> When running text extraction on a PDF file that contains the soft hyphen
> character in the WinAnsiEncoding (that is, octal 0255, decimal 173), the text extractor
> incorrectly maps this as a space, when it should be a hyphen. As the PDF
> Reference 1.7 says in note 5 of table D.1:
> 'The hyphen character is also encoded as 255 in WinAnsiEncoding. The meaning
> of this duplicate code is "soft hyphen," but it is typographically the same
> as hyphen.'
> The reason that a soft hyphen is typographically the same as hyphen is that a
> soft hyphen indicates that a hyphen MAY be placed here if necessary (i.e.
> breaking a word across lines). Since the soft hyphen should only be put, by
> the PDF producer, at the end of a line to break a word, it stands to reason
> that the option to place a hyphen must be taken.
> I think I've traced the reason for the substitution to Encoding.getName,
> where because there is no mapping in the codeToName mapping for this code in
> WinAnsiEncoding, by default it returns "space".
> The fix is not as simple as adding an addCharacterEncoding( 0255,
> COSName.getPDFName("hyphen")) to WinAnsiEncoding, because that will set both
> the codeToName mapping AND the nameToCode mapping, which will overwrite the
> 055 nameToCode mapping.
> Adding this line:
> codeToName.put( new Integer(0255), COSName.getPDFName("hyphen"));
> to the end of the WinAnsiEncoding constructor seems to fix the issue.
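
The difference between the two approaches described in the issue can be shown with a minimal, self-contained sketch. This is not the PDFBox source: SketchEncoding and SoftHyphenSketch are hypothetical stand-ins, plain strings replace COSName, and the "space" fallback and the two-way behaviour of addCharacterEncoding are modelled only as the issue text describes them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an encoding table; NOT the PDFBox Encoding class.
class SketchEncoding {
    final Map<Integer, String> codeToName = new HashMap<>();
    final Map<String, Integer> nameToCode = new HashMap<>();

    // Registers the pair in both directions, as the issue describes.
    void addCharacterEncoding(int code, String name) {
        codeToName.put(code, name);
        nameToCode.put(name, code); // replaces any earlier code for this name
    }

    // Assumed fallback: unmapped codes come back as "space".
    String getName(int code) {
        String name = codeToName.get(code);
        return name != null ? name : "space";
    }

    Integer getCode(String name) {
        return nameToCode.get(name);
    }
}

class SoftHyphenSketch {
    // Only the regular hyphen (octal 055) is mapped.
    static SketchEncoding unpatched() {
        SketchEncoding e = new SketchEncoding();
        e.addCharacterEncoding(055, "hyphen");
        return e;
    }

    // Two-way registration of 0255: overwrites the 055 name-to-code entry.
    static SketchEncoding naiveFix() {
        SketchEncoding e = unpatched();
        e.addCharacterEncoding(0255, "hyphen");
        return e;
    }

    // One-way registration of 0255: name-to-code stays at 055.
    static SketchEncoding oneWayFix() {
        SketchEncoding e = unpatched();
        e.codeToName.put(0255, "hyphen");
        return e;
    }

    public static void main(String[] args) {
        System.out.println(unpatched().getName(0255));    // space
        System.out.println(naiveFix().getCode("hyphen")); // 173 (octal 0255)
        System.out.println(oneWayFix().getName(0255));    // hyphen
        System.out.println(oneWayFix().getCode("hyphen")); // 45 (octal 055)
    }
}
```

With the one-way entry, decoding 0255 yields "hyphen" while encoding "hyphen" still produces 055, which is the behaviour the suggested fix is after.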