[
https://issues.apache.org/jira/browse/PDFBOX-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863515#comment-16863515
]
chunlinyao commented on PDFBOX-4572:
------------------------------------
The magic number is the font name ＭＳ明朝 (MS Mincho, written with full-width ＭＳ) encoded in CP932:
{code}
echo "0000 826c 8272 96BE 92A9" |xxd -r |iconv -f cp932
ＭＳ明朝{code}
PDF Reference 1.6, Appendix H (Compatibility and Implementation Notes), Section 3.2.4:
{quote}5. In Acrobat 4.0 and earlier versions, a name object being treated as
text is typically interpreted in a host platform encoding, which depends on the
operating system and the local language. For Asian languages, this
encoding may be something like Shift-JIS or Big Five. Consequently, it is
necessary to distinguish between names encoded this way and ones
encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its
use can usually be recognized. A name that does not conform to UTF-8
encoding rules can instead be interpreted according to host platform encoding.
{quote}
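The rule in the quoted note can be sketched roughly as follows: try a strict UTF-8 decode first, and only if the bytes are not well-formed UTF-8 fall back to a host platform encoding. This is a minimal illustration, not PDFBox code. The class name NameDecoder and the method decodeName are hypothetical, and the fallback charset is an assumption supplied by the caller (windows-31j, the Java name for MS932, is used for this file's bytes and is assumed to be available in the running JDK).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class NameDecoder {

    // Decodes raw name bytes as UTF-8 if they are well-formed UTF-8,
    // otherwise decodes them with the caller-supplied fallback charset.
    static String decodeName(byte[] raw, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed input.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: interpret in the assumed host platform encoding.
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        // The bytes from the xxd example above.
        byte[] raw = {(byte) 0x82, (byte) 0x6C, (byte) 0x82, (byte) 0x72,
                      (byte) 0x96, (byte) 0xBE, (byte) 0x92, (byte) 0xA9};
        // Assumption: the fallback is known to be MS932 (windows-31j).
        System.out.println(decodeName(raw, Charset.forName("windows-31j")));
    }
}
```

The sketch only separates UTF-8 from non-UTF-8 input. It does not determine which platform encoding was used, so the fallback still has to come from somewhere else, and a short byte sequence that happens to be valid UTF-8 would be decoded as UTF-8 even if it was written in another encoding.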
Is there any way to detect which host platform encoding was used?
> Font name not decoded correctly.
> --------------------------------
>
> Key: PDFBOX-4572
> URL: https://issues.apache.org/jira/browse/PDFBOX-4572
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 2.0.15
> Reporter: chunlinyao
> Priority: Minor
> Attachments: sample_ja.pdf
>
>
> The attached file encodes the font name in MS932, and PDFBox decodes it incorrectly.
> Maybe this file is malformed.