[ https://issues.apache.org/jira/browse/PDFBOX-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057046#comment-16057046 ]
Christopher Creutzig edited comment on PDFBOX-3833 at 6/21/17 6:20 AM:
-----------------------------------------------------------------------
Thanks, that makes sense.
ー is a diacritic in the sense that it changes the pronunciation of the previous
mora and does not have a sound of its own.
In terms of layout, however, as far as I know it behaves just like any other
character (except that it is the only kana rotated in vertical text). That is,
it does not behave like a diacritic the way ¨ or ˆ do.
So in the context of an algorithm concerned with character placement, as
opposed to transliteration or text-to-speech, ー should indeed not be regarded
as a diacritic.
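
To make the distinction concrete, here is a minimal sketch, not PDFBox's actual
code. It assumes a text-placement heuristic that decides "diacritic or not"
purely from the Unicode general category. The class and method names are
hypothetical. ー (U+30FC) has category Lm (modifier letter), the same category
as ˆ (U+02C6), so a category-only test cannot tell the two apart, and an
explicit exception is one way to keep ー in its own position.

```java
final class DiacriticSketch {
    // U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
    static final int PROLONGED_SOUND_MARK = 0x30FC;

    // Hypothetical category-only heuristic: treats non-spacing marks,
    // modifier symbols and modifier letters as diacritics.
    static boolean looksLikeDiacritic(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.MODIFIER_SYMBOL
            || type == Character.MODIFIER_LETTER;
    }

    // Same heuristic, but the prolonged sound mark is excluded because it
    // occupies its own cell in the line like any other kana.
    static boolean isLayoutDiacritic(int codePoint) {
        return codePoint != PROLONGED_SOUND_MARK && looksLikeDiacritic(codePoint);
    }

    public static void main(String[] args) {
        // ¨, ˆ, combining diaeresis, ー, タ
        int[] samples = {0x00A8, 0x02C6, 0x0308, 0x30FC, 0x30BF};
        for (int cp : samples) {
            System.out.printf("U+%04X category-only=%b layout=%b%n",
                    cp, looksLikeDiacritic(cp), isLayoutDiacritic(cp));
        }
    }
}
```

The special case is deliberately narrow: only U+30FC is excluded, so ¨ and ˆ
are still classified as before.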
> Characters in wrong order
> -------------------------
>
> Key: PDFBOX-3833
> URL: https://issues.apache.org/jira/browse/PDFBOX-3833
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.5
> Reporter: Christopher Creutzig
> Attachments: ML_mathworks_unc2.pdf, PDFBOX-3833-reduced.pdf
>
>
> The attached pdf file (which is page 3 of
> https://jp.mathworks.com/tagteam/89688_93050v00_JP_machine_learning_section1_ebook.pdf)
> shows multiple problems when read with PDFBox using default settings. This
> bug report in particular is about the Katakana ー being misplaced.
> In the text block on the left, the second line starts with ターン.
> PDFTextStripper.getText returns text starting with タ ンー (i.e., it adds a
> space after the first character and swaps the second and third characters).
> This effect also occurs in other places in the (complete) file.
> The PDF itself at this point has [<03BB>43.9 <0294>156 <03EF>-24.5 ...]TJ,
> listing the characters in the proper order. Copy&paste using Apple's
> Preview.App also preserves that order.