Re: Unicode codepoint conversions

2020-11-20 Thread Constantine Dokolas
You can get the code from
https://gist.github.com/cdokolas/8845724f8f4c0335dadfbc6f0c6afe0b
The generated PDF contains a "ToUnicode" object that holds the "different"
codepoints, but I don't know how to send the file to you.
Note that the font I'm using is Noto Sans CJK ("NotoSansCJKsc-Regular"),
loaded from a resource.

Thanks in advance,
Constantine

--
There is a computer disease that anybody who works with computers knows
about. It's a very serious disease and it interferes completely with the
work. The trouble with computers is that you 'play' with them!
- Richard P. Feynman




Re: Unicode codepoint conversions

2020-11-18 Thread sahy...@fileaffairs.de


On Wednesday, 18.11.2020 at 13:58 +0200, Constantine Dokolas wrote:
> I noticed that when I write some codepoints to a PDF and then read the
> text back from the generated PDF (via PDFTextStripper), some conversions
> happen. For example, the simple hyphen character (0x2D, "HYPHEN-MINUS")
> gets converted to a non-breaking hyphen (0x2011, "NON-BREAKING HYPHEN").
>
> Since I'm writing unit tests to verify that everything gets written
> correctly in the PDF from my end (PDF generation), I need to know why, when
> and how these conversions take place (I first noticed them while writing
> some CJK codepoints). Any suggestions/pointers?
> 

Could you share a code snippet showing how you are writing and retrieving the data?

BR
Maruan 

> Constantine






Unicode codepoint conversions

2020-11-18 Thread Constantine Dokolas
I noticed that when I write some codepoints to a PDF and then read the
text back from the generated PDF (via PDFTextStripper), some conversions
happen. For example, the simple hyphen character (0x2D, "HYPHEN-MINUS")
gets converted to a non-breaking hyphen (0x2011, "NON-BREAKING HYPHEN").

Since I'm writing unit tests to verify that everything gets written
correctly in the PDF from my end (PDF generation), I need to know why, when
and how these conversions take place (I first noticed them while writing
some CJK codepoints). Any suggestions/pointers?
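
For a unit test of this kind, it may help to report every mismatch by
codepoint and Unicode name rather than only failing on string inequality.
The following is a minimal sketch using only the Java standard library. The
class and method names (CodepointDiff, describeDifferences, describe) are
hypothetical and not part of PDFBox, and the "extracted" string here is a
hard-coded stand-in for whatever PDFTextStripper returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical test helper; not part of PDFBox.
final class CodepointDiff {

    // Compares two strings codepoint by codepoint and describes each mismatch.
    static List<String> describeDifferences(String written, String extracted) {
        int[] w = written.codePoints().toArray();
        int[] e = extracted.codePoints().toArray();
        List<String> report = new ArrayList<>();
        int common = Math.min(w.length, e.length);
        for (int i = 0; i < common; i++) {
            if (w[i] != e[i]) {
                report.add(String.format("index %d: %s -> %s",
                        i, describe(w[i]), describe(e[i])));
            }
        }
        if (w.length != e.length) {
            report.add(String.format(
                    "length differs: wrote %d codepoints, extracted %d",
                    w.length, e.length));
        }
        return report;
    }

    // Formats a codepoint as U+XXXX plus its Unicode name.
    // Character.getName returns null for unassigned codepoints.
    static String describe(int cp) {
        return String.format("U+%04X (%s)", cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        String written = "a-b";
        // Assumption: stand-in for text read back from the generated PDF.
        String extracted = "a\u2011b";
        describeDifferences(written, extracted).forEach(System.out::println);
    }
}
```

In a test, the returned list could be asserted to be empty, so that a failure
message lists exactly which codepoints changed and at which positions.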

Constantine