Hi David,

Thanks for diving in.



On Thu, Nov 29, 2012 at 11:00 AM, David Kastrup <d...@gnu.org> wrote:

> "jose.ali...@gmail.com" <jose.ali...@gmail.com> writes:
>
> > Hi,
> >
> > Thanks for reporting this. This error is on the parse of the metadata.
> > I have no time right now to look in deep at it, will try to do next
> > week, but the description you give is wrong to my eyes, so another
> > thing must be happening. I'll try to explain. One thing is that the
> > character "ä" is U+00e4, and another thing is how to code this
> > character in UTF-8, where you need two bytes, and the code is c3 a4,
> > so if lilypond are trying to code "ä" as a e4, this is not a valid
> > UTF-8 code!
>
> Sure, it isn't.  But pdfmarks are not encoded in UTF-8.  They are
> encoded either in PDFDocEncoding (a subset of Latin-1) or in UTF16BE
> with byte order mark.
>
Of course you are right, but we are talking about different parts of the
PDF file.

For the record, I didn't mean that LilyPond is doing it wrong. I just said
that the XML parser is getting e4 instead of c3 a4, so it is normal that the
XML parser chokes, as e4 on its own is not a valid UTF-8 sequence. Please take
this last sentence as what I wanted to say ;) and not as a claim that this is
a LilyPond bug.
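
To make the e4 versus c3 a4 point concrete, here is a minimal Python sketch.
It only illustrates the byte-level difference; it is not Evince or libxml
code, and the helper name is_valid_utf8 is made up for this example.

```python
# "ä" is the code point U+00E4. Its byte representation depends on the encoding.
utf8_bytes = "ä".encode("utf-8")      # two bytes: c3 a4
latin1_bytes = "ä".encode("latin-1")  # one byte: e4


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))
```

A lone e4 byte starts a three-byte UTF-8 sequence that never completes, which
is why a strict UTF-8 consumer rejects it.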


>
> Complain to Adobe about their choice, but as long as that is the way PDF
> encodes stuff, Evince can't unilaterally decide for something saner.
>
We don't decide anything unilaterally; we follow the PDF spec like everyone
else, so if LilyPond is producing an up-to-spec PDF file, it is of course our
bug and not yours. :)


> > Please note that the code that throws the error is the libxml parser,
> > which usually is very strict about encodings and things like that.
>
> The respective part in the PDF looks like
>
> <</Producer(GPL Ghostscript 9.06)
> /CreationDate(D:20121128183026+01'00')
> /ModDate(D:20121128183026+01'00')
> /Creator(LilyPond 2.17.7)
> /Author(\344 \366)
> /Title(\376\377\003\262)
> /Composer(\344 \366)>>endobj
>
> As you can see, there is no XML involved here at all.  Note that the PDF
> in the original report was generated from an input file accidentally
> written in Latin-1 (LilyPond requires UTF-8 input), so all bets are off
> with that.  However, when correctly encoding the input as UTF-8, at
> least the author field will still be cranked out encoded as
> Latin-1/PDFDocEncoding, and Evince (in contrast to other viewers and
> pdfinfo) will complain with the mentioned XML error.  Since it would
> appear that Evince generates that XML itself as part of its internal
> operations, it seems like it fails to convert PDFDocEncoding to UTF-8 in
> the process.
>

I think that you are not correct about what is happening here. We
interpret these fields using poppler, so we get the same result as pdfinfo :)
(You can open the file's properties in Evince and you will see them there.)
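
For reference, the Info dictionary strings quoted above can be decoded by hand.
The following Python sketch is only an approximation: it handles octal escapes
and nothing else, it uses Latin-1 as a stand-in for PDFDocEncoding (the two
agree for the bytes e4 and f6), and the function names are invented for this
example. It is not how poppler does it.

```python
import re


def unescape_octal(literal: str) -> bytes:
    """Convert a PDF literal string body with \\ddd octal escapes to raw bytes.

    Only octal escapes are handled; other PDF escapes are out of scope here.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]{1,3})", literal):
        out += literal[pos:match.start()].encode("latin-1")
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += literal[pos:].encode("latin-1")
    return bytes(out)


def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if it starts with the BOM fe ff,
    otherwise Latin-1 (assumed stand-in for PDFDocEncoding)."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")


author = decode_text_string(unescape_octal(r"\344 \366"))
title = decode_text_string(unescape_octal(r"\376\377\003\262"))
print(repr(author), repr(title))
```

So /Author holds two single-byte characters, while /Title starts with the byte
order mark and is therefore UTF-16BE.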

In the unicode.pdf test file from the start of this thread, I can see a
Metadata dict with a stream that contains the metadata XML. In this
metadata XML there is an "ä" character in the creator field of the RDF.
That is the "ä" the libxml parser is complaining about.

In particular, there is XML involved! And we don't generate this XML
ourselves; it is already present in the PDF.

That being said, I still have to read the stream in poppler and see how
the character is encoded in this XML. If it is encoded in Latin-1 while the
parser expects UTF-8, that would explain the error.
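
The suspected failure mode can be reproduced in a few lines of Python. This is
a sketch under assumptions: it uses the standard library's expat-based
xml.etree.ElementTree rather than libxml, and the <creator> document is a
made-up stand-in for the real metadata stream, which I have not inspected yet.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the metadata: an XML document declared as UTF-8.
TEMPLATE = '<?xml version="1.0" encoding="UTF-8"?><creator>{}</creator>'


def parses(xml_bytes: bytes) -> bool:
    """Return True if the bytes are accepted by the XML parser."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


good = TEMPLATE.format("ä").encode("utf-8")    # "ä" written as c3 a4
bad = TEMPLATE.format("ä").encode("latin-1")   # "ä" written as a bare e4

print(parses(good))
print(parses(bad))
```

If the stream in the PDF turns out to look like the second case, the parser
error is expected and the question becomes who wrote the Latin-1 byte there.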


Greetings

José


> --
> David Kastrup
>
_______________________________________________
evince-list mailing list
evince-list@gnome.org
https://mail.gnome.org/mailman/listinfo/evince-list
