Hey Georg,

> >> Additional tests showed that encoding with EarlyChange = 0 (very
> >> unusual) needs a bigger table (LZW_MAX_DICTSIZE + 1). I have done a lot
> >> of testing and the LZW encoder (with EarlyChange = 0) now outputs the
> >> same files as PDFlib-Lite-5.0.4p1 does. (PDFlib-Lite-5.0.4p1 is the only
> >> encoder with EarlyChange = 0 I found)
> >>
> >
> > So the changes done will make it work both with EarlyChange=0 and
> > EarlyChange=1, I am assuming.
>
> Yes.
>

Great.

> > Could you maybe try to explain one by one the changes in
> > 'src/base/pdf-stm-f-lzw.c'? They seem pretty straightforward, but I
> > would like to know the reasoning behind each of them. Are they all due
> > to needing a bigger table with EarlyChange=0?
>
> First of all: the standard is very vague. It says:
> "[EarlyChange:] An indication of when to increase the code length. If
> the value of this entry is 0, code length increases shall be postponed
> as long as possible. If the value is 1, code length increases shall
> occur one code early."
>

I think the Wikipedia page for LZW explains it quite well; check the
"Variable-width codes" section in:
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

> As I could not figure out which code is meant (input, output, or
> dictionary) this remains unclear to me.

The code whose width is increased is the generated output during
encoding, or the read input during decoding.

When encoding, the LZW filter starts out using N bits per generated
code, until the moment it finds that it needs to generate N+1 bits:
when 'ω+s' is not found in the dictionary. Once this happens, the
encoder can either:

 * Emit 'ω' with width N and then increase the width of the generated
   codes by one bit (standard).
 * Emit 'ω' with width N+1 directly (EarlyChange).

The decoder needs to know which of the two was used while decoding the
bytestream, or it won't be able to properly match the boundaries of the
generated codes.

> So I used showpdf(mupdf) to get
> (real world) examples and compared the decoded bytes with vbindiff and
> decoded examples by hand. It shows that the lzw_buffer_inc_bitsize comes
> too early [1].
>
> This fix broke my dec(enc(rand.bin)) == rand.bin "test". So I used
> showpdf examples again ... It shows that the reset code is encoded two
> codes too early [2].
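
To make the timing question concrete, here is a minimal sketch in C of
the width rule as seen from the decoder's side. It is only an
illustration under stated assumptions, not the code in
'src/base/pdf-stm-f-lzw.c': the name sketch_code_width and the two
SKETCH_* constants are hypothetical, and I am assuming the usual PDF
setup of 9-bit initial codes, a 12-bit maximum, and 258 as the first
free dictionary index. An encoder's dictionary normally runs one entry
ahead of the decoder's, so its thresholds would be shifted by one.

```c
#include <assert.h>

/* Hypothetical constants for this sketch; not taken from pdf-stm-f-lzw.c. */
#define SKETCH_MIN_BITS 9   /* assumed initial code width for 8-bit data */
#define SKETCH_MAX_BITS 12  /* assumed widest code for the PDF LZW filter */

/*
 * Width, in bits, of the next code a decoder should read.
 *
 * next_code is the index the decoder will give to its next new
 * dictionary entry (258 right after a reset: 0..255 are literals,
 * 256 is the clear code, 257 is end-of-data).
 * early_change is the EarlyChange value, 0 or 1.
 */
static unsigned
sketch_code_width (unsigned next_code, int early_change)
{
  unsigned width = SKETCH_MIN_BITS;

  assert (early_change == 0 || early_change == 1);

  /* Grow the width each time the (possibly early) index reaches the
     next power of two, but never beyond the maximum width. */
  while (width < SKETCH_MAX_BITS
         && next_code + (unsigned) early_change >= (1u << width))
    width++;

  return width;
}
```

With EarlyChange = 1 this sketch switches to 10 bits once next_code
reaches 511; with EarlyChange = 0 it waits until next_code reaches 512.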
>
> I have done this procedure for EarlyChange=0 as well and I found out
> that the dictionary is too small (after looking at the source code of
> pdflib5) [3].
>
> At [4],[5] I increased the numbers because of the increased DICTSIZE (to
> get the same result). But I am not 100% sure about this. Someone with
> experience in LZW should maybe double-check this.
>

I will try to dig deep into the process and check all this reasoning.
That +1 in the DICTSIZE really seemed to be just a convenience.

> [6] is irrelevant because it belongs to a further bug I will commit
> later. Sorry for that. I will remove it.
>

OK, thanks for noticing. I will remove that chunk myself from the branch
I'm reviewing.

> > Are we testing these fixes with more than one dataset?
>
> The unit test uses two test sets (that have at least one dict reset):
> one with EarlyChange == 0, one with EarlyChange == 1.
>
> But I tested LZW decoding with several files from the following creators:
> * Acrobat PDFWriter 2.01 for Windows
> * Acrobat PDFWriter 3.0 for Windows
> * Acrobat 3.0 Import Plug-in
> * Acrobat Distiller 2.0 for Windows
> * GPL Ghostscript 8.71
>
> For encoding see next answer.
>
> > How sure are we
> > that we're not fixing one test case and breaking all the others?
>
> Good question. As the standard remains unclear (to me) I can only do
> testing and look at other open projects' source code. Without my fix I
> am able to show you dozens of badly decoded PDFs. With my fix I cannot.
>
> With the encoder it is much harder. Because the decoder listens to the
> reset code, you can reset too early and the decoded result is correct
> anyway. My tests showed that in fact there are different LZW encoder
> implementations (for example, with EarlyChange = 1: GNU PDF is binary
> identical to Ghostscript 8.71, but "Acrobat Distiller 2.0" or "Acrobat
> PDFWriter 2.01" places the reset code later).
>
> I think my tests for the encoder are quite good.
> I tested (with EarlyChange = 1) 3 different PDFs created with
> Ghostscript 8.71 with a 48KB to 50KB LZW-Stream. The decoded bytes
> (with GNU PDF or pdfshow) encode to the original bytes.
>

For the encoder side, we will possibly always skip EarlyChange, which is
the standard way of doing it.

Cheers,
--
Aleksander