Hello again,

So two weeks have passed, and I've moved at a glacial pace towards a method for measuring the compatibility of our generated ZIP files. Sorry, I just keep getting distracted.

Anyway, the idea is to have a bunch of files with names using different scripts, zip them with several packers (including git archive), unzip them and compare the result with the original files.

As a test corpus I used files named after the pangrams on the UTF-8 sampler page; the exact commands are attached at the end of this mail.


The numbers below are the number of lines that diff -ru prints for each pair of packer (rows) and unpacker (columns). There are 37 files, so the worst result is 74 lines of difference ("Only in [...]" for both sides), while 0 indicates a perfect score.

Hmm, come to think of it, an empty directory would show up as 37, so this metric is not ideal. A better one would be to simply give one point for each correctly unpacked file.
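
Both metrics can be sketched in a few lines of shell. This is only an illustration: the function names diff_lines and score and their directory arguments are made up here and are not part of the attached commands. The second function assumes that "correctly unpacked" means same name and same content.

```shell
# Metric used in the table: number of lines printed by diff -ru.
# $1 = directory with the original files, $2 = directory with the unpacked files.
diff_lines() {
    { diff -ru "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Alternative metric: one point per original file that exists under the
# same name with identical content after unpacking. Higher is better.
score() {
    points=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        if [ -f "$2/$name" ] && cmp -s "$f" "$2/$name"; then
            points=$((points + 1))
        fi
    done
    echo "$points"
}
```

For an empty result directory, diff_lines prints one line per original file, which looks better than a directory full of mangled names, while score gives 0 in that case.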

                                         Windows    Info-ZIP unzip
                            7-Zip PeaZip builtin Linux msysgit Windows
7-Zip 9.20                      0      0      46    26      43      43
PeaZip 4.7.1 win64              0      0      46    26      42      42
Info-ZIP zip 3.0 Linux          0      0      72     0      43      43
Info-ZIP zip 3.0 Windows       45     45     n/a     0      43      43
git-master                     72     72      72    60      72      72
git-master-patch1               0      0      72    60      72      72
git-master-patch2               0      0      72     0      72      72
git-v1.7.11.msysgit.1          72     72      72    60      72      72
git-v1.7.11.msysgit.1-patch1    0      0      72    60      72      72
git-v1.7.11.msysgit.1-patch2    0      0      72     0      72      72

Info-ZIP's programs don't work too well on Windows. The built-in unzipper of Windows 7 even refuses to open the file created by the Windows version of zip. Speaking of which, that built-in unzipper is the worst of the unpackers.

With the two patches applied, we can say "use 7-Zip or PeaZip on Windows and unzip on Linux" and filenames with all tested characters will be preserved. I was surprised to see this working fine with msysgit like that, even though no reencoding is introduced by the patches.

I wonder what 7-Zip and PeaZip do that gives them a slightly nicer score with the Windows-internal unzipper. Umlauts, Nordic characters and accents are preserved by that combination. It seems that unzip on Linux fails to unpack exactly these names, so perhaps they employ a dirty trick like using the local encoding in the ZIP file, which makes it unportable.
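
To illustrate why a name stored in a local encoding would break elsewhere: the same name is a different byte sequence in UTF-8 than in a legacy single-byte code page. The sketch below uses ISO-8859-1 merely as a stand-in; I have not checked which code page, if any, 7-Zip or PeaZip actually write. The helper name_bytes is made up for this example, and it assumes that iconv and od are available.

```shell
# Print the bytes of a name after converting it from UTF-8 to another encoding.
# $1 = target encoding (as understood by iconv), $2 = name in UTF-8.
name_bytes() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

name_bytes UTF-8 "ä"        # c3a4
name_bytes ISO-8859-1 "ä"   # e4
```

An unpacker that expects UTF-8 cannot decode the lone byte e4, and one that expects the code page reads c3 a4 as two characters.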

I'll reply with the two patches, which contain basically the same code as the previous patch, only split up. The second one declares that filenames with UTF-8 encoding came from Unix (instead of FAT), which makes unzip happy. This, however, implies that we store Unix permissions for these entries, which is a bit ugly.
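
For those who want to look at the relevant header fields, here is a small decoding sketch. The layout follows the ZIP format specification (the upper byte of "version made by" is the creating system, bit 11 of the general purpose flag marks UTF-8 names) and the Info-ZIP convention of keeping the Unix mode in the upper 16 bits of the external attributes. The helper name describe_entry and the sample values are made up for illustration; they are not taken from the patches.

```shell
# Decode the central directory fields that matter for filename handling.
# $1 = "version made by" (16 bit), $2 = general purpose bit flag (16 bit),
# $3 = external file attributes (32 bit). Hex values (0x...) are accepted.
describe_entry() {
    host=$(( ($1 >> 8) & 0xff ))      # upper byte: 0 = FAT, 3 = Unix
    utf8=$(( ($2 >> 11) & 1 ))        # bit 11: name is stored as UTF-8
    mode=$(( ($3 >> 16) & 0xffff ))   # Unix mode, only meaningful if host is 3
    case $host in
        0) creator=FAT ;;
        3) creator=Unix ;;
        *) creator="other($host)" ;;
    esac
    printf 'creator=%s utf8=%d mode=%o\n' "$creator" "$utf8" "$mode"
}

describe_entry 0x0314 0x0800 0x81a40000   # creator=Unix utf8=1 mode=100644
describe_entry 0x0014 0x0000 0x00000020   # creator=FAT utf8=0 mode=0
```

The first call shows the side effect mentioned above: once the creator is Unix, the upper half of the external attributes is read as a file mode.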

        mkdir pangrams
        cd pangrams

        echo English >"The quick brown fox jumps over the lazy dog"
        echo Irish 1 >"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall"
        echo Irish 2 >"lena ṗóg éada ó ṡlí do leasa ṫú"
        echo Irish 3 >"D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór"
        echo Irish 4 >"Éava agus Áḋaiṁ"
        echo Dutch >"Pa's wijze lynx bezag vroom het fikse aquaduct"
        echo German 1 >"Falsches Üben von Xylophonmusik quält"
        echo German 2 >"jeden größeren Zwerg"
        echo Norwegian >"Blåbærsyltetøy"
        echo Danish >"Høj bly gom vandt fræk sexquiz på wc"
        echo Swedish >"Flygande bäckasiner söka strax hwila på mjuka tuvor"
        echo Icelandic >"Sævör grét áðan því úlpan var ónýt"
        echo Finnish >"Törkylempijävongahdus"
        echo Polish >"Pchnąć w tę łódź jeża lub osiem skrzyń fig"
        echo Czech >"Příliš žluťoučký kůň úpěl ďábelské kódy"
        echo Slovak 1 >"Starý kôň na hŕbe kníh žuje tíško povädnuté [...]"
        echo Slovak 2 >"na stĺpe sa ďateľ učí kvákať novú ódu o [...]"
        echo monotonic Greek >"ξεσκεπάζω την ψυχοφθόρα [...]"
        echo polytonic Greek >"ξεσκεπάζω τὴν ψυχοφθόρα [...]"
        echo Russian >"Съешь же ещё этих мягких французских булок да выпей чаю"
        echo Bulgarian 1 >"Жълтата дюля беше щастлива"
        echo Bulgarian 2 >"че пухът, който цъфна, замръзна като гьон"
        echo Northern Sami >"Vuol Ruoŧa geđggiid leat máŋga luosa ja [...]"
        echo Hungarian >"Árvíztűrő tükörfúrógép"
        echo Spanish 1 >"El pingüino Wenceslao hizo kilómetros bajo [...]"
        echo Spanish 2 >"lluvia y frío añoraba a su querido cachorro"
        echo Portuguese 1 >"O próximo vôo à noite sobre o Atlântico"
        echo Portuguese 2 >"põe freqüentemente o único médico"
        echo French 1 >"Les naïfs ægithales hâtifs pondant à Noël où il [...]"
        echo French 2 >"sont sûrs d'être déçus en voyant leurs drôles"
        echo French 3 >"d'œufs abîmés"
        echo Esperanto >"Eĥoŝanĝo ĉiuĵaŭde"
        echo Hebrew >"זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן"
        echo Hiragana 1 >"いろはにほへど ちりぬるを"
        echo Hiragana 2 >"わがよたれぞ つねならむ"
        echo Hiragana 3 >"うゐのおくやま けふこえて"
        echo Hiragana 4 >"あさきゆめみじ ゑひもせず"
