Bug#779207: patch submission for improving code pages support in unzip

Ivan Sorokin Sun, 26 May 2024 08:51:17 -0700

Dear colleagues,
 
I am writing to bring to your attention problems with the current upstream 
version of unzip, which has not been updated for many years. In a modern 
environment, where the vast majority of systems use UTF-8, unzip exhibits 
several problems that need addressing:
 
1) unzip cannot correctly extract entries that have bit 11 set in the 
General Purpose flag. This bit indicates that the file names are encoded in 
UTF-8. However, unzip still re-encodes them as if they were in the OEM 
codepage, which produces incorrect file names.
2) By default, unzip does not display UTF-8 file names correctly on Unix 
systems.
3) The OEM codepage needs to be determined from the system locale, rather 
than using a single codepage for all archives.
4) The assumption that archives whose legacy codepage cannot be determined 
are encoded in ISO 8859-1 is incorrect. In reality, most archivers used the 
user's system codepage, which could be any codepage. It is more reasonable 
to leave the encoding unaltered in this case, so that the archive can at 
least be opened on the same system where it was created. Additionally, 
options -O and -I have been added to specify the encoding manually.
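
To illustrate item 1, here is a minimal sketch of the decision involved. It 
is not the code from the patch: the mask name and the helper function are 
hypothetical, and the only facts assumed are that bit 11 corresponds to the 
mask 0x0800 and that conversion should be skipped only when both the entry 
name and the native charset are UTF-8.

```c
/* Bit 11 of the general purpose bit flag (mask 0x0800) marks the entry's
 * file name as UTF-8. The macro name is an assumption for this sketch. */
#define GPF_UTF8_MASK 0x0800u

/* Hypothetical helper: returns 1 when the stored name still has to be
 * converted from a legacy code page, and 0 when it is already UTF-8 and
 * the native charset is UTF-8 too, so it can be used unchanged. */
static int needs_legacy_conversion(unsigned gp_flags, int native_is_utf8)
{
    int name_is_utf8 = (gp_flags & GPF_UTF8_MASK) != 0;

    return !(name_is_utf8 && native_is_utf8);
}
```

For an entry with flags 0x0800 on a UTF-8 system the helper returns 0, so 
the name is left alone; in every other combination it returns 1.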
 
I have prepared a patch (based on a similar patch from Ubuntu, with significant 
enhancements) that addresses these issues. A significant difference from the 
Ubuntu patch is that my code is capable of selecting the OEM codepage based on 
the system locale, instead of assuming the Russian/Cyrillic CP866 codepage for 
all archives when the system is set to UTF-8.
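
The locale-based selection can be sketched roughly as follows. This is an 
illustration under stated assumptions rather than the patch code: the 
struct, the table name and the function name are hypothetical, the table 
is abbreviated to four of the pairs that appear in the patch, and, unlike 
the patch, the sketch requires an exact match of the language_territory 
part and also stops at an '@' modifier.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, abbreviated locale -> OEM code page table. */
struct lc_oem {
    const char *locale;
    const char *oem_cp;
};

static const struct lc_oem lc_oem_table[] = {
    { "de_DE", "CP850" },
    { "en_US", "CP437" },
    { "ja_JP", "CP932" },
    { "ru_RU", "CP866" }
};

/* Returns the OEM code page name for a locale string such as
 * "ru_RU.UTF-8", or NULL when the locale is not in the table. */
static const char *oem_cp_for_locale(const char *lc)
{
    size_t len, i;

    if (lc == NULL)
        return NULL;

    /* Keep only the part before the charset or modifier suffix. */
    len = strcspn(lc, ".@:");
    if (len == 0)
        return NULL;

    for (i = 0; i < sizeof(lc_oem_table) / sizeof(lc_oem_table[0]); i++) {
        if (strlen(lc_oem_table[i].locale) == len &&
            strncmp(lc, lc_oem_table[i].locale, len) == 0)
            return lc_oem_table[i].oem_cp;
    }
    return NULL;
}
```

In a real program the argument would typically be the string returned by 
setlocale(LC_CTYPE, ""), and a NULL result would mean that no OEM codepage 
is assumed.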
 
I hope you will find this patch useful.
 
Best regards,
Ivan Sorokin

From: Giovanni Scafora <giovanni.archlinux.org>
Subject: unzip files encoded with non-latin, non-unicode file names
Last-Update: 2024-02-22


* Updated 2015-02-11 by Marc Deslauriers <marc.deslauri...@canonical.com>
  to fix buffer overflow in charset_to_intern()
* Updated 2023-06-15 by Dominik Viererbe <dominik.viere...@canonical.com>
  to add documentation for `-I` and `-O` options (LP: #138307) and fixed
  garbled output when `zipinfo` or `unzip -Z` is called without arguments
  (LP: #1429939)
* Updated 2024-05-26 by Ivan Sorokin <un...@mail.ru>
  to fix several problems in codepages support:
  1) Fixed bit 11 of General purpose flag support on systems with UTF-8
  system charset
  2) Fixed OEM code page being always assumed Russian/Cyrillic CP866
  on any UTF-8 system
  3) Added proper OEM code page detection based on system locale setting
  4) Removed translation from ISO 8859-1 to local charset;
  assumption that any non-unicode archive uses it is for sure wrong
  as it can be any charset used on archive creator's local system;
  also do not treat PKZIP for UNIX 2.51 archives
  as having ISO 8859-1 charset for the same reasons
  5) Enabled UTF-8 output by default on Unix systems

--- a/INSTALL
+++ b/INSTALL
@@ -480,7 +480,7 @@ To compile UnZip, UnZipSFX and/or fUnZip (detailed instructions):
       NO_WORKING_ISPRINT
         The symbol HAVE_WORKING_ISPRINT enables enhanced non-printable chars
         filtering for filenames in the fnfilter() function.  On some systems
-        (Unix, VMS, some Win32 compilers), this setting is enabled by default.
+        (VMS, some Win32 compilers), this setting is enabled by default.
         In cases where isprint() flags printable extended characters as
         unprintable, defining NO_WORKING_ISPRINT allows to disable the enhanced
         filtering capability in fnfilter().  (The ASCII control codes 0x01 to
--- a/fileio.c
+++ b/fileio.c
@@ -2144,9 +2144,15 @@ int do_string(__G__ length, option)   /* return PK-type error code */
                 /* translate the text coded in the entry's host-dependent
                    "extended ASCII" charset into the compiler's (system's)
                    internal text code page */
-                Ext_ASCII_TO_Native((char *)G.outbuf, G.pInfo->hostnum,
-                                    G.pInfo->hostver, G.pInfo->HasUxAtt,
-                                    FALSE);
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+                if (!G.pInfo->GPFIsUTF8 || !G.native_is_utf8) {
+#endif
+                        Ext_ASCII_TO_Native((char *)G.outbuf, G.pInfo->hostnum,
+                                            G.pInfo->hostver, G.pInfo->HasUxAtt,
+                                            FALSE);
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+                }
+#endif
 #ifdef WINDLL
                 /* translate to ANSI (RTL internal codepage may be OEM) */
                 INTERN_TO_ISO((char *)G.outbuf, (char *)G.outbuf);
@@ -2258,8 +2264,14 @@ int do_string(__G__ length, option)   /* return PK-type error code */
 
         /* translate the Zip entry filename coded in host-dependent "extended
            ASCII" into the compiler's (system's) internal text code page */
-        Ext_ASCII_TO_Native(G.filename, G.pInfo->hostnum, G.pInfo->hostver,
-                            G.pInfo->HasUxAtt, (option == DS_FN_L));
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+        if (!G.pInfo->GPFIsUTF8 || !G.native_is_utf8) {
+#endif
+            Ext_ASCII_TO_Native(G.filename, G.pInfo->hostnum, G.pInfo->hostver,
+                                G.pInfo->HasUxAtt, (option == DS_FN_L));
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+        }
+#endif
 
         if (G.pInfo->lcflag)      /* replace with lowercase filename */
             STRLOWER(G.filename, G.filename);
--- a/man/unzip.1
+++ b/man/unzip.1
@@ -325,6 +325,8 @@ extension, it is replaced by the info from the extra field.)
 [MacOS only] ignore filenames stored in MacOS extra fields. Instead, the
 most compatible filename stored in the generic part of the entry's header
 is used.
+.IP \fB\-I\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for UNIX and other archives.
 .TP
 .B \-j
 junk paths.  The archive's directory structure is not recreated; all files
@@ -386,6 +388,8 @@ of \fIzip\fP(1), which stores filenotes as comments.
 overwrite existing files without prompting.  This is a dangerous option, so
 use it with care.  (It is often used with \fB\-f\fP, however, and is the only
 way to overwrite directory EAs under OS/2.)
+.IP \fB\-O\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for DOS, Windows and OS/2 archives.
 .IP \fB\-P\fP\ \fIpassword\fP
 use \fIpassword\fP to decrypt encrypted zipfile entries (if any).  \fBTHIS IS
 INSECURE!\fP  Many multi-user operating systems provide ways for any user to
--- a/man/zipinfo.1
+++ b/man/zipinfo.1
@@ -174,6 +174,10 @@ back to the behaviour of previous versions.
 .TP
 .B \-z
 include the archive comment (if any) in the listing.
+.IP \fB\-I\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for UNIX and other archives.
+.IP \fB\-O\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for DOS, Windows and OS/2 archives.
 .PD
 .\" =========================================================================
 .SH "DETAILED DESCRIPTION"
--- a/unix/unix.c
+++ b/unix/unix.c
@@ -30,6 +30,10 @@
 #define UNZIP_INTERNAL
 #include "unzip.h"
 
+#include <iconv.h>
+#include <langinfo.h>
+#include <stdbool.h>
+
 #ifdef SCO_XENIX
 #  define SYSNDIR
 #else  /* SCO Unix, AIX, DNIX, TI SysV, Coherent 4.x, ... */
@@ -1874,3 +1878,161 @@ static void qlfix(__G__ ef_ptr, ef_len)
     }
 }
 #endif /* QLZIP */
+
+
+typedef struct {
+    char *local_charset;
+    char *archive_charset;
+} CHARSET_MAP;
+
+/* A mapping of local <-> archive charsets used by default to convert filenames
+ * of DOS/Windows Zip archives. Currently very basic. */
+static CHARSET_MAP dos_charset_map[] = {
+    { "ANSI_X3.4-1968", "CP850" },
+    { "ISO-8859-1", "CP850" },
+    { "CP1252", "CP850" },
+    { "KOI8-R", "CP866" },
+    { "KOI8-U", "CP866" },
+    { "ISO-8859-5", "CP866" }
+};
+
+char OEM_CP[MAX_CP_NAME] = "";
+char ISO_CP[MAX_CP_NAME] = "";
+
+/* Try to guess the default value of OEM_CP based on the current locale.
+ * ISO_CP is left alone for now. */
+void init_conversion_charsets()
+{
+    const char *local_charset;
+    int i;
+
+    /* Make a guess only if OEM_CP not already set. */
+    if(*OEM_CP == '\0') {
+        local_charset = nl_langinfo(CODESET);
+        for(i = 0; i < sizeof(dos_charset_map)/sizeof(CHARSET_MAP); i++)
+            if(!strcasecmp(local_charset, dos_charset_map[i].local_charset)) {
+                    strncpy(OEM_CP, dos_charset_map[i].archive_charset,
+                        sizeof(OEM_CP));
+                break;
+            }
+    }
+
+    // Still not detected? Try to detect by system locale
+    if(*OEM_CP == '\0') {
+
+      const char *lcToOemTable[] = {
+        "af_ZA", "CP850", "ar_SA", "CP720", "ar_LB", "CP720", "ar_EG", "CP720",
+        "ar_DZ", "CP720", "ar_BH", "CP720", "ar_IQ", "CP720", "ar_JO", "CP720",
+        "ar_KW", "CP720", "ar_LY", "CP720", "ar_MA", "CP720", "ar_OM", "CP720",
+        "ar_QA", "CP720", "ar_SY", "CP720", "ar_TN", "CP720", "ar_AE", "CP720",
+        "ar_YE", "CP720", "ast_ES", "CP850", "az_AZ", "CP866", "az_AZ", "CP857",
+        "be_BY", "CP866", "bg_BG", "CP866", "br_FR", "CP850", "ca_ES", "CP850",
+        "zh_CN", "CP936", "zh_TW", "CP950", "kw_GB", "CP850", "cs_CZ", "CP852",
+        "cy_GB", "CP850", "da_DK", "CP850", "de_AT", "CP850", "de_LI", "CP850",
+        "de_LU", "CP850", "de_CH", "CP850", "de_DE", "CP850", "el_GR", "CP737",
+        "en_AU", "CP850", "en_CA", "CP850", "en_GB", "CP850", "en_IE", "CP850",
+        "en_JM", "CP850", "en_BZ", "CP850", "en_PH", "CP437", "en_ZA", "CP437",
+        "en_TT", "CP850", "en_US", "CP437", "en_ZW", "CP437", "en_NZ", "CP850",
+        "es_PA", "CP850", "es_BO", "CP850", "es_CR", "CP850", "es_DO", "CP850",
+        "es_SV", "CP850", "es_EC", "CP850", "es_GT", "CP850", "es_HN", "CP850",
+        "es_NI", "CP850", "es_CL", "CP850", "es_MX", "CP850", "es_ES", "CP850",
+        "es_CO", "CP850", "es_ES", "CP850", "es_PE", "CP850", "es_AR", "CP850",
+        "es_PR", "CP850", "es_VE", "CP850", "es_UY", "CP850", "es_PY", "CP850",
+        "et_EE", "CP775", "eu_ES", "CP850", "fa_IR", "CP720", "fi_FI", "CP850",
+        "fo_FO", "CP850", "fr_FR", "CP850", "fr_BE", "CP850", "fr_CA", "CP850",
+        "fr_LU", "CP850", "fr_MC", "CP850", "fr_CH", "CP850", "ga_IE", "CP437",
+        "gd_GB", "CP850", "gv_IM", "CP850", "gl_ES", "CP850", "he_IL", "CP862",
+        "hr_HR", "CP852", "hu_HU", "CP852", "id_ID", "CP850", "is_IS", "CP850",
+        "it_IT", "CP850", "it_CH", "CP850", "iv_IV", "CP437", "ja_JP", "CP932",
+        "kk_KZ", "CP866", "ko_KR", "CP949", "ky_KG", "CP866", "lt_LT", "CP775",
+        "lv_LV", "CP775", "mk_MK", "CP866", "mn_MN", "CP866", "ms_BN", "CP850",
+        "ms_MY", "CP850", "nl_BE", "CP850", "nl_NL", "CP850", "nl_SR", "CP850",
+        "nn_NO", "CP850", "nb_NO", "CP850", "pl_PL", "CP852", "pt_BR", "CP850",
+        "pt_PT", "CP850", "rm_CH", "CP850", "ro_RO", "CP852", "ru_RU", "CP866",
+        "sk_SK", "CP852", "sl_SI", "CP852", "sq_AL", "CP852", "sr_RS", "CP855",
+        "sr_RS", "CP852", "sv_SE", "CP850", "sv_FI", "CP850", "sw_KE", "CP437",
+        "th_TH", "CP874", "tr_TR", "CP857", "tt_RU", "CP866", "uk_UA", "CP866",
+        "ur_PK", "CP720", "uz_UZ", "CP866", "uz_UZ", "CP857", "vi_VN", "CP1258",
+        "wa_BE", "CP850", "zh_HK", "CP950", "zh_SG", "CP936"};
+
+      int tableLen = sizeof(lcToOemTable) / sizeof(lcToOemTable[0]);
+      int lcLen = 0, i;
+
+      // Detect required code page name from current locale
+      char *lc = setlocale(LC_CTYPE, "");
+
+      if (lc && lc[0]) {
+        // Compare up to the dot, if it exists, e.g. en_US.UTF-8
+        for (lcLen = 0; lc[lcLen] != '.' && lc[lcLen] != ':' && lc[lcLen] != '\0'; ++lcLen);
+
+        for (i = 0; i < tableLen; i += 2)
+
+          if (strncmp(lc, (lcToOemTable[i]), lcLen) == 0) {
+
+            strncpy(OEM_CP, lcToOemTable[i + 1],
+              sizeof(OEM_CP));
+
+            break;
+          }
+      }
+    }
+}
+
+/* Convert a string from one encoding to the current locale using iconv().
+ * Be as non-intrusive as possible. If error is encountered during conversion
+ * just leave the string intact. */
+static void charset_to_intern(char *string, char *from_charset)
+{
+    iconv_t cd;
+    char *s,*d, *buf;
+    size_t slen, dlen, buflen;
+    const char *local_charset;
+
+    if(*from_charset == '\0')
+        return;
+
+    buf = NULL;
+    local_charset = nl_langinfo(CODESET);
+
+    if((cd = iconv_open(local_charset, from_charset)) == (iconv_t)-1)
+        return;
+
+    slen = strlen(string);
+    s = string;
+
+    /*  Make sure OUTBUFSIZ + 1 never ends up smaller than FILNAMSIZ
+     *  as this function also gets called with G.outbuf in fileio.c
+     */
+    buflen = FILNAMSIZ;
+    if (OUTBUFSIZ + 1 < FILNAMSIZ)
+    {
+        buflen = OUTBUFSIZ + 1;
+    }
+
+    d = buf = malloc(buflen);
+    if(!d)
+        goto cleanup;
+
+    bzero(buf,buflen);
+    dlen = buflen - 1;
+
+    if(iconv(cd, &s, &slen, &d, &dlen) == (size_t)-1)
+        goto cleanup;
+    strncpy(string, buf, buflen);
+
+    cleanup:
+    free(buf);
+    iconv_close(cd);
+}
+
+/* Convert a string from OEM_CP to the current locale charset. */
+inline void oem_intern(char *string)
+{
+    charset_to_intern(string, OEM_CP);
+}
+
+/* Convert a string from ISO_CP to the current locale charset. */
+inline void iso_intern(char *string)
+{
+    charset_to_intern(string, ISO_CP);
+}
--- a/unix/unxcfg.h
+++ b/unix/unxcfg.h
@@ -174,8 +174,8 @@ typedef struct stat z_stat;
 #endif
 #ifndef NO_SETLOCALE
 # if (!defined(NO_WORKING_ISPRINT) && !defined(HAVE_WORKING_ISPRINT))
-   /* enable "enhanced" unprintable chars detection in fnfilter() */
-#  define HAVE_WORKING_ISPRINT
+   /* disable "enhanced" unprintable chars detection in fnfilter() */
+#  define NO_WORKING_ISPRINT
 # endif
 #endif
 
@@ -228,4 +228,30 @@ typedef struct stat z_stat;
 /* wild_dir, dirname, wildname, matchname[], dirnamelen, have_dirname, */
 /*    and notfirstcall are used by do_wild().                          */
 
+
+#define MAX_CP_NAME 25
+
+#ifdef SETLOCALE
+#  undef SETLOCALE
+#endif
+#define SETLOCALE(category, locale) setlocale(category, locale)
+#include <locale.h>
+
+#ifdef _ISO_INTERN
+#  undef _ISO_INTERN
+#endif
+#define _ISO_INTERN(str1) iso_intern(str1)
+
+#ifdef _OEM_INTERN
+#  undef _OEM_INTERN
+#endif
+#ifndef IZ_OEM2ISO_ARRAY
+#  define IZ_OEM2ISO_ARRAY
+#endif
+#define _OEM_INTERN(str1) oem_intern(str1)
+
+void iso_intern(char *);
+void oem_intern(char *);
+void init_conversion_charsets(void);
+
 #endif /* !__unxcfg_h */
--- a/unzip.c
+++ b/unzip.c
@@ -327,11 +327,21 @@ static ZCONST char Far ZipInfoUsageLine2[] = "\nmain\
   -2  just filenames but allow -h/-t/-z  -l  long Unix \"ls -l\" format\n\
                                          -v  verbose, multi-page format\n";
 
+#ifndef UNIX
 static ZCONST char Far ZipInfoUsageLine3[] = "miscellaneous options:\n\
   -h  print header line       -t  print totals for listed files or for all\n\
-  -z  print zipfile comment   -T  print file times in sortable decimal format\
-\n  -C  be case-insensitive   %s\
+  -z  print zipfile comment   -T  print file times in sortable decimal format\n\
+  -C  be case-insensitive     %s\
   -x  exclude filenames that follow from listing\n";
+#else /* UNIX */
+static ZCONST char Far ZipInfoUsageLine3[] = "miscellaneous options:\n\
+  -h  print header line       -t  print totals for listed files or for all\n\
+  -z  print zipfile comment   -T  print file times in sortable decimal format\n\
+  -C  be case-insensitive   %s\
+  -x  exclude filenames that follow from listing\n\
+  -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives\n\
+  -I CHARSET  specify a character encoding for UNIX and other archives\n";
+#endif /* !UNIX */
 #ifdef MORE
    static ZCONST char Far ZipInfoUsageLine4[] =
      "  -M  page output through built-in \"more\"\n";
@@ -664,6 +674,17 @@ modifiers:\n\
   -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields\n\
   -C  match filenames case-insensitively     -L  make (some) names \
 lowercase\n %-42s  -V  retain VMS version numbers\n%s";
+#elif (defined UNIX)
+static ZCONST char Far UnzipUsageLine4[] = "\
+modifiers:\n\
+  -n  never overwrite existing files         -q  quiet mode (-qq => quieter)\n\
+  -o  overwrite files WITHOUT prompting      -a  auto-convert any text files\n\
+  -j  junk paths (do not make directories)   -aa treat ALL files as text\n\
+  -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields\n\
+  -C  match filenames case-insensitively     -L  make (some) names \
+lowercase\n %-42s  -V  retain VMS version numbers\n%s\
+  -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives\n\
+  -I CHARSET  specify a character encoding for UNIX and other archives\n\n";
 #else /* !VMS */
 static ZCONST char Far UnzipUsageLine4[] = "\
 modifiers:\n\
@@ -802,6 +823,10 @@ int unzip(__G__ argc, argv)
 #endif /* UNICODE_SUPPORT */
 
 
+#ifdef UNIX
+    init_conversion_charsets();
+#endif
+
 #if (defined(__IBMC__) && defined(__DEBUG_ALLOC__))
     extern void DebugMalloc(void);
 
@@ -1335,6 +1360,11 @@ int uz_opts(__G__ pargc, pargv)
     argc = *pargc;
     argv = *pargv;
 
+#ifdef UNIX
+    extern char OEM_CP[MAX_CP_NAME];
+    extern char ISO_CP[MAX_CP_NAME];
+#endif
+
     while (++argv, (--argc > 0 && *argv != NULL && **argv == '-')) {
         s = *argv + 1;
         while ((c = *s++) != 0) {    /* "!= 0":  prevent Turbo C warning */
@@ -1516,6 +1546,35 @@ int uz_opts(__G__ pargc, pargv)
                     }
                     break;
 #endif  /* MACOS */
+#ifdef UNIX
+                case ('I'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Icharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(ISO_CP, s, sizeof(ISO_CP));
+                        } else { /* -I charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(ISO_CP, s, sizeof(ISO_CP));
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case ('j'):    /* junk pathnames/directory structure */
                     if (negative)
                         uO.jflag = FALSE, negative = 0;
@@ -1591,6 +1650,35 @@ int uz_opts(__G__ pargc, pargv)
                     } else
                         ++uO.overwrite_all;
                     break;
+#ifdef UNIX
+                case ('O'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Ocharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -O argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                        } else { /* -O charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -O argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case ('p'):    /* pipes:  extract to stdout, no messages */
                     if (negative) {
                         uO.cflag = FALSE;
@@ -2162,6 +2250,7 @@ static void help_extended(__G)
   "         ACORN_FTYPE_NFS] Translate filetype and append to name.",
   "  -i   [MacOS] Ignore filenames in MacOS extra field.  Instead, use name in",
   "         standard header.",
+  "  -I CHARSET  [UNIX] Specify a character encoding for UNIX and other archives.",
   "  -j   Junk paths and deposit all files in extraction directory.",
   "  -J   [BeOS] Junk file attributes.  [MacOS] Ignore MacOS specific info.",
   "  -K   [AtheOS, BeOS, Unix] Restore SUID/SGID/Tacky file attributes.",
@@ -2172,6 +2261,8 @@ static void help_extended(__G)
   "  -N   [Amiga] Extract file comments as Amiga filenotes.",
   "  -o   Overwrite existing files without prompting.  Useful with -f.  Use with",
   "         care.",
+  "  -O CHARSET  [UNIX] Specify a character encoding for DOS, Windows",
+  "                and OS/2 archives.",
   "  -P p Use password p to decrypt files.  THIS IS INSECURE!  Some OS show",
   "         command line to other users.",
   "  -q   Perform operations quietly.  The more q (as in -qq) the quieter.",
@@ -2264,6 +2355,9 @@ static void help_extended(__G)
   "        representing the Unicode character number of the character in hex.",
   "  -UU [UNICODE]  Disable use of any UTF-8 path information.",
   "  -z  Include archive comment if any in listing.",
+  "  -O CHARSET  [UNIX] Specify a character encoding for DOS, Windows",
+  "                and OS/2 archives.",
+  "  -I CHARSET  [UNIX] Specify a character encoding for UNIX and other archives.",
   "",
   "",
   "funzip stream extractor:",
--- a/unzpriv.h
+++ b/unzpriv.h
@@ -3012,16 +3012,24 @@ char    *GetLoadPath     OF((__GPRO));                              /* local */
  *
  * All other ports are assumed to code zip entry filenames in ISO 8859-1.
  */
+
+// 2024-05-25 Removed "|| (isuxatt)": actually we know nothing
+// about local system's codepage of PKZIP 2.51 UNIX users.
+// Also removed "_ISO_INTERN((string)); \":
+// Windows ANSI is not always 1252, also standard defines default
+// charset as CP437, not ISO 8859-1. But in fact most of packers
+// just used local system's charset, so without any charset translation
+// we will at least make such archives processed correctly
+// on the same system - Ivan Sorokin <un...@mail.ru>
+
 #ifndef Ext_ASCII_TO_Native
 #  define Ext_ASCII_TO_Native(string, hostnum, hostver, isuxatt, islochdr) \
     if (((hostnum) == FS_FAT_ && \
-         !(((islochdr) || (isuxatt)) && \
+         !((islochdr) && \
            ((hostver) == 25 || (hostver) == 26 || (hostver) == 40))) || \
         (hostnum) == FS_HPFS_ || \
-        ((hostnum) == FS_NTFS_ && (hostver) == 50)) { \
+        ((hostnum) == FS_NTFS_)) { \
         _OEM_INTERN((string)); \
-    } else { \
-        _ISO_INTERN((string)); \
     }
 #endif
 
--- a/zipinfo.c
+++ b/zipinfo.c
@@ -457,6 +457,10 @@ int zi_opts(__G__ pargc, pargv)
     int    tflag_slm=TRUE, tflag_2v=FALSE;
     int    explicit_h=FALSE, explicit_t=FALSE;
 
+#ifdef UNIX
+    extern char OEM_CP[MAX_CP_NAME];
+    extern char ISO_CP[MAX_CP_NAME];
+#endif
 
 #ifdef MACOS
     uO.lflag = LFLAG;         /* reset default on each call */
@@ -501,6 +505,35 @@ int zi_opts(__G__ pargc, pargv)
                             uO.lflag = 0;
                     }
                     break;
+#ifdef UNIX
+                case ('I'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Icharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(ISO_CP, s, sizeof(ISO_CP));
+                        } else { /* -I charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(ISO_CP, s, sizeof(ISO_CP));
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case 'l':      /* longer form of "ls -l" type listing */
                     if (negative)
                         uO.lflag = -2, negative = 0;
@@ -521,6 +554,35 @@ int zi_opts(__G__ pargc, pargv)
                         G.M_flag = TRUE;
                     break;
 #endif
+#ifdef UNIX
+                case ('O'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Ocharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -O argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                        } else { /* -O charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -O argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case 's':      /* default:  shorter "ls -l" type listing */
                     if (negative)
                         uO.lflag = -2, negative = 0;
