Hi Enrico,

Thanks for your feedback.

> Sorry, but I cannot reproduce this problem. The file names are already
> utf-8 encoded as they are extracted from the LyX file and the created
> zip file seems to be OK. I attach here an example zip file created with
> lyxpak.py (it is inside a tar archive to avoid problems with mail agents).
> Although when listed the file names in the zip file may appear as mangled,
> they are actually extracted just fine:
>

I did find bugs in the patch I submitted, but for some reason the mail
I sent to the mailing list does not seem to have been delivered (I hope
this one gets through).

The old patch mishandled filesystems whose encoding is not UTF-8. The
problem occurs because the .lyx file stores filenames as UTF-8, while the
filesystem may use a different encoding. For example, Windows uses "mbcs".
The attached modified patch addresses these issues. Also, by using the
"arcname" argument, the zip file it produces lists the files correctly.

For example, the file listing without the patch:

Path = בדיקה.zip
Type = zip
Physical Size = 966

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2015-07-01 21:59:40 .....         1634          738  ×××ק×.lyx
2015-07-01 21:59:04 .....            0            2  ק×××¥.tex
------------------- ----- ------------ ------------  ------------------------
                                  1634          740  2 files, 0 folders

And with the patch:

Path = בדיקה.zip
Type = zip
Physical Size = 966

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2015-07-01 21:59:40 .....         1634          738  בדיקה.lyx
2015-07-01 21:59:04 .....            0            2  קובץ.tex
------------------- ----- ------------ ------------  ------------------------
                                  1634          740  2 files, 0 folders
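
As a rough illustration of the "arcname" point, here is a minimal sketch
(not the patch itself). It assumes a modern Python 3 interpreter, where
every str is already unicode, and uses a hypothetical helper named
zip_with_arcname. It writes one temporary file into an in-memory zip under
a Hebrew archive name and reads the entry back, so the UTF-8 filename flag
(general-purpose bit 11) can be inspected.

```python
import io
import os
import tempfile
import zipfile

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8


def zip_with_arcname(path, arcname):
    """Hypothetical helper: zip `path` under `arcname`, return its ZipInfo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(path, arcname)
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return zf.infolist()[0]


# Create an empty temporary file to stand in for an included .tex file.
fd, tmp_path = tempfile.mkstemp(suffix=".tex")
os.close(fd)
try:
    # "\u05e7\u05d5\u05d1\u05e5" is the Hebrew file name from the listing.
    info = zip_with_arcname(tmp_path, u"\u05e7\u05d5\u05d1\u05e5.tex")
finally:
    os.remove(tmp_path)

name_is_utf8 = bool(info.flag_bits & UTF8_NAME_FLAG)
```

With a unicode archive name containing non-ASCII characters, the entry
should carry the UTF-8 flag, which is what lets archive tools list the name
correctly.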


The proposed patch should also work with non-UTF-8 filesystems (I did
limited testing on a Windows 7 machine).
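
The re-encoding step can be sketched as follows. This is only an
illustration: the optional fs_encoding parameter is an assumption added so
the behaviour can be shown without depending on the machine it runs on, and
"cp1255" merely stands in for a non-UTF-8 filesystem encoding.

```python
import sys


def fs_encode(utf8_name, fs_encoding=None):
    """Re-encode a UTF-8 byte string into the filesystem's encoding.

    `fs_encoding` is a hypothetical parameter for demonstration; when it is
    omitted, the interpreter's own filesystem encoding is used.
    """
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    return utf8_name.decode("utf-8").encode(fs_encoding)


# File name as it would be stored in the .lyx file (UTF-8 bytes).
name_in_lyx = u"\u05e7\u05d5\u05d1\u05e5.tex".encode("utf-8")

on_utf8_fs = fs_encode(name_in_lyx, "utf-8")      # unchanged
on_cp1255_fs = fs_encode(name_in_lyx, "cp1255")   # different byte sequence
```

On a UTF-8 filesystem the conversion is a no-op, which would explain why
the problem does not show up there.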

Regarding the first patch:

> In your patch you unconditionally strip the last path component, which is
> wrong. Although the case you describe is rare, it may nonetheless happen,
> but the attached 01 patch should suffice. Please, report back whether it
> works for you.
>

The proposed change still fails when the last component of the common
prefix happens to be a valid directory.

For example, consider the following layout:

lyx_file.lyx
lyx_included.tex
lyx_/

Here lyx_file.lyx includes lyx_included.tex. The common prefix will be
"lyx_", which is a valid directory, hence the proposed check in your patch
passes, but afterwards it won't be able to locate the files.
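
A small sketch of this layout, assuming the prefix is computed with
os.path.commonprefix and that the check amounts to an os.path.isdir test
(both are assumptions about the script, not quotes from it):

```python
import os
import shutil
import tempfile

top = tempfile.mkdtemp()
try:
    # Recreate the layout: two files and a directory sharing the "lyx_" stem.
    files = [os.path.join(top, "lyx_file.lyx"),
             os.path.join(top, "lyx_included.tex")]
    for name in files:
        open(name, "w").close()
    os.mkdir(os.path.join(top, "lyx_"))

    # commonprefix works character by character, not per path component.
    prefix = os.path.commonprefix(files)
    prefix_is_dir = os.path.isdir(prefix)
    prefix_is_parent = os.path.normpath(prefix) == os.path.normpath(top)
finally:
    shutil.rmtree(top)
```

Here prefix_is_dir is true even though the prefix is not the directory that
contains the files, so a directory test alone does not catch this case.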

My patch indeed strips the last path component, but I'm pretty sure that
the part that gets removed is never the common parent directory. For
example, consider the following cases:

some_dir/lyx_file.lyx
some_dir/lyx_included.lyx

The common prefix will be "some_dir/lyx_". The "lyx_" part will indeed be
removed, and the result will be "some_dir/", like it should.

some_dir/lyx_file.lyx
some_dir/tex_file.lyx

The common prefix will be "some_dir/". Because the "/" is part of the
common prefix, truncating at the last separator and appending os.path.sep
again also yields "some_dir/".
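
Both cases can be checked with a short sketch. The helper name
top_directory is hypothetical, and posixpath with a fixed "/" separator is
used so the result does not depend on the platform:

```python
import posixpath


def top_directory(paths, sep="/"):
    """Hypothetical helper: common prefix cut at its last separator."""
    prefix = posixpath.commonprefix(paths)
    # rpartition keeps everything before the last separator; the separator
    # is then appended again so the result ends with exactly one "/".
    return prefix.rpartition(sep)[0] + sep


same_stem = top_directory(["some_dir/lyx_file.lyx",
                           "some_dir/lyx_included.lyx"])
diff_stem = top_directory(["some_dir/lyx_file.lyx",
                           "some_dir/tex_file.lyx"])
```

In both cases the result is "some_dir/".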

Furthermore, consider the edge case where the common prefix is "/".
We have:
In [6]: os.path.dirname('/') + os.path.sep
Out[6]: '//'

In [7]: "/".rpartition(os.path.sep)[0] + os.path.sep
Out[7]: '/'

This shows that the original lyxpak.py (and I believe your patch as well)
will set topdir to "//" instead of "/", which means that the string
replacement in

    while i < len(incfiles):
        incfiles[i] = string.replace(incfiles[i], topdir, '', 1)
        i += 1

will fail.
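
To make the effect concrete, here is a minimal sketch using str.replace
instead of the string module; the file names and the helper strip_topdir
are made up for illustration:

```python
# Hypothetical absolute paths of files that live directly under "/".
incfiles = ["/doc.lyx", "/figs/plot.tex"]


def strip_topdir(paths, topdir):
    """Remove the first occurrence of `topdir` from each path."""
    return [p.replace(topdir, "", 1) for p in paths]


with_double_slash = strip_topdir(incfiles, "//")  # "//" never matches
with_single_slash = strip_topdir(incfiles, "/")   # leading "/" is removed
```

With "//" the paths come back unchanged, while "/" gives the expected
relative names.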

Enrico, did you find a use case where my patch results in the wrong
behaviour? I might be missing some edge case myself, which I would be happy
to fix.

Thanks a lot for your guidance and remarks.

Regards,

Guy
From 249d30e46b24b92b8e5823e72f8322539620c780 Mon Sep 17 00:00:00 2001
From: Guy Rutenberg <guyrutenb...@gmail.com>
Date: Fri, 26 Jun 2015 19:54:34 +0300
Subject: [PATCH] lyxpak: Fix filename encoding in zip export.

By default Python's zipfile module stores files in the CP437 encoding.
This can create unreadable filenames on some systems. However, if one
passes `arcname` as a unicode object, a utf-8 representation of the
filename is kept.

Moreover, LyX's internal filename representation is UTF-8, which may
differ from the filesystem's encoding. This may lead to problems
including files which have foreign characters in their names. The patch
fixes this issue.
---
 lib/scripts/lyxpak.py |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/lib/scripts/lyxpak.py b/lib/scripts/lyxpak.py
index cc1bdfd..f085b28 100755
--- a/lib/scripts/lyxpak.py
+++ b/lib/scripts/lyxpak.py
@@ -87,6 +87,9 @@ def abspath(name):
         newname = os.path.realpath(newname)
     return newname
 
+def fs_encode(path):
+    """Convert path from utf-8 to filesystem's encoding."""
+    return path.decode('utf-8').encode(sys.getfilesystemencoding())
 
 def gather_files(curfile, incfiles, lyx2lyx):
     " Recursively gather files."
@@ -133,6 +136,7 @@ def gather_files(curfile, incfiles, lyx2lyx):
         maybe_in_ert = is_lyxfile and lines[i] == "\\backslash"
         if match:
             file = match.group(4).strip('"')
+            file = fs_encode(file)
             if not os.path.isabs(file):
                 file = os.path.join(curdir, file)
             file_exists = False
@@ -159,6 +163,7 @@ def gather_files(curfile, incfiles, lyx2lyx):
             file = match.group(3).strip('"')
             if file.startswith("bibtotoc,"):
                 file = file[9:]
+            file = fs_encode(file)
             if not os.path.isabs(file):
                 file = os.path.join(curdir, file + '.bst')
             if os.path.exists(file):
@@ -172,10 +177,11 @@ def gather_files(curfile, incfiles, lyx2lyx):
             bibfiles = match.group(3).strip('"').split(',')
             j = 0
             while j < len(bibfiles):
-                if os.path.isabs(bibfiles[j]):
-                    file = bibfiles[j] + '.bib'
+                bibfile = fs_encode(bibfiles[j])
+                if os.path.isabs(bibfile):
+                    file = bibfile + '.bib'
                 else:
-                    file = os.path.join(curdir, bibfiles[j] + '.bib')
+                    file = os.path.join(curdir, bibfile + '.bib')
                 if os.path.exists(file):
                     incfiles.append(abspath(file))
                 j += 1
@@ -322,7 +328,10 @@ def main(args):
         if makezip:
             zip = zipfile.ZipFile(ar_name, "w", zipfile.ZIP_DEFLATED)
             for file in incfiles:
-                zip.write(file)
+                # Passing unicode object as `arcname` ensures that the files
+                # are stored as utf-8 in the zip, preserving non-ascii
+                # characters properly
+                zip.write(file, unicode(file, sys.getfilesystemencoding()))
             zip.close()
         else:
             tar = tarfile.open(ar_name, "w:gz")
-- 
1.7.9.5
