https://bz.apache.org/bugzilla/show_bug.cgi?id=59747

            Bug ID: 59747
           Summary: xlsx file does not conform to bit patterns used by
                    common file type detection software
           Product: POI
           Version: 3.14-FINAL
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSSF
          Assignee: [email protected]
          Reporter: [email protected]

Hi,

I'm creating this bug due to a problem we've encountered with POI-generated
xlsx files.

Apparently the order of zip entries in xlsx files matters to tools which
determine the file type by matching a byte pattern. See for example Apache Tika
(without the deeper OOXML support library) and Linux's file command.

The OOXML spec and Excel have no problem with POI files, but tools relying on a
certain byte pattern do.

Here is the output of unzip -l on a POI xlsx file:

Archive:  poi.xlsx
  Length     Date   Time    Name
 --------    ----   ----    ----
      591  02.06.16 12:40   _rels/.rels
     1063  02.06.16 12:40   [Content_Types].xml
      183  02.06.16 12:40   docProps/app.xml
      437  02.06.16 12:40   docProps/core.xml
      137  02.06.16 12:40   xl/sharedStrings.xml
      818  02.06.16 12:40   xl/styles.xml
      349  02.06.16 12:40   xl/workbook.xml
      569  02.06.16 12:40   xl/_rels/workbook.xml.rels
      670  02.06.16 12:40   xl/worksheets/sheet1.xml
 --------                   -------
     4817                   9 files

And for a native file:

Archive:  excel.xlsx
  Length     Date   Time    Name
 --------    ----   ----    ----
     1032  01.01.80 00:00   [Content_Types].xml
      588  01.01.80 00:00   _rels/.rels
      557  01.01.80 00:00   xl/_rels/workbook.xml.rels
      906  01.01.80 00:00   xl/workbook.xml
     1542  01.01.80 00:00   xl/styles.xml
     6790  01.01.80 00:00   xl/theme/theme1.xml
     1306  01.01.80 00:00   xl/worksheets/sheet1.xml
      593  01.01.80 00:00   docProps/core.xml
      816  01.01.80 00:00   docProps/app.xml
 --------                   -------
    14130                   9 files

Linux's file command and Tika seem to expect [Content_Types].xml as the
first entry, skip the second, and look for "xl/" in the third entry.
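
As a rough illustration of the check described above, here is a minimal Java
sketch. It is not taken from Tika or from the file command; the class and
method names are hypothetical, and the rule it applies is only the one stated
in this report (first entry is [Content_Types].xml, third entry starts with
"xl/"). It reads entries with ZipInputStream so that names come back in the
order they are physically stored in the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of POI, Tika, or file(1).
public class XlsxEntryOrderCheck {

    // Returns entry names in the order their local headers appear in the stream.
    public static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Assumed rule, taken from the description in this report only:
    // entry 1 is [Content_Types].xml, entry 2 is ignored, entry 3 starts with "xl/".
    public static boolean matchesExpectedOrder(List<String> names) {
        return names.size() >= 3
                && names.get(0).equals("[Content_Types].xml")
                && names.get(2).startsWith("xl/");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java XlsxEntryOrderCheck some.xlsx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<String> names = entryNames(in);
            System.out.println(names);
            System.out.println("matches expected order: " + matchesExpectedOrder(names));
        }
    }
}
```

Applied to the two listings above, the first three entries of poi.xlsx do not
satisfy this rule, while those of excel.xlsx do.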

Would it be possible to fix the order of the entries?

We've written a simple post-processing tool which rewrites the zip file, but
we would be happy to have this in POI proper.
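
For illustration, such a rewrite step could look roughly like the following
Java sketch. This is not the tool mentioned above; the class name and the
ranking are assumptions. It puts [Content_Types].xml first, _rels/.rels
second, then everything under xl/, then the rest, which loosely follows the
Excel listing. It holds all entries in memory and does not preserve
timestamps or other per-entry metadata, and whether a given detector accepts
the result has not been verified here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical post-processing helper, not part of POI.
public class XlsxReorder {

    // Assumed ordering: lower rank is written earlier.
    static int rank(String name) {
        if (name.equals("[Content_Types].xml")) return 0;
        if (name.equals("_rels/.rels")) return 1;
        if (name.startsWith("xl/")) return 2;
        return 3;
    }

    public static void reorder(InputStream in, OutputStream out) throws IOException {
        // Read every entry into memory, remembering the original order.
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buf)) != -1) {
                    data.write(buf, 0, n);
                }
                entries.put(entry.getName(), data.toByteArray());
            }
        }

        // List.sort is stable, so entries with equal rank keep their relative order.
        List<String> names = new ArrayList<>(entries.keySet());
        names.sort(Comparator.comparingInt(XlsxReorder::rank));

        // Write the entries back out in the new order.
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (String name : names) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write(entries.get(name));
                zout.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java XlsxReorder input.xlsx output.xlsx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            reorder(in, out);
        }
    }
}
```

Doing the same thing inside POI would avoid the second pass over the file,
since the entries could be written in the desired order to begin with.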

Thanks, and please contact me if I can help.

