lurongjiang opened a new issue, #1041:
URL: https://github.com/apache/poi/issues/1041

   
   ## Description
   
   DOC files created by WPS Office (Kingsoft Office) cannot be parsed by Apache 
POI. The files have valid OLE2 headers (`D0 CF 11 E0`) and are detected as 
`application/msword` by Tika, but POI throws `IndexOutOfBoundsException: Block 
XX not found` when attempting to parse them.
   
   ## Steps to Reproduce
   
   1. Obtain a DOC file created by WPS Office (sample file attached)
   2. Try to parse the file using POI HWPFDocument or Tika
   
   Example code:
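   ```java
   import java.io.FileInputStream;
   import org.apache.poi.hwpf.HWPFDocument;
   import org.apache.poi.hwpf.extractor.WordExtractor;

   // Open the WPS-created file and try to extract its text.
   try (FileInputStream fis = new FileInputStream("sample.doc")) {
       HWPFDocument document = new HWPFDocument(fis);
       WordExtractor extractor = new WordExtractor(document);
       String text = extractor.getText();
       System.out.println(text);
   } catch (Exception e) {
       // For the attached file this prints the IndexOutOfBoundsException shown below.
       e.printStackTrace();
   }
   ```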
   
   ## Expected Behavior
   
   The DOC file should be parsed successfully, returning the document text 
content.
   
   ## Actual Behavior
   
   ```
   java.lang.IndexOutOfBoundsException: Block 52 not found
       at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:474)
       ...
   Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 27136 in stream of length 26750
   ```
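
   The numbers in the trace are consistent with a sector that lies past the end of the file. A minimal sketch of the arithmetic, assuming the usual OLE2 layout in which one header sector precedes sector 0 (the class and method names are illustrative only, not POI API):

   ```java
   // Sketch: where an OLE2 sector starts in the file, and whether it fits.
   class SectorOffsetCheck {
       /** Byte offset of sector `index`; one header sector precedes sector 0. */
       static long sectorOffset(int index, int sectorSize) {
           return (index + 1L) * sectorSize;
       }

       /** True if the whole sector lies inside a file of the given length. */
       static boolean fitsInFile(int index, int sectorSize, long fileLength) {
           return sectorOffset(index, sectorSize) + sectorSize <= fileLength;
       }

       public static void main(String[] args) {
           long fileLength = 26_750; // size of the attached file
           System.out.println(sectorOffset(52, 512));           // 27136
           System.out.println(fitsInFile(52, 512, fileLength)); // false
           System.out.println(fitsInFile(50, 512, fileLength)); // true
       }
   }
   ```

   Sector 50 is the last one fully contained in the 26,750-byte file. Sector 52 would start at byte 27136, which matches the offset in the `Caused by` line.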
   
   ## File Analysis
   
   ### File Header
   - File size: 26,750 bytes
   - Magic bytes: `D0 CF 11 E0 A1 B1 1A E1` (valid OLE2 signature)
   - Tika detects as: `application/msword`
   
   ### OLE2 Structure (from header)
   ```
   Offset 0x00: D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00
   Offset 0x10: 00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00
   Offset 0x20: 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
   Offset 0x30: 01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00
   Offset 0x40: 01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00
   ```
   
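   - Sector size: 512 bytes (sector shift `09 00` at offset 0x1E)
   - FAT sector count: 1 (offset 0x2C)
   - First FAT sector ID: 0 (first DIFAT entry, offset 0x4C)
   - MiniFAT: first sector 2, sector count 1 (offsets 0x3C and 0x40)
   - First DIFAT sector: -2 (0xFFFFFFFE, the end-of-chain marker, i.e. no extra DIFAT sectors; offset 0x44)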
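
   The header fields can be cross-checked by decoding the dump above as little-endian values at the offsets defined by the compound file format. This is a rough sketch using only `java.nio`; the class `Ole2HeaderDump` and its helper methods are hypothetical names for illustration, not part of POI:

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   class Ole2HeaderDump {
       // First 0x50 bytes of the file, copied from the hex dump above.
       static final String HEX =
           "D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 " +
           "00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00 " +
           "06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 " +
           "01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 " +
           "01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00";

       /** Parses the hex text into a little-endian buffer. */
       static ByteBuffer header() {
           String[] parts = HEX.trim().split("\\s+");
           byte[] bytes = new byte[parts.length];
           for (int i = 0; i < parts.length; i++) {
               bytes[i] = (byte) Integer.parseInt(parts[i], 16);
           }
           return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
       }

       static int sectorSize(ByteBuffer h)         { return 1 << h.getShort(0x1E); }
       static int fatSectorCount(ByteBuffer h)     { return h.getInt(0x2C); }
       static int firstDirectorySector(ByteBuffer h) { return h.getInt(0x30); }
       static int firstMiniFatSector(ByteBuffer h) { return h.getInt(0x3C); }
       static int miniFatSectorCount(ByteBuffer h) { return h.getInt(0x40); }
       static int firstDifatSector(ByteBuffer h)   { return h.getInt(0x44); }
       static int firstFatSector(ByteBuffer h)     { return h.getInt(0x4C); }

       public static void main(String[] args) {
           ByteBuffer h = header();
           System.out.println("sector size          = " + sectorSize(h));         // 512
           System.out.println("FAT sector count     = " + fatSectorCount(h));     // 1
           System.out.println("first directory sec. = " + firstDirectorySector(h)); // 1
           System.out.println("first MiniFAT sector = " + firstMiniFatSector(h)); // 2
           System.out.println("MiniFAT sector count = " + miniFatSectorCount(h)); // 1
           System.out.println("first DIFAT sector   = " + firstDifatSector(h));   // -2
           System.out.println("first FAT sector     = " + firstFatSector(h));     // 0
       }
   }
   ```

   The value -2 (0xFFFFFFFE) appears at offset 0x44, which is the first DIFAT sector field, so it reads as an end-of-chain marker rather than a count.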
   
   ### WPS Signature Found
   The file contains WPS-specific metadata:
   - `WPS Office_11.1.0.10009`
   - `KSOTemplate`
   - `KSOProductBuildVer`
   
   ### Comparison with WPS "Save As" Fixed File
   When the same file is saved again using WPS (Save As -> doc), the new file:
   - Size: 25,536 bytes (reduced by 1,214 bytes)
   - Header: **Identical** to original file
   - Can be parsed successfully by POI
   
   This suggests the problem lies in the file's internal OLE2 structure (for example, a FAT chain that references a sector beyond the end of the file), not in the file header.
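
   One way to test that hypothesis without POI is to scan the FAT for entries that point at sectors the file does not fully contain. The following is only a diagnostic sketch under stated assumptions: it follows just the DIFAT slots stored in the header (enough for a file this small), and `FatRangeCheck` is a hypothetical helper, not POI API. It has not been run against the attached file.

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.nio.file.Files;
   import java.nio.file.Paths;
   import java.util.ArrayList;
   import java.util.List;

   class FatRangeCheck {
       /** Lists FAT entries that involve sectors not fully present in the file. */
       static List<String> findOutOfRange(byte[] file) {
           ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
           int sectorSize = 1 << buf.getShort(0x1E);
           int fatSectors = buf.getInt(0x2C);
           // Sectors completely inside the file; the header takes one sector.
           long completeSectors = file.length / sectorSize - 1;
           int entriesPerSector = sectorSize / 4;
           List<String> problems = new ArrayList<>();
           // Assumption: only the 109 DIFAT slots in the header are followed.
           for (int i = 0; i < Math.min(fatSectors, 109); i++) {
               int fatSector = buf.getInt(0x4C + 4 * i);
               if (fatSector < 0 || fatSector >= completeSectors) {
                   problems.add("FAT sector " + fatSector + " is outside the file");
                   continue;
               }
               int base = (fatSector + 1) * sectorSize;
               for (int j = 0; j < entriesPerSector; j++) {
                   int next = buf.getInt(base + 4 * j);
                   int owner = i * entriesPerSector + j;
                   // -1 marks a free sector; other negative values are markers
                   // (end of chain, FAT sector, DIFAT sector).
                   if (next != -1 && owner >= completeSectors) {
                       problems.add("sector " + owner + " is allocated but not fully present");
                   }
                   if (next >= completeSectors) {
                       problems.add("sector " + owner + " -> " + next + " (past end of file)");
                   }
               }
           }
           return problems;
       }

       public static void main(String[] args) throws Exception {
           byte[] file = Files.readAllBytes(Paths.get(args[0]));
           for (String problem : findOutOfRange(file)) {
               System.out.println(problem);
           }
       }
   }
   ```

   Running it on both the original file and the "Save As" copy and comparing the output would show whether the original FAT references sectors that were never written.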
   
   ## Environment
   
   - Apache POI version: [5.2.5,5.5.1]
   - Java version: 1.8
   
   
   ## Additional Notes
   
   1. **WPS Office can directly open and edit these files** - no issues
   2. When using WPS "Save As" (without any modification, just save to same 
format), POI can parse the resulting file successfully
   
   
   
   [22.doc](https://github.com/user-attachments/files/26533683/22.doc)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

