lurongjiang opened a new issue, #1041:
URL: https://github.com/apache/poi/issues/1041
## Description
DOC files created by WPS Office (Kingsoft Office) cannot be parsed by Apache
POI. The files have valid OLE2 headers (`D0 CF 11 E0`) and are detected as
`application/msword` by Tika, but POI throws `IndexOutOfBoundsException: Block
XX not found` when attempting to parse them.
## Steps to Reproduce
1. Obtain a DOC file created by WPS Office (sample file attached)
2. Try to parse the file using POI HWPFDocument or Tika
Example code:
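```java
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

try (FileInputStream fis = new FileInputStream("sample.doc")) {
    // The HWPFDocument constructor is where the exception is thrown
    HWPFDocument document = new HWPFDocument(fis);
    WordExtractor extractor = new WordExtractor(document);
    String text = extractor.getText();
} catch (Exception e) {
    e.printStackTrace();
}
```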
## Expected Behavior
The DOC file should be parsed successfully, returning the document text
content.
## Actual Behavior
```
java.lang.IndexOutOfBoundsException: Block 52 not found
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:474)
    ...
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 27136 in stream of length 26750
```
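The offset in the second message is consistent with the usual OLE2 layout, where the header occupies the first sector-sized slot, so sector `n` starts at `(n + 1) * sectorSize`. A minimal sketch of that arithmetic, assuming 512-byte sectors (the class and method names are illustrative and not part of POI):

```java
public class SectorOffsetCheck {
    /** Byte offset of a sector, assuming the header fills one sector-sized slot. */
    public static long sectorOffset(int sectorIndex, int sectorSize) {
        return (long) (sectorIndex + 1) * sectorSize;
    }

    /** True if the whole sector lies inside a file of the given length. */
    public static boolean sectorFits(int sectorIndex, int sectorSize, long fileLength) {
        return sectorOffset(sectorIndex, sectorSize) + sectorSize <= fileLength;
    }

    public static void main(String[] args) {
        int sectorSize = 512;      // assumed, matching the "512 bytes" in the exception
        long fileLength = 26_750;  // file size reported in this issue
        System.out.println("offset of block 52: " + sectorOffset(52, sectorSize));
        System.out.println("block 52 fits: " + sectorFits(52, sectorSize, fileLength));
        System.out.println("block 51 fits: " + sectorFits(51, sectorSize, fileLength));
        System.out.println("block 50 fits: " + sectorFits(50, sectorSize, fileLength));
    }
}
```

Under these assumptions block 52 would start at byte 27,136, which is already past the end of the 26,750-byte file, and block 51 would be only partially present (126 bytes). Block 50 is the last one that fits completely.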
## File Analysis
### File Header
- File size: 26,750 bytes
- Magic bytes: `D0 CF 11 E0 A1 B1 1A E1` (valid OLE2 signature)
- Tika detects as: `application/msword`
### OLE2 Structure (from header)
```
Offset 0x00: D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00
Offset 0x10: 00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00
Offset 0x20: 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
Offset 0x30: 01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00
Offset 0x40: 01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00
```
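- Sector size: 512 bytes (sector shift 9 at offset 0x1E)
- FAT sector count: 1 (offset 0x2C)
- First FAT sector ID: 0 (DIFAT[0] at offset 0x4C)
- MiniFAT: 1 sector, starting at sector 2 (offsets 0x40 and 0x3C)
- First DIFAT sector: -2 (0xFFFFFFFE, ENDOFCHAIN, offset 0x44), i.e. no DIFAT sectors beyond the header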
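For reference, the header bytes above can be decoded against the field offsets of the compound file header as described in the MS-CFB specification. The following is a rough sketch using only the JDK, not POI's own `HeaderBlock` logic. The class name, the helper methods and the choice of printed fields are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole2HeaderFields {
    // The 0x50 header bytes exactly as listed in the hex dump above.
    static final String DUMP =
          "D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 "
        + "00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00 "
        + "06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 "
        + "01 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 "
        + "01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00";

    /** Parses the dump into a little-endian buffer (OLE2 integers are little-endian). */
    public static ByteBuffer header() {
        String[] hex = DUMP.trim().split("\\s+");
        ByteBuffer buf = ByteBuffer.allocate(hex.length).order(ByteOrder.LITTLE_ENDIAN);
        for (String h : hex) {
            buf.put((byte) Integer.parseInt(h, 16));
        }
        return buf;
    }

    /** Signed 16-bit field at the given header offset. */
    public static int int16At(int offset) {
        return header().getShort(offset);
    }

    /** Signed 32-bit field at the given header offset. */
    public static int int32At(int offset) {
        return header().getInt(offset);
    }

    public static void main(String[] args) {
        System.out.println("sector size           : " + (1 << int16At(0x1E)));
        System.out.println("FAT sector count      : " + int32At(0x2C));
        System.out.println("first directory sector: " + int32At(0x30));
        System.out.println("first mini FAT sector : " + int32At(0x3C));
        System.out.println("mini FAT sector count : " + int32At(0x40));
        System.out.println("first DIFAT sector    : 0x" + Integer.toHexString(int32At(0x44)));
        System.out.println("first FAT sector      : " + int32At(0x4C));
    }
}
```

With these bytes the sketch reports a 512-byte sector size, one FAT sector located at sector 0, one mini FAT sector starting at sector 2, and `0xfffffffe` (ENDOFCHAIN) as the first DIFAT sector.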
### WPS Signature Found
The file contains WPS-specific metadata:
- `WPS Office_11.1.0.10009`
- `KSOTemplate`
- `KSOProductBuildVer`
### Comparison with WPS "Save As" Fixed File
When the same file is saved again using WPS (Save As -> doc), the new file:
- Size: 25,536 bytes (reduced by 1,214 bytes)
- Header: **Identical** to original file
- Can be parsed successfully by POI
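Because the two headers are identical, this suggests the issue is in the internal OLE2 structure (for example, a FAT chain that references a sector past the end of the file), not in the file header.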
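One way to test the FAT-chain hypothesis without POI might be to scan the first FAT sector for entries whose target sector starts beyond the end of the file. The sketch below is only a diagnostic idea under stated assumptions. It reads a single FAT sector (located via DIFAT[0] at offset 0x4C), which covers 128 sectors at a 512-byte sector size, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FatRangeScan {
    // Values above this are reserved markers (DIFSECT, FATSECT, ENDOFCHAIN, FREESECT).
    private static final long MAX_REGULAR_SECTOR = 0xFFFFFFFAL;

    /** Returns the FAT entry indexes whose "next sector" starts at or past the end of the file. */
    public static List<Integer> danglingEntries(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        int sectorSize = 1 << buf.getShort(0x1E);   // sector shift
        int firstFatSector = buf.getInt(0x4C);      // DIFAT[0]
        long fatOffset = (long) (firstFatSector + 1) * sectorSize;
        if (firstFatSector < 0 || fatOffset + sectorSize > file.length) {
            throw new IllegalArgumentException("first FAT sector is not inside the file");
        }
        List<Integer> dangling = new ArrayList<>();
        for (int i = 0; i < sectorSize / 4; i++) {
            // FAT entries are unsigned 32-bit values
            long next = buf.getInt((int) fatOffset + i * 4) & 0xFFFFFFFFL;
            if (next > MAX_REGULAR_SECTOR) {
                continue; // end-of-chain, free, or other reserved marker
            }
            long nextStart = (next + 1) * sectorSize;
            if (nextStart >= file.length) {
                dangling.add(i);
            }
        }
        return dangling;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("FAT entries pointing past end of file: " + danglingEntries(file));
    }
}
```

Running it on both the original file and the "Save As" copy could show whether only the original has such entries. A partially present last sector is deliberately not flagged here, since the re-saved file is not a multiple of 512 bytes either and still parses.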
## Environment
- Apache POI version: [5.2.5,5.5.1]
- Java version: 1.8
## Additional Notes
1. **WPS Office can directly open and edit these files** - no issues
2. After a WPS "Save As" to the same format, without any modification, POI
parses the resulting file successfully
[22.doc](https://github.com/user-attachments/files/26533683/22.doc)