I have an odd problem with some oldish WordPerfect files. It seems to be caused by the WordPerfect magic string "WPC" sitting at an offset in some older WP files, i.e. it does not start at the second byte of the file.

I've been debugging a couple of problems with Larry's WordPerfect Indexer plug-in for Google Desktop Search. Some files index OK, but quite a few don't. Of those that don't, some appear to fail because of an issue with GDS itself, but others appear to fail because of an issue either with the file itself or with libwpd.

Here's what I've noticed about the files that don't index, excluding the ones that might be a GDS problem:

1. All the files that don't index were created in 1997, 1998, and early 1999 - I think that they might have been created with WP7. I certainly upgraded to WP8 some time in 1999.

2. All the files can be read OK by
  a) WP 8.0.0.710
  b) OpenOffice 2.2.0
  c) wpd2text.exe File Version 0.8.9.4573 (although the "Other version information" has File Version = 0.8.8.70109)

3. Larry's indexer uses a WPXMemoryInputStream to process the file; I think wpd2text uses a GSFInputStream (I'm assuming it's built from the same source as the one I have built on MinGW on another machine).

4. I've rebuilt Larry's indexer using MS Visual Studio with both libwpd 0.8.8 and 0.8.9 - the behaviour is the same with each (not really surprising as WPXHeader.cpp is essentially identical in both)

5. If I open one of these files in WP8 and resave it, the resaved file indexes fine.

6. The error that is occurring is due to WPXHeader::constructHeader() not being able to build a header, specifically the file magic is not equal to "WPC".

7. For two of the original files and their resaved copies, the first 16 bytes of each (hex on the first line, octal/character on the second) are:
File 1 original:
0000000 d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00
        320 317 021 340 241 261 032 341  \0  \0  \0  \0  \0  \0  \0  \0
File 1 resaved:
0000000 ff 57 50 43 ef 08 00 00 01 0a 02 01 00 00 00 02
        377   W   P   C 357  \b  \0  \0 001  \n 002 001  \0  \0  \0 002
File 2 original:
0000000 d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00
        320 317 021 340 241 261 032 341  \0  \0  \0  \0  \0  \0  \0  \0
File 2 resaved:
0000000 ff 57 50 43 3d 1c 00 00 01 0a 02 01 00 00 00 02
        377   W   P   C   = 034  \0  \0 001  \n 002 001  \0  \0  \0 002

8. The original files seem to have a directory or TOC of some sort at the beginning, and the actual WP header is offset into the file. For File 1, it's offset 2KB in (octal 0004000 = 2048 bytes):
0004000 ff 57 50 43 c5 08 00 00 01 0a 02 02 00 00 00 02
        377   W   P   C 305  \b  \0  \0 001  \n 002 002  \0  \0  \0 002
For File 2, it's offset 6KB in (octal 0014000 = 6144 bytes):
0014000 ff 57 50 43 34 1d 00 00 01 0a 02 02 00 00 00 02
        377   W   P   C   4 035  \0  \0 001  \n 002 002  \0  \0  \0 002
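
The checks in items 7 and 8 can also be done programmatically. What follows is only a minimal sketch in plain C++ with no libwpd dependency; looksLikeOle2 and findWpMagic are hypothetical helper names, not libwpd functions. The eight bytes at the start of both originals match the signature documented for OLE2 compound files, but treating these files as OLE2 containers is an assumption on my part.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// The four bytes at the start of the resaved files: 0xFF 'W' 'P' 'C'.
static const std::array<std::uint8_t, 4> kWpMagic = {0xFF, 0x57, 0x50, 0x43};

// The eight bytes at the start of both original files. They match the
// OLE2 compound file signature (assumption: that is what the originals are).
static const std::array<std::uint8_t, 8> kOle2Signature = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

// Hypothetical helper: true if the buffer starts with the eight bytes above.
bool looksLikeOle2(const std::vector<std::uint8_t>& data) {
    return data.size() >= kOle2Signature.size() &&
           std::equal(kOle2Signature.begin(), kOle2Signature.end(), data.begin());
}

// Hypothetical helper: offset of the first 0xFF 'W' 'P' 'C' in the buffer,
// or -1 if the sequence does not occur at all.
long findWpMagic(const std::vector<std::uint8_t>& data) {
    auto it = std::search(data.begin(), data.end(),
                          kWpMagic.begin(), kWpMagic.end());
    return it == data.end() ? -1 : static_cast<long>(it - data.begin());
}
```

For a buffer laid out like File 1 original, findWpMagic would report 2048, and for a resaved file it would report 0.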

The first question, I guess, is: why do wpd2text and all the others (which I assume use GSFInputStream) work, while the program that uses WPXMemoryInputStream does not?

Next question: apart from re-saving all the files, is there any way to persuade libwpd to process the files, i.e. to determine the length of the "pre-header", whatever it is, and seek past it?
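
To make the "seek past it" idea concrete, here is a rough sketch of what I have in mind: read the whole file into memory, look for the first 0xFF 'W' 'P' 'C', and keep only the bytes from there onward, so that the trimmed buffer (rather than the raw file) is what gets handed to the memory stream. stripPreHeader is a hypothetical name, and the approach assumes the WP document is stored in one contiguous run after the magic; if the pre-header belongs to a container format that splits the document into separate chunks, this would produce corrupt input.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: returns a copy of the file contents starting at the
// first 0xFF 'W' 'P' 'C', or an empty vector if the magic is never found.
// Assumes the document bytes follow the magic contiguously to end of file.
std::vector<std::uint8_t> stripPreHeader(const std::vector<std::uint8_t>& file) {
    static const std::uint8_t magic[] = {0xFF, 0x57, 0x50, 0x43};
    auto it = std::search(file.begin(), file.end(),
                          std::begin(magic), std::end(magic));
    if (it == file.end()) {
        return {};  // no WordPerfect header anywhere in the buffer
    }
    // Copy everything from the magic onward into a fresh buffer.
    return std::vector<std::uint8_t>(it, file.end());
}
```

A file that already starts with the magic comes back unchanged, so the same path could be used for both the original and the resaved files.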

Thanks, any help appreciated.

regards - David
_______________________________________________
Libwpd-devel mailing list
Libwpd-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libwpd-devel