RE: State of the Union for HWPF

Kais Dukes Wed, 27 Apr 2005 09:27:58 -0700

Dear Andy,

I can answer your question with regard to Unicode and non-Unicode storage.
The fundamental structure in a Word document is known as the piece table. A
piece table is a structure that maps logical parts of document text to
locations in memory (or locations in the Word file, if you want to code
directly against the file format).
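
To make the idea concrete, here is a rough sketch of a piece table in Java. It is only an illustration of the general technique, not the structure Word or HWPF actually uses: the class name PieceTableSketch, the two-buffer layout and every method name are my own assumptions. The point it shows is that an insert only adds or splits small piece records; the text itself is never copied or shifted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not HWPF's or Word's piece table.
final class PieceTableSketch {
    // A piece is a (buffer, start, length) reference; it holds no text itself.
    private static final class Piece {
        final boolean inAddBuffer;
        final int start;
        final int length;

        Piece(boolean inAddBuffer, int start, int length) {
            this.inAddBuffer = inAddBuffer;
            this.start = start;
            this.length = length;
        }
    }

    private final String original;                           // never modified
    private final StringBuilder added = new StringBuilder(); // append-only
    private final List<Piece> pieces = new ArrayList<>();

    PieceTableSketch(String original) {
        this.original = original;
        if (!original.isEmpty()) {
            pieces.add(new Piece(false, 0, original.length()));
        }
    }

    // Logical length of the document, summed over all pieces.
    int length() {
        int total = 0;
        for (Piece p : pieces) total += p.length;
        return total;
    }

    int pieceCount() {
        return pieces.size();
    }

    // Insert text at a logical position by splitting the piece that covers it.
    void insert(int pos, String text) {
        if (pos < 0 || pos > length()) throw new IndexOutOfBoundsException("pos=" + pos);
        if (text.isEmpty()) return;
        Piece fresh = new Piece(true, added.length(), text.length());
        added.append(text);
        int offset = 0;
        for (int i = 0; i < pieces.size(); i++) {
            Piece p = pieces.get(i);
            if (pos <= offset + p.length) {
                int split = pos - offset;
                List<Piece> replacement = new ArrayList<>();
                if (split > 0) replacement.add(new Piece(p.inAddBuffer, p.start, split));
                replacement.add(fresh);
                if (split < p.length) {
                    replacement.add(new Piece(p.inAddBuffer, p.start + split, p.length - split));
                }
                pieces.remove(i);
                pieces.addAll(i, replacement);
                return;
            }
            offset += p.length;
        }
        pieces.add(fresh); // only reached when the table was empty
    }

    // Rebuild the logical text by walking the pieces in order.
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            CharSequence src = p.inAddBuffer ? added : original;
            sb.append(src, p.start, p.start + p.length);
        }
        return sb.toString();
    }
}
```

For example, starting from "Hello world", insert(5, ",") followed by insert(length(), "!") leaves four pieces whose concatenation is "Hello, world!", while the original buffer is untouched. A real implementation would also need delete and some index faster than a linear scan over the pieces.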


AbiWord, the open source word processor, for example, has an excellent piece
table implementation. Microsoft Word has a different implementation, which
may be more efficient for very large documents.

You cannot really manipulate lots of text without using an efficient piece
table (or some equivalent text mapping structure) -- for larger documents or
for heavy edits the memory requirements will explode (in the case of Java, you
will probably start getting JVM heap errors).

Now, with regard to Unicode and non-Unicode text, Microsoft Word does
something different to most other word processors. Word uses the piece table
to be able to store Unicode and non-Unicode text in the same file. What
this means is that Word itself judges which sequences should be Unicode, and
which sequences should be in a single-byte character encoding (usually
CP1252). It uses some internal algorithm to decide this. However, if you are
talking about writing a Word document, you can set up your Unicode /
non-Unicode sequences however you wish, as long as the piece table is
implemented correctly.
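
As a rough sketch of what mixing the two encodings through a piece table can look like, the Java below keeps one byte stream and records, per piece, whether its bytes are 8-bit CP1252 or 16-bit Unicode (UTF-16LE here). Everything in it is assumed for illustration: the class MixedEncodingSketch, its fields, and the "use CP1252 whenever the text fits" rule are my own, and they are neither Word's internal algorithm nor the actual on-disk layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the real Word file layout.
final class MixedEncodingSketch {
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset UTF16LE = StandardCharsets.UTF_16LE;

    // Each piece records where its bytes start and how they are encoded.
    private static final class Piece {
        final int byteOffset;
        final int charCount;
        final boolean unicode;

        Piece(int byteOffset, int charCount, boolean unicode) {
            this.byteOffset = byteOffset;
            this.charCount = charCount;
            this.unicode = unicode;
        }
    }

    private final ByteArrayOutputStream stream = new ByteArrayOutputStream();
    private final List<Piece> pieces = new ArrayList<>();

    // Assumed policy: one byte per character if CP1252 can encode the whole
    // run, otherwise two bytes per character. A writer may choose differently.
    void append(String text) {
        boolean unicode = !CP1252.newEncoder().canEncode(text);
        byte[] bytes = text.getBytes(unicode ? UTF16LE : CP1252);
        pieces.add(new Piece(stream.size(), text.length(), unicode));
        stream.write(bytes, 0, bytes.length);
    }

    int byteSize() {
        return stream.size();
    }

    // Reading: the piece, not the stream, says how to decode each run.
    String decode() {
        byte[] all = stream.toByteArray();
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            int byteLength = p.unicode ? p.charCount * 2 : p.charCount;
            Charset cs = p.unicode ? UTF16LE : CP1252;
            sb.append(new String(all, p.byteOffset, byteLength, cs));
        }
        return sb.toString();
    }
}
```

Appending "café " and then a run of Japanese text gives one 8-bit piece followed by one 16-bit piece in the same stream, and decode() returns the combined text. The reader never has to guess the encoding, because it is carried by the piece that points at the bytes.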

A "complex file" aka a "fast saved file" actually dumps the piece table
directly to disk. In this case, there are Unicode and non-Unicode sections
mixed up in the file, as well as text stored on disk which isn't actually part
of the document's logical text stream.

A simple (non-complex, non-fast-saved) Word file will have only a single
character encoding in the text stream, usually all Unicode.

Hope this helps.

Regards
-- Kais

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 27 April 2005 17:01
To: [email protected]
Subject: State of the Union for HWPF


So it looks like if I create a new document or even use one of the word
office templates, I can add all the text I like and can even style it
like existing text.

However, it looks at the moment like:

  * Delete is horribly broken, this needs to be fixed
  * You can do things with HWPF that are really structurally unsound
  * The usage patterns for when to create a Section, Paragraph and
CharacterRun aren't very well defined
  * WHOA MOMMMA there are a lot of methods and constants that you can
set on any given "thing"
  * Ryan apparently did not believe in JavaDoc, JUnit (very few tests),
or Documentation (which is why I continually refused to let HWPF out of
scratchpad, which is why the project floundered up until now -- gee
Ryan...maybe that's why it's hard to use).

That being said:
  * It seems to be fairly functional even for somewhat complex documents
especially in *reading*
  * SuperLink and its clients may put significant investment into HWPF
in the near future to get it up to spec.

The API needs several refinements:

  * add "cloneProperties" methods to Paragraph and CharacterRun  - (done
but not committed)
  * Why can't a CharacterRun be added to a paragraph?
  * Why can't a CharacterRun be deleted from a paragraph?
  * Groupings of similar properties should be broken down into
compositions of objects rather than just one big Mega properties object.
  * Weird word structural abbreviations shouldn't be exposed to the
usermodel
  * Unicode support.

Question:
  * There are a couple of people here with some good Word knowledge.
Can anyone give me some pointers on the difference between Unicode text
storage and non-Unicode text storage?

Glen, Avik and Rainer are scared...commit messages directly from me
again ;-)

-Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List:    http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/