Dear Andy,

I can answer your question with regard to Unicode and non-Unicode storage. The fundamental structure in a Word document is known as the piece table. A piece table is a structure that maps logical parts of the document text to locations in memory (or to locations in the Word file, if you want to code directly against the file format).
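To make the mapping concrete, here is a minimal sketch of a piece table in Java. It is only an illustration under stated assumptions, not the Word, AbiWord or HWPF implementation: the class and method names (PieceTableSketch, Piece, append, insert, slice) are hypothetical, and the per-piece encoding flag assumes 2-byte UTF-16LE for Unicode pieces and 1-byte CP1252 otherwise, which is the detail the Unicode question turns on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all names here are hypothetical, not HWPF or Word API.
final class PieceTableSketch {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // One piece: a run of logical characters backed by a byte range in some buffer.
    static final class Piece {
        final byte[] buffer;    // backing bytes (file contents or an in-memory add buffer)
        final int byteOffset;   // where this piece's bytes start in the buffer
        final int charCount;    // number of logical characters in the piece
        final boolean unicode;  // assumption: true = UTF-16LE (2 bytes/char), false = CP1252 (1 byte/char)

        Piece(byte[] buffer, int byteOffset, int charCount, boolean unicode) {
            this.buffer = buffer;
            this.byteOffset = byteOffset;
            this.charCount = charCount;
            this.unicode = unicode;
        }

        static Piece ofUnicode(String s) {
            return new Piece(s.getBytes(StandardCharsets.UTF_16LE), 0, s.length(), true);
        }

        // Assumes every character of s is representable in CP1252.
        static Piece ofCp1252(String s) {
            return new Piece(s.getBytes(CP1252), 0, s.length(), false);
        }

        int bytesPerChar() {
            return unicode ? 2 : 1;
        }

        // Sub-piece covering characters [from, to); no bytes are copied.
        Piece slice(int from, int to) {
            return new Piece(buffer, byteOffset + from * bytesPerChar(), to - from, unicode);
        }

        String text() {
            return new String(buffer, byteOffset, charCount * bytesPerChar(),
                    unicode ? StandardCharsets.UTF_16LE : CP1252);
        }
    }

    private final List<Piece> pieces = new ArrayList<>();

    void append(Piece p) {
        pieces.add(p);
    }

    int pieceCount() {
        return pieces.size();
    }

    // Insert at a logical character position by splitting the piece that contains it.
    void insert(int charPos, Piece inserted) {
        if (charPos < 0) {
            throw new IndexOutOfBoundsException("position " + charPos);
        }
        int start = 0; // logical position of the first character of the current piece
        for (int i = 0; i < pieces.size(); i++) {
            Piece p = pieces.get(i);
            if (charPos == start) {          // falls exactly on a piece boundary
                pieces.add(i, inserted);
                return;
            }
            if (charPos < start + p.charCount) {
                int split = charPos - start; // split point inside piece p
                pieces.set(i, p.slice(0, split));
                pieces.add(i + 1, inserted);
                pieces.add(i + 2, p.slice(split, p.charCount));
                return;
            }
            start += p.charCount;
        }
        if (charPos != start) {
            throw new IndexOutOfBoundsException("position " + charPos);
        }
        pieces.add(inserted);                // append at the very end
    }

    // The logical text stream: every piece decoded with its own encoding, in order.
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(p.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PieceTableSketch doc = new PieceTableSketch();
        doc.append(Piece.ofCp1252("Hello world"));
        doc.insert(5, Piece.ofUnicode(", \u4e16\u754c"));
        System.out.println(doc.pieceCount() + " pieces: " + doc.text());
    }
}
```

The point of the design is that an insert never moves existing text: it only splits one entry in the table, so the original bytes stay where they are and only the list of pieces changes. Each piece is decoded with its own encoding, so 8-bit and 16-bit runs can sit side by side in one logical text stream.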
AbiWord, the open source word processor, for example, has an excellent piece table implementation. Microsoft Word has a different implementation, which may be more efficient for very large documents. You cannot really manipulate lots of text without using an efficient piece table (or some equivalent text mapping structure) -- for larger documents or for heavy edits the memory requirements will explode (in the case of Java, you will probably start getting JVM heap errors).

Now, with regard to Unicode and non-Unicode text, Microsoft Word does something different to most other word processors. Word uses the piece table to be able to store Unicode and non-Unicode text in the same file. What this means is that Word itself judges which sequences should be stored as Unicode, and which sequences should be stored in a single-byte code page (usually CP1252). It uses some internal algorithm to decide this. However, if you are talking about writing a Word document, you can set up your Unicode / non-Unicode sequences however you wish, as long as the piece table is implemented correctly.

A "complex file", aka a "fast saved file", actually dumps the piece table directly to disk. In this case, there are Unicode and non-Unicode sections mixed up in the file, as well as text stored on disk which isn't actually part of the document's logical text stream. A simple (non-complex, non-fast-saved) Word file will have only a single character set, usually all Unicode, in the text stream.

Hope this helps.

Regards
-- Kais

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 27 April 2005 17:01
To: [email protected]
Subject: State of the Union for HWPF

So it looks like if I create a new document, or even use one of the Word Office templates, I can add all the text I like and can even style it like existing text.
However, it looks at the moment like:

* Delete is horribly broken, this needs to be fixed
* You can do things with HWPF that are really structurally unsound
* The usage patterns for when to create a Section, Paragraph and CharacterRun aren't very well defined
* WHOA MOMMMA there are a lot of methods and constants that you can set on any given "thing"
* Ryan apparently did not believe in JavaDoc, JUnit (very few tests), or Documentation (which is why I continually refused to let HWPF out of scratchpad, which is why the project floundered up until now -- gee Ryan...maybe that's why it's hard to use).

That being said:

* It seems to be fairly functional even for somewhat complex documents, especially in *reading*
* SuperLink and its clients may put significant investment into HWPF in the near future to get it up to spec.

The API needs several refinements:

* add "cloneProperties" methods to Paragraph and CharacterRun - (done but not committed)
* Why can't a CharacterRun be added to a paragraph?
* Why can't a CharacterRun be deleted from a paragraph?
* Groupings of similar properties should be broken down into compositions of objects rather than just one big Mega properties object.
* Weird Word structural abbreviations shouldn't be exposed to the usermodel
* Unicode support.

Question:

* There are a couple of people here with some good Word knowledge.. Can anyone give me some pointers on the difference between Unicode text storage and non-Unicode text storage?

Glen, Avik and Rainer are scared...commit messages directly from me again ;-)

-Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/
