Martin Burrow wrote:
> Hi everyone,
>
> I'm interested in using the POI package in order to extract content from
> an MS Word document. I've managed to get it to do this, but the
> extracted text is stripped of all style information, just plain text,
> e.g.
>
> The quick brown fox jumps over the lazy dog.
>
> What I'm looking to do is also show which text is in bold or italics.
> So for example it would output:
>
> The [b]quick[/b] brown fox [i]jumps over[/i] the lazy dog.
>
> Or failing this, can the document be output as an XML document that
> also contains style information?
>
Hello Martin,
I'm currently working on this problem too, but I don't think POI is
ready for what we want yet :(
I've found two other solutions to the doc2xml problem:
1. a Python script called doc2xml
* http://pair.mbl.ca/doc2xml/
* GPL
* the script can read Word 97, Word 2000 and Word 2002 documents
* the XML output contains all the stylesheets!
2. libwv
* http://wvware.sourceforge.net/
* GPL
* I'm currently writing a Java wrapper (JNI) for this library
* libwv is used in KWord and AbiWord
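Whichever reader ends up supplying the formatting, producing the bracket
markup from your example is the easy part. Here is a minimal sketch in Java,
assuming the document has already been reduced to a list of text runs that
each carry a bold and an italic flag. RunMarkup, StyledRun and toMarkup are
hypothetical names of my own, not part of POI, doc2xml or libwv. If HWPF
turns out to expose per-run formatting (I believe its CharacterRun class has
isBold() and isItalic(), but I have not checked how reliable they are), the
same loop should apply to its runs.

```java
import java.util.List;

// Hypothetical helper, not part of POI or libwv: turns styled text runs
// into the [b]...[/b] and [i]...[/i] markup from the example above.
final class RunMarkup {

    // One stretch of text with uniform formatting (assumed input shape).
    static final class StyledRun {
        final String text;
        final boolean bold;
        final boolean italic;

        StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    // Wraps each run in tags; bold is the outer tag when both flags are set.
    static String toMarkup(List<StyledRun> runs) {
        StringBuilder out = new StringBuilder();
        for (StyledRun run : runs) {
            if (run.bold) out.append("[b]");
            if (run.italic) out.append("[i]");
            out.append(run.text);
            if (run.italic) out.append("[/i]");
            if (run.bold) out.append("[/b]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<StyledRun> runs = List.of(
            new StyledRun("The ", false, false),
            new StyledRun("quick", true, false),
            new StyledRun(" brown fox ", false, false),
            new StyledRun("jumps over", false, true),
            new StyledRun(" the lazy dog.", false, false));
        // Prints: The [b]quick[/b] brown fox [i]jumps over[/i] the lazy dog.
        System.out.println(toMarkup(runs));
    }
}
```

Note that the sketch does not merge adjacent runs with the same formatting,
so two neighbouring bold runs come out as two separate [b]...[/b] pairs.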
>
> Is there any way of doing this using the standard POI package? I
> believe this would definitely be possible using POI/HWPF? I visited the
> HWPF project page but couldn't see where to download the source code -
> could someone point me in the right direction?
>
I'm using POI straight from the Subversion repository:
http://svn.apache.org/repos/asf/jakarta/poi/trunk/src
Details are on the POI site (http://jakarta.apache.org/poi/).
Regards
Andreas