Hi,

> Yes, I was able to extract plain text from OOXML, without any format
> codes, as a first step. But the tags need a lot of careful checking
> to find the right end of each paragraph.
>
> For example, run this at the prompt:
>
> 7z e -so -y yourdocument.doc > outfile.xml
> ex doc2txt.ex outfile.xml
>
> This is the code of doc2txt.ex (requires the Euphoria interpreter)

Here is a shorter variant, using sed:

www.delorie.com/gnu/docs/sed/sed.1.html
ftp://ftp.delorie.com/pub/djgpp/current/v2gnu/

sed -e 's/<[^>]*>//g' < outfile.xml > outfile.txt

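You may need extra expressions for the rare cases where
a tag contains a line break, because sed works line by line.
You can also add expressions to decode entities in the text,
for example &amp; to &: just append more -e 's/in/out/g'
(write the & in the output part as \&, since a bare & there
means "the matched text").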
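
A rough sketch of such a chain, in case it helps. The function
name strip_tags and the sample input are made up for this
example, and the entity list is only a guess at what a typical
document contains. &amp; is decoded last so that text such as
&amp;lt; ends up as &lt; and not as <.

```shell
# strip_tags: hypothetical helper, reads XML on stdin and writes text to stdout
strip_tags() {
    # 1st expression removes every tag that starts and ends on the same line;
    # the rest decode a few common entities, with &amp; last.
    # In the replacement part, \& is a literal ampersand.
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'
}

# Made-up sample line standing in for outfile.xml
printf '%s\n' '<w:p><w:t>Tom &amp; Jerry, a &lt; b</w:t></w:p>' | strip_tags
```

On a real file this would be: strip_tags < outfile.xml > outfile.txt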

Eric ;-)





