> > I think MS-OOXML also has binary files inside for printer
> > settings or similar sometimes. While the full office formats
> > are indeed extremely complex, you can often get a quite good
> > idea of the text content by unzipping the XML inside which has
> > the focus on content, as opposed to layout etc, and then
> > removing all XML tags and attributes. Example:
> >
> > ...
> > <fancyname fancyproperty="greenish">Hello</fancyname>
> > ...
> >
> > ...would simply be reduced to "Hello", easy to read in DOS.
> > Does anybody know a nice program for that for DOS? When in
> > doubt, you can always use the DJGPP port of GNU textutils,
> > for example the SED tool, to remove the XML markups... ;-)
> I know there is a program on Linux that does exactly that, but I forgot
> its name. Shouldn't be hard to find. If someone could port that...
> --
> Amedee

It wouldn't be too hard to write your own. Just scan through the file and
output everything except what's between "<" and ">".
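
As a rough sketch of that scan (written in Python only for illustration; the
function name strip_tags is made up here, and the same loop is easy to redo
in C for DOS), assuming a ">" never appears inside an attribute value and
ignoring comments and CDATA sections:

```python
def strip_tags(text):
    """Return text with everything between "<" and ">" dropped."""
    out = []
    in_tag = False
    for ch in text:
        if ch == "<":
            in_tag = True          # start skipping at the opening bracket
        elif ch == ">" and in_tag:
            in_tag = False         # the tag has ended, resume copying
        elif not in_tag:
            out.append(ch)         # ordinary content character, keep it
    return "".join(out)


if __name__ == "__main__":
    print(strip_tags('<fancyname fancyproperty="greenish">Hello</fancyname>'))
```

Run on the example line quoted above, this should leave just the word Hello.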

The problem comes in if there are character entities like "&copy;". That was
the main issue I ran into when trying to write an HTML-to-plaintext converter.
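
One possible way around the entity problem, sketched in Python: strip the
tags first, then decode the entities, so that an escaped "&lt;" is not
mistaken for the start of a tag. The helper name to_plain is hypothetical.
html.unescape is from the Python standard library and knows the HTML named
entities; plain XML only predefines five (&amp;, &lt;, &gt;, &quot;, &apos;)
plus numeric references, so this covers more than XML strictly needs.

```python
import html
import re

# Naive tag pattern: assumes no ">" inside attribute values.
TAG = re.compile(r"<[^>]*>")


def to_plain(markup):
    """Remove tags, then turn entities such as &copy; into characters."""
    without_tags = TAG.sub("", markup)
    return html.unescape(without_tags)


if __name__ == "__main__":
    print(to_plain("<p>&copy; 2009 Tom &amp; Jerry</p>"))
```

On DOS the decoded characters would still have to be mapped to whatever code
page is active, which this sketch does not attempt.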


"Those who do not become dishonest or hostile are the most difficult to
debate. Addressing others in a respectful and considerate manner conveys
favorable impressions of their belief system. Providing rational answers to
questions creates positive image of the person and their beliefs. Thank
goodness there are very few such members and they are greatly outnumbered by
those who are irrational, rabid, intolerant, and disrespectful." - An
Freedos-user mailing list