>
> Hello,
>
> I ran into an infinite loop using doc2html. When it parses a PDF document it tries
> to reassemble hyphenated words. Unfortunately, I have documents that end with a
> dash, like "text-", so the loop spins forever looking for the other half of the word.
> Adding a check for eof fixed it.
>
> in sub try_text()
>
> while (<CAT>) {
> while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
> ($_ .= <CAT>) || last;
> s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
> }
> --
> while (<CAT>) {
> while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
> ($_ .= <CAT>) || last;
> s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
> + last if eof;
> }
>
>
> Terry Luedtke
> National Library of Medicine
>
This bug fix arrived too late to go into version 2.1 of doc2html.pl,
which is now available from the External Parsers section of
http://www.htdig.org/contrib/
Version 2.1 uses both the magic number and the MIME type to decide
which conversion utility to use, and is able to cope with:
MS Word (most versions, including Word2 and Word for Mac)
MS Excel
MS PowerPoint
WordPerfect (purchase of wp2html necessary)
Adobe PDF
PostScript
RTF
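
As a rough illustration only (this is not lifted from doc2html.pl), combining a magic-number check with the MIME type might look something like the sketch below. The sub name guess_type, the table @MAGIC and the returned labels are all hypothetical. The byte signatures are the well-known ones for PDF, PostScript, RTF and the OLE2 container used by many MS Office binary files. Since several Office formats share that container, the MIME type is used to tell them apart.

```perl
use strict;
use warnings;

# Hypothetical table (not from doc2html.pl): leading bytes -> document family.
my @MAGIC = (
    [ "%PDF-",                            'pdf'        ],
    [ "%!",                               'postscript' ],
    [ "{\\rtf",                           'rtf'        ],
    [ "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 'ole2'       ],  # OLE2 compound file
);

# Guess the document type from the first bytes of the file ($head)
# and the MIME type reported for it ($mime).
sub guess_type {
    my ($head, $mime) = @_;
    for my $entry (@MAGIC) {
        my ($sig, $type) = @$entry;
        next unless index($head, $sig) == 0;    # signature must be at offset 0
        return $type unless $type eq 'ole2';
        # The magic number alone cannot separate the Office formats,
        # so fall back on the MIME type here.
        return 'excel'      if $mime =~ /excel/i;
        return 'powerpoint' if $mime =~ /powerpoint/i;
        return 'word'       if $mime =~ /msword/i;
        return 'ole2-unknown';
    }
    return;    # no known signature
}

print guess_type("%PDF-1.3\n", 'application/pdf'), "\n";    # prints "pdf"
```
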
There are a number of minor improvements, including a useful one
to the conversion of PDF files.
As for the future, the hyphenation code is nearly unchanged from
parsedoc.pl and clearly needs revision. This is not something I am
going to be able to spend much time on in the next few months, so if
someone would volunteer to take over code development I would be very
pleased to hand it on to them.
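
For anyone who does pick this up, here is a minimal sketch of one way the rejoining loop might be restructured so that it cannot spin at end of file. It has not been tested inside doc2html.pl. The sub name join_hyphenated and the in-memory filehandle standing in for CAT are purely illustrative. The point is that the original test `($_ .= <CAT>) || last` never fires, because appending an undefined value still leaves a non-empty (true) string in $_. Checking the result of the read itself avoids that.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of doc2html.pl: read lines from a filehandle
# and rejoin words that were hyphenated across a line break.
sub join_hyphenated {
    my ($fh) = @_;
    my @out;
    while (defined(my $line = <$fh>)) {
        # Keep pulling in the next line while this one ends in "letter-".
        while ($line =~ /[A-Za-z\300-\377]-\s*$/) {
            my $next = <$fh>;
            last unless defined $next;    # end of file: no second half to find
            $line .= $next;
            # Same substitution as the original code: drop the hyphen and
            # the line break between the two halves of the word.
            $line =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        push @out, $line;
    }
    return join '', @out;
}

# Usage: the last line ends in a dash, which used to trigger the endless loop.
my $text = "A docu-\nment ending in text-\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
print join_hyphenated($fh);    # prints "A document ending in text-"
close $fh;
```

Each pass through the inner loop either consumes one more line or leaves the loop, so it has to finish once the input is exhausted.
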
--
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton