Re: [htdig] infinite loop in doc2html.pl

David Adams Wed, 20 Sep 2000 01:11:43 -0700
> 
> Hello,
> 
> I ran into an infinite loop using doc2html.  When it parses a PDF document it tries
> to reassemble hyphenated words.  Unfortunately, I have documents that end with a
> dash, like "text-", so the loop spins forever looking for the other half of the word.
> Adding a check for eof fixed it.
> 
> in sub try_text()
> 
>       while (<CAT>) {
>         while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
>           ($_ .= <CAT>) || last;
>           s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
>         }
> --
>       while (<CAT>) {
>         while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
>           ($_ .= <CAT>) || last;
>           s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
> +          last if eof;
>         }
> 
> 
> Terry Luedtke
> National Library of Medicine
> 

This bug fix arrived too late to go into version 2.1 of doc2html.pl
which is now available from the External Parsers section of
http://www.htdig.org/contrib/

Version 2.1 uses both the magic number and the MIME type to decide
which conversion utility to use, and is able to cope with:

        MS Word (most versions including Word2 and Word for Mac)
        MS Excel
        MS PowerPoint
        WordPerfect (purchase of wp2html necessary)
        Adobe PDF
        PostScript
        RTF
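
To illustrate the idea of combining the two checks, here is a minimal
sketch.  It is not the code in doc2html.pl: the @MAGIC table, the format
labels and the sub names guess_by_magic() and pick_format() are all
hypothetical, and only a few well-known file signatures are listed.  The
point is that MS Word, Excel and PowerPoint files can share the same
leading bytes, so the MIME type is needed to tell them apart.

```perl
use strict;
use warnings;

# Hypothetical table of leading bytes -> format label (not from doc2html.pl).
my @MAGIC = (
    [ '%PDF-',                            'pdf' ],
    [ '%!',                               'ps'  ],
    [ '{\rtf',                            'rtf' ],
    [ "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 'ole' ],  # OLE2 container: Word, Excel, PowerPoint
    [ "\xFFWPC",                          'wp'  ],  # WordPerfect
);

# Return a format label for the first bytes of a file, or undef if unknown.
sub guess_by_magic {
    my ($head) = @_;
    for my $entry (@MAGIC) {
        my ($magic, $label) = @$entry;
        return $label if index($head, $magic) == 0;
    }
    return undef;
}

# Use the magic number first, then the MIME type to split up the OLE2 case.
sub pick_format {
    my ($head, $mime) = @_;
    my $label = guess_by_magic($head);
    return undef  unless defined $label;
    return $label unless $label eq 'ole';
    return 'excel'      if $mime =~ /excel/i;
    return 'powerpoint' if $mime =~ /powerpoint/i;
    return 'word';      # assumption: treat any other OLE2 file as Word
}
```

A caller would read the first few bytes of the file, pass them in along
with the MIME type supplied by the server, and choose a converter from
the returned label.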

There are a number of minor improvements, including a useful improvement
in the conversion of PDF files.

As for the future, the hyphenation code is nearly unchanged from
parsedoc.pl and clearly needs revision.  This is not something I am
going to be able to spend much time on in the next few months, so if
someone would volunteer to take over code development I would be very
pleased to hand it on to them. 
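
For anyone picking this up, here is a minimal sketch of a hyphen-joining
loop that cannot spin at end of file.  It is only an illustration, not
the code in doc2html.pl or parsedoc.pl: the sub name dehyphenate() and
the lexical filehandle are my own assumptions, and it tests the result
of the read directly rather than calling eof.  The regular expressions
are the ones from the reported loop.

```perl
use strict;
use warnings;

# Join words hyphenated across line breaks, reading from filehandle $fh.
# Returns the list of resulting lines.  (Hypothetical helper.)
sub dehyphenate {
    my ($fh) = @_;
    my @out;
    while (defined(my $line = <$fh>)) {
        # A line ending in "letter-" may continue on the next line.
        while ($line =~ /[A-Za-z\300-\377]-\s*$/) {
            my $next = <$fh>;
            last unless defined $next;   # end of input: keep the trailing dash
            $line .= $next;
            # Drop the hyphen and line break between the two word halves.
            $line =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        push @out, $line;
    }
    return @out;
}
```

Each pass of the inner loop either consumes a line or exits, so a
document that ends in "text-" simply keeps its final dash.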


-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  <http://www.htdig.org/mail/menu.html>
FAQ:            <http://www.htdig.org/FAQ.html>