Re: [htdig] parse_doc.pl alterations

David Adams Fri, 26 Nov 1999 01:46:13 -0800
> According to David Adams:
> > I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> > utilities, and I am now using them to extend our search index to include
> > Word and PDF files.  It all works well and with a bit of alteration to
> > the Perl script does exactly what I want.  My thanks to the developers!
> 
> I forgot to ask before, what were your alterations?  Something very
> specific to your needs, or something worth sharing with others?
> 
> -- 
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

Well, since you ask, I noticed two problems with PDF files on our site:

1.      the titles were often meaningless, having no connection with
        the contents.

2.      pdftotext outputs some spurious non-ASCII gibberish that is
        then indexed.

I modified the code which outputs the title to always include the
type, and to put any extracted title in double quotes or the filename
in square brackets:

# if no title use filename from URL
if (not length($title)) {
        $title = $ARGV[2];
        $title =~ s#^.*/##;
        $title = '[' . $title . ']';
} else {  
        $title = '"' . $title . '"';
}
print "t\t$title ($type Document)\n";


To throw away the spurious "words" I simplified the code to replace
all non-alphanumerics with spaces.  I appreciate that many people would
think that too drastic:


while (<CAT>) {
        while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
                my $next = <CAT>;               # read the continuation line
                last unless defined $next;      # stop at end of input
                $_ .= $next;
                s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
        }
        $head .= " " . $_;
#        s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\$
#        s/[\255]/-/g;                                   # replace dashes with $
        s/\W/ /g;       # replace non-alphanumeric characters with spaces
        s/\s+/ /g;      # replace multiple spaces, etc. with a single space
        @fields = split;                                # split up line
        next if (@fields == 0);                         # skip if no fields (do$
        for ($x=0; $x<@fields; $x++) {                  # check each field if s$
                if (length($fields[$x]) >= $minimum_word_length) {
                        push @allwords, $fields[$x];    # add to list  
                }
        }
}
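
For anyone weighing how drastic this is, here is a small stand-alone sketch of
what the two substitutions do to a single line. The sample line and the helper
name words_from_line are made up for illustration and are not part of
parse_doc.pl. Two things worth noticing: \W leaves digits and underscores
alone, and on a plain byte string (no "use locale") it also treats the
accented letters \300-\377 as non-word characters, so accented words get split.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in parse_doc.pl): apply the same two
# substitutions as the loop above and return the surviving words.
sub words_from_line {
    my ($line, $min_len) = @_;
    $line =~ s/\W/ /g;     # every non-word character becomes a space
    $line =~ s/\s+/ /g;    # collapse runs of whitespace
    return grep { length($_) >= $min_len } split ' ', $line;
}

# Made-up sample: punctuation, two control bytes and an accented letter (\351).
my @words = words_from_line("Search-engine (v3.1): caf\351 \001\002 index_file!", 3);
print join(",", @words), "\n";   # prints: Search,engine,caf,index_file
```

With a minimum length of 3, "v3" and "1" are dropped, and the accented word
loses its last letter because \351 was turned into a space.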

The spurious output is no longer indexed, but it does remain in the head,
so there is further room for improvement.
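
One possible way to keep the gibberish out of the head as well would be to
clean each line before it is appended to $head. The following is only a sketch
under assumptions: the helper name clean_for_head is hypothetical, and the
choice to keep printable ASCII plus the letters \300-\377 (the same range the
dehyphenation test uses) is an assumption made here, not something
parse_doc.pl does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep printable ASCII and the bytes \300-\377,
# turn everything else into a space, then tidy the whitespace.
sub clean_for_head {
    my ($line) = @_;
    $line =~ s/[^\040-\176\300-\377]/ /g;  # control bytes and other high-bit bytes
    $line =~ s/\s+/ /g;                    # collapse runs of whitespace
    $line =~ s/^ | $//g;                   # trim a leading or trailing space
    return $line;
}

# Made-up input lines standing in for what <CAT> would deliver.
my $head = "";
for my $line ("Annual report\001\002\n", "\222\223 caf\351 prices\n") {
    my $clean = clean_for_head($line);
    $head .= " " . $clean if length $clean;
}
print "$head\n";
```

In the loop above this would amount to appending clean_for_head($_) to $head
instead of $_ itself. Punctuation is kept, so the excerpt stays readable.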

-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton
