Re: [htdig] Using pdftotext to index PDF documents

Gilles Detillieux Thu, 25 Feb 1999 17:07:09 -0500

> There's still a bit more work to be done.  Patrick mentioned that
> pdftotext changed hyphens to spaces.  Not so, but parse_doc.pl does.
> In fact, it converts all punctuation to spaces, to separate out the words.
> The problem is right now, the word list is what it spits out for the
> "h" record as well.  So there's no punctuation at all in the excerpts!

OK, here's take 2 on my parse_doc.pl patch, to support pdftotext.
Apart from some cleaning up, and the same additions as my earlier (and
now obsolete) patch, it builds a separate string for the head record,
with processing on it equivalent to what htdig does on plain text files.
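
For readers who want to see the head-record step in isolation, here is a
minimal standalone sketch of the same cleanup: trim, collapse whitespace,
then escape the three HTML-special characters. The make_head name and the
sample input are made up for illustration and are not part of parse_doc.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in parse_doc.pl): turn raw converter output
# into a single-line excerpt suitable for the "h" record.
sub make_head {
    my ($text) = @_;
    $text =~ s/^\s+//;       # trim leading whitespace
    $text =~ s/\s+$//;       # trim trailing whitespace
    $text =~ s/\s+/ /g;      # collapse whitespace runs, including newlines
    $text =~ s/&/&amp;/g;    # escape & first so the entities below survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up sample text; note the hyphen and colon are kept in the excerpt.
my $raw = "  Spinal-cord   research:\n A & B <draft>\n";
print "h\t", make_head($raw), "\n";
```

The ampersand has to be escaped before the angle brackets, otherwise the
"&" in "&lt;" and "&gt;" would itself be escaped a second time.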

It seems to work like a charm on my PDFs (with the patch to pdftotext
I posted earlier).  I'd like a few other PDF users to try it out as an
external parser for application/pdf documents on their systems.  Also,
if anyone with more Perl experience than me (going on a few hours now)
can critique the code - either my changes or the original code - I'd
appreciate the edification.
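
For anyone trying this out: htdig hooks in external parsers through the
external_parsers attribute in its configuration file. A minimal sketch of
the entry, assuming the script was installed as /usr/local/bin/parse_doc.pl
(a hypothetical path, adjust it to wherever you put the script):

```
external_parsers: application/pdf /usr/local/bin/parse_doc.pl
```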

You can pick up the latest script from

        http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl

or apply the patch below.  This patch should be applied to the original
contrib/parse_doc.pl shipped with htdig-3.1.1.tar.gz:

--- contrib/parse_doc.pl.nopdf  Tue Feb 16 23:03:39 1999
+++ contrib/parse_doc.pl        Thu Feb 25 15:16:43 1999
@@ -10,9 +10,15 @@
 # Changed:      push line semi-colomn wrong.         <[EMAIL PROTECTED]>
 # Changed:      matching works for end of lines now  <[EMAIL PROTECTED]>
 # Added:        option to rigorously delete all punctuation <[EMAIL PROTECTED]>
+#
+# 1999/02/09
 # Added:        option to delete all hyphens         <[EMAIL PROTECTED]>
-# Changed:      uses ps2ascii to handle PS files     <[EMAIL PROTECTED]>
+# Added:        uses ps2ascii to handle PS files     <[EMAIL PROTECTED]>
+# 1999/02/15
 # Added:        check for some file formats          <[EMAIL PROTECTED]>
+# 1999/02/25
+# Added:        uses pdftotext to handle PDF files   <[EMAIL PROTECTED]>
+# Changed:      generates a head record with punct.  <[EMAIL PROTECTED]>
 #########################################
 #
 # set this to your MS Word to text converter
@@ -34,8 +40,14 @@
 # get it from the ghostscript 3.33 (or later) package
 #
 $CATPS = "/usr/bin/ps2ascii";
+#
+# set this to your PDF to text converter
+# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+#
+$CATPDF = "/usr/bin/pdftotext";
 
 # need some var's
+$head = "";
 @allwords = ();
 @temp = ();
 $x = 0;
@@ -57,6 +69,10 @@
         $parser = $CATPS;               # gs 3.33 leaves _temp_.??? files in .
         $parsecmd = "(cd /tmp; $parser; rm -f _temp_.???) < $ARGV[0] |";
         $type = "PostScript";
+} elsif ($magic =~ /%PDF-/) {           # it's PDF (Acrobat)
+        $parser = $CATPDF;
+        $parsecmd = "$parser $ARGV[0] - |";
+        $type = "PDF";
 } elsif ($magic =~ /WPC/) { # it's WordPerfect
         $parser = $CATWP;
         $parsecmd = "$parser $ARGV[0] |";
@@ -77,6 +93,7 @@
 # open it
 open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
 while (<CAT>) {
+        $head .= " " . $_;
         s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g;    # replace reading-chars with space (only at end or begin of word)
 #       s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g;      # rigorously replace all by <[EMAIL PROTECTED]>
         s/-/ /g;                                        # replace hyphens with space
@@ -101,15 +118,22 @@
 
 #############################################
 # print out the head
-$calc = @allwords;
-print "h\t";
-#if ($calc >100) {                      # but not more than 100 words
-#       $calc = 100;
+$head =~ s/^\s+//g;
+$head =~ s/\s+$//g;
+$head =~ s/\s+/ /g;
+$head =~ s/&/\&amp\;/g;
+$head =~ s/</\&lt\;/g;
+$head =~ s/>/\&gt\;/g;
+print "h\t$head\n";
+#$calc = @allwords;
+#print "h\t";
+##if ($calc >100) {                      # but not more than 100 words
+##       $calc = 100;
+##}
+#for ($x=0; $x<$calc; $x++) {            # print out the words for the exerpt
+#        print "$allwords[$x] ";
 #}
-for ($x=0; $x<$calc; $x++) {            # print out the words for the exerpt
-        print "$allwords[$x] ";
-}
-print "\n";
+#print "\n";
 
 
 #############################################

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930