Re: [htdig] PDF & ISO-Latin chars

Gilles Detillieux Fri, 13 Aug 1999 09:44:57 -0700

According to Antti Rauramo:
> Yep, great, using xpdf's pdftotext helped! Searching PDFs now works
> flawlessly too! Thank you!

Glad to help.

> > You may want to adapt the script to extract titles from PDFs using
> > pdfinfo, if the titles matter to you.  (That's something on my to-do
> > list I can't seem to find the time for.)
> 
> Oops, heh, I didn't read this far before adding a part to parse_doc.pl which
> reads the PDF and finds the title. (Though this may have problems with encrypted PDFs.)
> Here's the excerpt, beginning around line 152...
> 
> 
> #############################################
> # print out the title
> @temp = split(/\//, $ARGV[2]);          # get the filename (still needed for the fallback title below)
> #print "t\t$type Document $temp[-1]\n";  # print it
> 
> ### 13-08-1999 ant
> open(TITLEIN,"<$ARGV[0]") || print STDERR "$ARGV[0]: $!\n";
> while(<TITLEIN>){
>   if(/title/i){
>     ($pdftitle)=/\/Title \(([^\/)]+)[\/\)]/i;
>     $pdftitle && close TITLEIN;
>   }
> }
> close TITLEIN;
> 
> $pdftitle=~s/\\(\d\d\d)/pack("C",oct($1))/ge;
> if(!$pdftitle){ $pdftitle="$type Document $temp[-1]"; }
> print "t\t$pdftitle\n";

I don't know how well pdftotext and pdfinfo deal with encrypted PDFs either.
I think they need patches for this, and somehow need to be given the
decryption keys.

I do see a problem with your approach, though.  The first /Title definition
isn't necessarily the one you want.  It all depends on how the dictionaries
are laid out in the PDF.  Here are my recent changes to parse_doc.pl, which
I posted to http://www.htdig.org/files/contrib/parsers/ and to the 3.2
source tree:

Index: contrib/parse_doc.pl
===================================================================
RCS file: /opt/htdig/cvs/htdig3/contrib/parse_doc.pl,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- contrib/parse_doc.pl        1999/03/22 21:39:46     1.5
+++ contrib/parse_doc.pl        1999/08/12 22:11:38     1.6
@@ -27,6 +27,11 @@
 #               (in PDFs) & remove multiple punct. chars. between words (all)
 # 1999/03/10
 # Changed:      fix handling of minimum word length  <[EMAIL PROTECTED]>
+# 1999/08/12
+# Changed:      adapted for xpdf 0.90 release        <[EMAIL PROTECTED]>
+# Added:        uses pdfinfo to handle PDF titles    <[EMAIL PROTECTED]>
+# Changed:      keep hyphens by default, as htdig    <[EMAIL PROTECTED]>
+#               does, but change dashes to hyphens
 #########################################
 #
 # set this to your MS Word to text converter
@@ -49,11 +54,13 @@
 #
 $CATPS = "/usr/bin/ps2ascii";
 #
-# set this to your PDF to text converter
-# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+# set this to your PDF to text converter, and pdfinfo tool
+# get it from the xpdf 0.90 package at http://www.foolabs.com/xpdf/
 #
 $CATPDF = "/usr/bin/pdftotext";
+$PDFINFO = "/usr/bin/pdfinfo";
 #$CATPDF = "/usr/local/bin/pdftotext";
+#$PDFINFO = "/usr/local/bin/pdfinfo";
 
 # need some var's
 $minimum_word_length = 3;
@@ -64,6 +71,7 @@
 @fields = ();
 $calc = 0;
 $dehyphenate = 0;
+$title = "";
 #
 # okay. my programming style isn't that nice, but it works...
 
@@ -97,11 +105,25 @@
         }
 } elsif ($magic =~ /%PDF-/) {           # it's PDF (Acrobat)
         $parser = $CATPDF;
-        $parsecmd = "$parser $ARGV[0] - |";
-# kludge to handle multi-column PDFs...  (needs patched pdftotext)
-#       $parsecmd = "$parser -rawdump $ARGV[0] - |";
+        $parsecmd = "$parser -raw $ARGV[0] - |";
+# to handle single-column, strangely laid out PDFs, use coalescing feature...
+#       $parsecmd = "$parser $ARGV[0] - |";
         $type = "PDF";
         $dehyphenate = 1;               # PDFs often have hyphenated lines
+        if (open(INFO, "$PDFINFO $ARGV[0] 2>/dev/null |")) {
+                while (<INFO>) {
+                        if (/^Title:/) {
+                                $title = $_;
+                                $title =~ s/^Title:\s+(.*[^\s])\s*$/$1/;
+                                $title =~ s/\s+/ /g;
+                                $title =~ s/&/\&amp\;/g;
+                                $title =~ s/</\&lt\;/g;
+                                $title =~ s/>/\&gt\;/g;
+                                last;
+                        }
+                }
+                close INFO;
+        }
 } elsif ($magic =~ /WPC/) {             # it's WordPerfect
         $parser = $CATWP;
         $parsecmd = "$parser $ARGV[0] |";
@@ -135,7 +157,8 @@
         s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+\s+|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+$/ /g;    # replace reading-chars with space (only at end or begin of word, but allow multiple characters)
 #       s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g;    # replace reading-chars with space (only at end or begin of word)
 #       s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g;      # rigorously replace all by <[EMAIL PROTECTED]>
-        s/[\-\255]/ /g;                                 # replace hyphens with space
+#       s/[\-\255]/ /g;                                 # replace hyphens with space
+        s/[\255]/-/g;                                   # replace dashes with hyphens
         @fields = split;                                # split up line
         next if (@fields == 0);                         # skip if no fields (does it speed up?)
         for ($x=0; $x<@fields; $x++) {                  # check each field if string length >= 3
@@ -150,15 +173,19 @@
 exit unless @allwords > 0;              # nothing to output
 
 #############################################
-# print out the title
-@temp = split(/\//, $ARGV[2]);          # get the filename, get rid of basename
-print "t\t$type Document $temp[-1]\n";  # print it
+# print out the title, if it's set, and not just a file name
+if ($title !~ /^$/ && $title !~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
+        print "t\t$title\n";
+} else {                                        # otherwise generate a title
+        @temp = split(/\//, $ARGV[2]);          # get the filename, get rid of basename
+        print "t\t$type Document $temp[-1]\n";  # print it
+}
 
 
 #############################################
 # print out the head
-$head =~ s/^\s+//g;
-$head =~ s/\s+$//g;
+$head =~ s/^\s+//;                      # remove leading and trailing space
+$head =~ s/\s+$//;
 $head =~ s/\s+/ /g;
 $head =~ s/&/\&amp\;/g;
 $head =~ s/</\&lt\;/g;
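
For anyone who wants to try the title handling outside of parse_doc.pl, here is a minimal standalone sketch of the same idea. The helper name title_from_pdfinfo and the sample lines are made up for illustration; the sample only imitates the "Title:" field that the patch reads from pdfinfo. It leaves the loop with last, which is Perl's loop-exit statement, so only the first "Title:" line is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of parse_doc.pl): pick the Title field out
# of pdfinfo-style output lines and escape it for the parser output.
sub title_from_pdfinfo {
    my @lines = @_;
    my $title = "";
    foreach my $line (@lines) {
        next unless $line =~ /^Title:/;
        $title = $line;
        $title =~ s/^Title:\s*//;   # drop the field label
        $title =~ s/\s+$//;         # drop trailing whitespace and newline
        $title =~ s/\s+/ /g;        # collapse runs of whitespace
        $title =~ s/&/&amp;/g;      # escape & first, then < and >
        $title =~ s/</&lt;/g;
        $title =~ s/>/&gt;/g;
        last;                       # stop at the first Title: line
    }
    return $title;
}

# Sample lines standing in for real pdfinfo output (invented for this sketch)
my @sample = (
    "Title:          Spinal   reflexes & <pain>\n",
    "Producer:       Some Distiller\n",
    "Pages:          12\n",
);
print "t\t", title_from_pdfinfo(@sample), "\n";
```

With real data you would feed it the lines read from the pdfinfo pipe instead of @sample; an empty return value means no title was found and the file name fallback should be used.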

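To illustrate the problem with taking the first /Title found in the raw file, here is a small sketch. The two object lines are an invented fragment, assuming a PDF in which an outline (bookmark) entry is written before the document information dictionary; real files can be laid out either way. The helper first_raw_title is hypothetical and simply applies the same pattern as the quoted code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented fragment of PDF objects: a bookmark entry comes first,
# the document information dictionary second.
my @raw = (
    "5 0 obj << /Title (Chapter 1) /Parent 4 0 R >> endobj\n",
    "9 0 obj << /Title (Annual Report) /Author (Someone) >> endobj\n",
);

# Hypothetical helper: return the first /Title string found in the lines
sub first_raw_title {
    foreach my $line (@_) {
        if ($line =~ /\/Title \(([^\/)]+)[\/\)]/i) {
            return $1;
        }
    }
    return "";
}

# Prints the bookmark text, not the document title
print first_raw_title(@raw), "\n";
```

Asking pdfinfo for the title avoids guessing which dictionary a given /Title belongs to.
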
-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
