According to Antti Rauramo:
> Yep, great, using xpdf's pdftotext helped! Now searching PDFs also works
> flawlessly! Thank you!
Glad to help.
> > You may want to adapt the script to extract titles from PDFs using
> > pdfinfo, if the titles matter to you. (That's something on my to-do
> > list I can't seem to find the time for.)
>
> Oops, heh, I didn't read this far before adding a part to parse_doc.pl that
> reads the PDF and finds the title. (Though this may have problems with
> encrypted PDFs.) Here's the cut, beginning around line 152...
>
>
> #############################################
> # print out the title
> #@temp = split(/\//, $ARGV[2]); # get the filename, get rid of basename
> #print "t\t$type Document $temp[-1]\n"; # print it
>
> ### 13-08-1999 ant
> open(TITLEIN,"<$ARGV[0]") || print STDERR "$ARGV[0]: $!\n";
> while(<TITLEIN>){
> if(/title/i){
> ($pdftitle)=/\/Title \(([^\/)]+)[\/\)]/i;
> $pdftitle && last;
> }
> }
> close TITLEIN;
>
> $pdftitle=~s/\\(\d\d\d)/chr(oct($1))/ge;
> if(!$pdftitle){ $pdftitle="$type Document $temp[-1]"; }
> print "t\t$pdftitle\n";
I don't know how well pdftotext and pdfinfo deal with encrypted PDFs either.
I think they need patches for this, and somehow need to be given the
decryption keys.
I do see a problem with your approach, though. The first /Title definition
isn't necessarily the one you want. It all depends on how the dictionaries
are laid out in the PDF. Here are my recent changes to parse_doc.pl, which
I posted to http://www.htdig.org/files/contrib/parsers/ and to the 3.2
source tree:
Index: contrib/parse_doc.pl
===================================================================
RCS file: /opt/htdig/cvs/htdig3/contrib/parse_doc.pl,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- contrib/parse_doc.pl 1999/03/22 21:39:46 1.5
+++ contrib/parse_doc.pl 1999/08/12 22:11:38 1.6
@@ -27,6 +27,11 @@
# (in PDFs) & remove multiple punct. chars. between words (all)
# 1999/03/10
# Changed: fix handling of minimum word length <[EMAIL PROTECTED]>
+# 1999/08/12
+# Changed: adapted for xpdf 0.90 release <[EMAIL PROTECTED]>
+# Added: uses pdfinfo to handle PDF titles <[EMAIL PROTECTED]>
+# Changed: keep hyphens by default, as htdig <[EMAIL PROTECTED]>
+# does, but change dashes to hyphens
#########################################
#
# set this to your MS Word to text converter
@@ -49,11 +54,13 @@
#
$CATPS = "/usr/bin/ps2ascii";
#
-# set this to your PDF to text converter
-# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+# set this to your PDF to text converter, and pdfinfo tool
+# get it from the xpdf 0.90 package at http://www.foolabs.com/xpdf/
#
$CATPDF = "/usr/bin/pdftotext";
+$PDFINFO = "/usr/bin/pdfinfo";
#$CATPDF = "/usr/local/bin/pdftotext";
+#$PDFINFO = "/usr/local/bin/pdfinfo";
# need some var's
$minimum_word_length = 3;
@@ -64,6 +71,7 @@
@fields = ();
$calc = 0;
$dehyphenate = 0;
+$title = "";
#
# okay. my programming style isn't that nice, but it works...
@@ -97,11 +105,25 @@
}
} elsif ($magic =~ /%PDF-/) { # it's PDF (Acrobat)
$parser = $CATPDF;
- $parsecmd = "$parser $ARGV[0] - |";
-# kludge to handle multi-column PDFs... (needs patched pdftotext)
-# $parsecmd = "$parser -rawdump $ARGV[0] - |";
+ $parsecmd = "$parser -raw $ARGV[0] - |";
+# to handle single-column, strangely laid out PDFs, use coalescing feature...
+# $parsecmd = "$parser $ARGV[0] - |";
$type = "PDF";
$dehyphenate = 1; # PDFs often have hyphenated lines
+ if (open(INFO, "$PDFINFO $ARGV[0] 2>/dev/null |")) {
+ while (<INFO>) {
+ if (/^Title:/) {
+ $title = $_;
+ $title =~ s/^Title:\s+(.*[^\s])\s*$/$1/;
+ $title =~ s/\s+/ /g;
+ $title =~ s/&/\&amp\;/g;
+ $title =~ s/</\&lt\;/g;
+ $title =~ s/>/\&gt\;/g;
+ last;
+ }
+ }
+ close INFO;
+ }
} elsif ($magic =~ /WPC/) { # it's WordPerfect
$parser = $CATWP;
$parsecmd = "$parser $ARGV[0] |";
@@ -135,7 +157,8 @@
s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+\s+|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+$/ /g; # replace reading-chars with space (only at end or begin of word, but allow multiple characters)
#
s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g; # replace reading-chars with space (only at end or begin of word)
# s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g; # rigorously replace all by <[EMAIL PROTECTED]>
- s/[\-\255]/ /g; # replace hyphens with space
+# s/[\-\255]/ /g; # replace hyphens with space
+ s/[\255]/-/g; # replace dashes with hyphens
@fields = split; # split up line
next if (@fields == 0); # skip if no fields (does it speed up?)
for ($x=0; $x<@fields; $x++) { # check each field if string length >= 3
@@ -150,15 +173,19 @@
exit unless @allwords > 0; # nothing to output
#############################################
-# print out the title
-@temp = split(/\//, $ARGV[2]); # get the filename, get rid of basename
-print "t\t$type Document $temp[-1]\n"; # print it
+# print out the title, if it's set, and not just a file name
+if ($title !~ /^$/ && $title !~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
+ print "t\t$title\n";
+} else { # otherwise generate a title
+ @temp = split(/\//, $ARGV[2]); # get the filename, get rid of basename
+ print "t\t$type Document $temp[-1]\n"; # print it
+}
#############################################
# print out the head
-$head =~ s/^\s+//g;
-$head =~ s/\s+$//g;
+$head =~ s/^\s+//; # remove leading and trailing space
+$head =~ s/\s+$//;
$head =~ s/\s+/ /g;
$head =~ s/&/\&amp\;/g;
$head =~ s/</\&lt\;/g;
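
For anyone who wants to try the title handling outside parse_doc.pl, here is a minimal standalone sketch of the same idea. It is not the code from the patch: the helper names title_from_info and usable_title are made up for illustration, and the sample lines only imitate the "Key: value" layout that pdfinfo prints, so the script never runs pdfinfo itself. It picks out the Title line, collapses whitespace, escapes the characters that matter in HTML, and falls back to a generated title when the result is empty or looks like a bare DOS file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in parse_doc.pl): find the title in
# pdfinfo-style "Key: value" lines and escape it for HTML output.
sub title_from_info {
    my @lines = @_;
    for my $line (@lines) {
        next unless $line =~ /^Title:\s+(.*\S)\s*$/;
        my $title = $1;
        $title =~ s/\s+/ /g;    # collapse runs of whitespace
        $title =~ s/&/&amp;/g;  # escape & first, so the entities added below stay intact
        $title =~ s/</&lt;/g;
        $title =~ s/>/&gt;/g;
        return $title;          # stop at the first Title line
    }
    return "";                  # no usable Title line found
}

# Same test as in the patch: reject empty titles and titles that are
# just a DOS-style path to a PDF file.
sub usable_title {
    my ($title) = @_;
    return $title ne "" && $title !~ /^[A-G]:\S+\.[Pp][Dd][Ff]$/;
}

# Made-up sample lines, for illustration only.
my @sample = (
    "Title:        Annual   Report <draft> & Notes\n",
    "Producer:     Acrobat Distiller\n",
);

my $title = title_from_info(@sample);
if (usable_title($title)) {
    print "t\t$title\n";
} else {
    print "t\tPDF Document sample.pdf\n";   # generated fallback title
}
```

Escaping the ampersand before the angle brackets matters: done the other way around, the & in the freshly inserted entities would be escaped a second time.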
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930