Hello Gilles!
> On the other hand, pdftotext (part of the xpdf package) seems to handle
[...]
Yep, great, switching to xpdf's pdftotext helped! Searching PDFs now works
flawlessly, too. Thank you!
> You may want to adapt the script to extract titles from PDFs using
> pdfinfo, if the titles matter to you. (That's something on my to-do
> list I can't seem to find the time for.)
Oops, heh, I hadn't read this far before I'd already added a part to parse_doc.pl
which reads the PDF and finds the title. (Though this may have problems with
encrypted PDFs.)
Here's the excerpt, beginning around line 152...
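#############################################
# print out the title
@temp = split(/\//, $ARGV[2]); # split the path; $temp[-1] is the bare filename
#print "t\t$type Document $temp[-1]\n"; # original: always print the filename
### 13-08-1999 ant
# Scan the raw PDF for a "/Title (...)" entry and use it as the title.
$pdftitle = "";
if (open(TITLEIN, "<$ARGV[0]")) {
    while (<TITLEIN>) {
        if (/\/Title \(([^\/)]+)[\/\)]/i) {
            $pdftitle = $1;
            last; # stop at the first title found
        }
    }
    close TITLEIN;
} else {
    print STDERR "$ARGV[0]: $!\n";
}
# decode octal escapes such as \344 into the characters they stand for
$pdftitle =~ s/\\(\d\d\d)/pack("C", oct($1))/ge;
# fall back to the filename when no title was found
if (!$pdftitle) { $pdftitle = "$type Document $temp[-1]"; }
print "t\t$pdftitle\n";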
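
For comparison, the pdfinfo route you mention might look roughly like the
sketch below. I haven't tried it inside parse_doc.pl; the sub names
parse_pdfinfo_title and pdfinfo_title are made up for illustration, and it
assumes that pdfinfo is on the PATH and prints a "Title:" line in its output.

```perl
# Sketch only: pull the title out of pdfinfo's text output.
# parse_pdfinfo_title and pdfinfo_title are hypothetical names,
# not part of parse_doc.pl.

# Return the value of the "Title:" line in pdfinfo output, or "" if none.
sub parse_pdfinfo_title {
    my ($output) = @_;
    foreach my $line (split(/\n/, $output)) {
        if ($line =~ /^Title:\s*(.*?)\s*$/) {
            return $1;
        }
    }
    return "";
}

# Run pdfinfo on a file and return its title, or "" if nothing was found.
# Assumes pdfinfo (from the xpdf package) is on the PATH.
sub pdfinfo_title {
    my ($file) = @_;
    # list-form pipe open avoids passing the filename through a shell
    open(my $fh, "-|", "pdfinfo", $file) or return "";
    local $/;               # slurp the whole output at once
    my $output = <$fh>;
    close($fh);
    return parse_pdfinfo_title(defined $output ? $output : "");
}
```

In parse_doc.pl the result could then feed the same fallback as above, e.g.
$pdftitle = pdfinfo_title($ARGV[0]); followed by the existing filename check.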
--
- Antti Rauramo, WWW and database specialist, Edita Verkkoviestintä
- [EMAIL PROTECTED], +358-9-8501 4004 (mobile)