Re: [htdig] PDF & ISO-Latin chars

Antti Rauramo Fri, 13 Aug 1999 03:43:51 -0700

Hello Gilles!

> On the other hand, pdftotext (part of the xpdf package) seems to handle

[...]

Yep, great, using xpdf's pdftotext helped! Now searching PDFs works
flawlessly too! Thank you!



> You may want to adapt the script to extract titles from PDFs using
> pdfinfo, if the titles matter to you.  (That's something on my to-do
> list I can't seem to find the time for.)

Oops, heh, I didn't read this far before adding a part to parse_doc.pl that
reads the PDF and finds the title. (Though this may have problems with encrypted PDFs.)
Here's the relevant part, beginning around line 152...


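#############################################
# print out the title
@temp = split(/\//, $ARGV[2]);           # get the filename, get rid of the path
#print "t\t$type Document $temp[-1]\n";  # original behaviour: print it

### 13-08-1999 ant
# scan the raw PDF for a "/Title (...)" entry and use it as the title
$pdftitle = "";
if (open(TITLEIN, "<$ARGV[0]")) {
  while (<TITLEIN>) {
    if (/\/Title \(([^\/)]+)[\/)]/i) {
      $pdftitle = $1;
      last;                              # stop at the first title found
    }
  }
  close TITLEIN;
} else {
  print STDERR "$ARGV[0]: $!\n";
}

# decode octal escapes such as \344 into single ISO-Latin bytes
$pdftitle =~ s/\\(\d\d\d)/pack("C", oct($1))/ge;
# fall back to the old default title if nothing was found
if (!$pdftitle) { $pdftitle = "$type Document $temp[-1]"; }
print "t\t$pdftitle\n";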
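
For comparison, the pdfinfo route suggested above could look roughly like the
sketch below. It is untested against parse_doc.pl itself and makes a few
assumptions: that pdfinfo (from the xpdf package) is on the PATH, that it prints
one "Key:   value" pair per line including a "Title:" line when the PDF has a
title, and that the helper names title_from_pdfinfo_output and pdf_title are
made up for illustration. Since pdfinfo parses the PDF structure instead of
scanning raw bytes, it may cope better with files where the plain-text scan
fails, though that is not verified here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the title out of pdfinfo's text output.
# Assumes one "Key:   value" pair per line, e.g. "Title:   Some title".
sub title_from_pdfinfo_output {
    my ($output) = @_;
    foreach my $line (split /\n/, $output) {
        if ($line =~ /^Title:\s*(.*?)\s*$/) {
            return $1;
        }
    }
    return "";    # no Title line found
}

# Hypothetical wrapper: run pdfinfo on a file and return its title,
# or an empty string if pdfinfo cannot be run or reports no title.
sub pdf_title {
    my ($file) = @_;
    open(my $fh, "-|", "pdfinfo", $file) or return "";
    local $/;                       # read all of the output at once
    my $output = <$fh>;
    close $fh;
    return title_from_pdfinfo_output(defined $output ? $output : "");
}

# Usage sketch: print a title record in the same "t\t..." form as above.
# The fallback text "PDF Document" is an assumption, not parse_doc.pl's own.
if (@ARGV) {
    my $title = pdf_title($ARGV[0]);
    $title = "PDF Document" if $title eq "";
    print "t\t$title\n";
}
```

Inside parse_doc.pl, the call to pdf_title($ARGV[0]) would take the place of the
TITLEIN loop, with the existing "$type Document ..." fallback kept as it is.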


--
- Antti Rauramo, WWW and database specialist, Edita Verkkoviestintä
- [EMAIL PROTECTED], +358-9-8501 4004 (mobile)


