Also sprach Malka Cymbalista (at 11:15 AM 7/21/98 +0300) ...
>Can someone please explain how to get htdig to index pdf files. I'm new
>to this list and since I'm on it I don't remember seeing anything about
>how to install a patch. I checked the documentation but couldn't find
>anything.
First you need to install Adobe Acrobat Reader on your server. Get the
latest version from:
http://www.adobe.com
Second, you need to apply the patch that's included in htdig-pdf.tgz. I will
send this file to you separately, so as not to clutter the list. (Is there
a location for this file on the htdig site yet?) Don't compile it yet
because ...
Third, make the following changes, as pointed out by Sylvain Wallez:
<start quote>
The first one is a bug in PDF.cc (doesn't seem to happen on the PDF
files on my Intranet, but we only use Acrobat to produce them). Here's
the diff he sent me :
diff -c htdig/PDF.cc.old htdig/PDF.cc
*** htdig/PDF.cc.old Wed Jul 15 10:46:03 1998
--- htdig/PDF.cc Tue Jul 14 10:21:38 1998
***************
*** 280,286 ****
}
}
! else if (line == "BT")
{
// Beginning of text block
if (debug > 3)
--- 280,286 ----
}
}
! else if ( mystrncasecmp( line.get(), "BT", 2 ) == 0 )
{
// Beginning of text block
if (debug > 3)
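
To make the difference between the two tests concrete, here is a minimal
standalone sketch. It is not htdig code: it assumes POSIX strncasecmp as a
stand-in for htdig's mystrncasecmp and std::string in place of htdig's
String class, and the function names are hypothetical. The sample inputs
(a trailing carriage return, lowercase letters) are only guesses at what a
non-Acrobat PDF producer might emit.

```cpp
#include <cassert>
#include <string>
#include <strings.h>  // POSIX strncasecmp, standing in for mystrncasecmp

// Old test: the whole line must be exactly "BT".
bool is_bt_exact(const std::string& line) {
    return line == "BT";
}

// New test: only the first two characters are compared, ignoring case.
bool is_bt_prefix(const std::string& line) {
    return strncasecmp(line.c_str(), "BT", 2) == 0;
}

// Hypothetical checks showing where the two tests disagree.
void run_checks() {
    assert(is_bt_exact("BT") && is_bt_prefix("BT"));       // both accept a bare BT
    assert(!is_bt_exact("BT\r") && is_bt_prefix("BT\r"));  // trailing CR: only the new test matches
    assert(!is_bt_exact("bt") && is_bt_prefix("bt"));      // lowercase: only the new test matches
    assert(!is_bt_prefix("ET"));                           // other operators are still rejected
}
```

Note that the prefix test also accepts any longer line that merely starts
with "BT", so it is more permissive than the exact comparison.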
The second problem is that the default value for the "bad_extensions"
attribute contains .pdf, which causes all pdf files to be ignored by
htdig, even if a parser is available.
To correct this, you can either put a "bad_extensions" list without
".pdf" in your config file (this is what I did), or apply the following
patch to htcommon/defaults.cc :
diff -c htcommon/defaults.cc.old htcommon/defaults.cc
*** htcommon/defaults.cc.old Fri Aug 15 01:59:25 1997
--- htcommon/defaults.cc Mon Jul 13 19:37:33 1998
***************
*** 37,43 ****
{"add_anchors_to_excerpt", "true"},
{"allow_numbers", "false"},
{"allow_virtual_hosts", "true"},
! {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .pdf .class .map .ram"},
{"bad_word_list", "${common_dir}/bad_words"},
{"create_image_list", "false"},
{"create_url_list", "false"},
--- 37,43 ----
{"add_anchors_to_excerpt", "true"},
{"allow_numbers", "false"},
{"allow_virtual_hosts", "true"},
! {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram"},
{"bad_word_list", "${common_dir}/bad_words"},
{"create_image_list", "false"},
{"create_url_list", "false"},
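
As a rough illustration of why the extension list matters, the sketch below
shows a URL being rejected by its extension before any parser could be
consulted. This is not htdig's actual implementation: has_bad_extension is
a hypothetical helper, the example URLs are made up, and the case-sensitive
suffix match is an assumption.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not htdig's real code: returns true when `url` ends
// with one of the space-separated extensions listed in `bad_extensions`.
bool has_bad_extension(const std::string& url, const std::string& bad_extensions) {
    std::istringstream list(bad_extensions);
    std::string ext;
    while (list >> ext) {
        // Compare the tail of the URL against this extension (case-sensitive).
        if (url.size() >= ext.size() &&
            url.compare(url.size() - ext.size(), ext.size(), ext) == 0) {
            return true;
        }
    }
    return false;
}

// The two default lists from the diff above, old and patched.
const std::string old_list =
    ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .pdf .class .map .ram";
const std::string new_list =
    ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram";
```

With old_list, has_bad_extension("http://example.com/manual.pdf", old_list)
is true, so the file would be skipped; with new_list it is false, while a
.gif URL is still rejected by both lists.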
Thanks to M.J. Long for bug hunting.
<end quote>
Now you can run configure, make clean, make, and make install. Voila, PDF
parsing!
.........................................................................
Colin Viebrock Creative Director - Private World Communications
[EMAIL PROTECTED] http://www.privateworld.com
ICQ: 11386088
There are threee erors in this sentence.