PDF (Acrobat) parser for ht://Dig
---------------------------------

Written by Sylvain Wallez, wallez@mail.dotcom.fr


This patch adds a built-in PDF parser to htDig. It is based on htDig 3.0.8b2.

Parsing is done on the PostScript translation of PDF files produced by Acrobat
Reader (acroread), which is freely available for most platforms at www.adobe.com.

Using acroread as a translator avoids writing a complicated PDF
parser that can handle the various compression mechanisms available in PDF.
It also keeps this parser up to date when Adobe issues a new
release of the PDF specification (the PDF spec is available at www.adobe.com).

Installing the PDF parser :
-------------------------
In directory htdig-3.0.8b2/htdig :
- copy files PDF.h and PDF.cc

- if your Document.cc file is the original 3.0.8b2 one, also copy Document.cc;
  otherwise apply the diffs in Document.cc.diff

In file htdig-3.0.8b2/htdig/.sniffdir/ofiles.incl, add " PDF.o" after "main.o".

Now, make and install htdig.

Using the PDF parser :
--------------------
The location of the acroread binary should be specified by a new "acroread"
attribute in the config file (htdig.conf), e.g. :
  acroread /opt/acroread/bin/acroread

If this attribute is not specified, the parser assumes acroread is in the
PATH and simply calls "acroread".
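
The fallback described above can be sketched as follows. This is only an
illustration, not the code from PDF.cc: the Config typedef and the
acroreadCommand function are hypothetical names, and a plain std::map stands
in for ht://Dig's own configuration object.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the attributes parsed from htdig.conf;
// the real patch reads them through ht://Dig's configuration object.
typedef std::map<std::string, std::string> Config;

// Return the command used to launch Acrobat Reader: the value of the
// "acroread" attribute when it is set and non-empty, otherwise the bare
// name "acroread", which leaves the lookup to the PATH.
std::string acroreadCommand(const Config &config)
{
    Config::const_iterator it = config.find("acroread");
    if (it != config.end() && !it->second.empty())
        return it->second;
    return "acroread";
}
```

With the example configuration line above, this sketch would return
/opt/acroread/bin/acroread; with no "acroread" attribute it returns "acroread".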

Once this is done, reindex your sites.

Tips :
----
max_doc_size attribute :
The default max_doc_size of 100k is often too low for PDF files, causing many of
them to be ignored. In my configuration file, I increased it to 1 MB, i.e. :
  max_doc_size 1000000
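
To choose a value, it can help to count how many of your PDF files exceed a
candidate limit. The sketch below assumes, as described above, that a file
larger than max_doc_size is not indexed; countOversized is a hypothetical
helper and is not part of htDig.

```cpp
#include <cstddef>
#include <vector>

// Count the files whose size in bytes is strictly greater than the
// candidate max_doc_size. Under the assumption stated above, these are
// the files that would be ignored at indexing time.
std::size_t countOversized(const std::vector<long> &sizes, long maxDocSize)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        if (sizes[i] > maxDocSize)
            ++count;
    }
    return count;
}
```

Feeding it the sizes of the PDF files on your site with a few candidate limits
shows how many files each setting would leave out.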

PDF file titles :
Using this parser on our Intranet has shown that people very often do not
bother to give a meaningful title to their PDF files. The effect is that
search results show cryptic titles, which are the original documents' file names.
So, inform your PDF producers of this point to get nice search results.

Caveats :
-------
- You need to have acroread installed on the computer used for indexing.
- PDF 1.2 files can contain hrefs, and they are not included in the PostScript
  translation, preventing htDig from following these links.
- Generated PostScript files can be large. You need free space on your
  temp filesystem.

Thanks :
------
Many thanks to Andrew Scherpbier for this great tool.
ht://Dig is nice on both sides : on the outside, it's very easy to set up and
use and produces clean results, even if disk usage is high :-)
On the inside, the code is very well written and easy to understand.
I hope htdig 4 will be even better (like Andrew, I now use and like Java).
