According to Ronald Edward Petty:
> I was wondering, how does an external parser work for htdig? I got
> xpdf installed and ran htdig, and I'm in the midst of trying to figure
> out if it worked to index them. But my question is, when htdig calls the
> external parser from the htdig.conf, I assume if it hits one of the rules
> for pdf it calls that parser. However, how does htdig know to index the
> results? I mean, does htdig know that pdftotext makes a .txt file of the
> same name? And then does it destroy it after it has indexed it? Just
> wondering.
> Ron
Let's take a few steps back, first, and look at how htdig fetches
documents. When it sends a request to the HTTP server for a particular
URL, the server responds with a number of headers, as well as the entire
contents of the document. Those headers tell htdig, among other things,
the Last-Modified date of the document, and the Content-Type of the
document.
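For example, the response for a PDF document might begin with headers
roughly like these (the values here are made up for illustration):

    HTTP/1.1 200 OK
    Last-Modified: Tue, 05 Mar 2002 14:32:10 GMT
    Content-Type: application/pdf
    Content-Length: 183420
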
The Content-Type is what htdig keys on to determine how to parse
the document. It first checks the external_parsers attribute to see
if any external parser or external converter is defined for that type.
If there is one, htdig will call it; otherwise, it will see if an internal
parser is available for that type. If no external or internal parser
is available at all, it will reject the document as not parsable (or,
in the 3.1.x series, with the somewhat misleading "not HTML" error).
Now, when there is an external parser or converter, htdig writes the
whole document it fetched (up to a maximum size of max_doc_size), which
until this point was held only in a large string buffer, into a temporary
file. It then calls the external parser or converter, passing it the
temporary file name, content-type, document URL and htdig config file
name as arguments, and it expects the results on the standard output of
the parser or converter. htdig reads those results through a Unix/Linux
pipe which it opened before running the parser or converter.
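Just to illustrate, with the application/pdf definition shown further
below, the call might look something like this (the temporary file name
and paths here are made up):

    /usr/local/bin/conv_doc.pl /tmp/htdext.12345 application/pdf \
        http://www.example.com/manual.pdf /etc/htdig/htdig.conf

with the converted output read back from the script's standard output
through that pipe.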
Up until now, the treatment was the same for an external parser or an
external converter. The difference is the format of the resulting output,
and how htdig deals with it. If the definition is something like:
external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl
then htdig knows to expect HTML output from the program it calls.
This is therefore an external converter - it converts one content-type
to a different content-type. htdig reads the results into another
large string buffer, and the whole process essentially starts over again
for the new content-type. Usually it will just pass the results to an
internal parser (i.e. for text/html or text/plain), and that's the end
of it. However, you could conceivably chain together two or more external
converters in this manner, before the final result actually gets parsed,
either by one of the internal parsers, or an external parser.
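For example, if you had some hypothetical wp2pdf.pl script that turned
word processor documents into PDF on its standard output (that script is
made up just for illustration), you could chain it with the PDF converter
along these lines:

    external_parsers: application/msword->application/pdf /usr/local/bin/wp2pdf.pl \
                      application/pdf->text/html /usr/local/bin/conv_doc.pl

htdig would call wp2pdf.pl first, treat its output as application/pdf,
then call conv_doc.pl on that, and finally hand the resulting HTML to
the internal text/html parser.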
If the external_parsers definition being used contains just a single
content-type value with no "->", then htdig knows it's calling an
external parser rather than a converter, so there will not be any further
conversion or re-parsing involved. Instead, htdig expects the external
parser to do the actual parsing, and output simple records indicating
the different elements of the document, i.e. individual words and their
location and context, individual links, titles, excerpt, and so on,
all in the format laid out in the ht://Dig documentation.
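Just to give the flavour of it (see the external parser documentation
for the exact field layout; this is only a rough illustration), the
output is a stream of one-line records, with fields separated by tab
characters, along these lines:

    w <tab> word <tab> location <tab> heading flags
    u <tab> http://www.example.com/other.html <tab> link description
    t <tab> title of the document
    h <tab> excerpt to show in the search results
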
As you might expect, writing a complete external parser is much more
involved, and much trickier to get right, than simply writing an external
converter. This is why we've recommended writing converters rather than
parsers ever since the support for external converters was put in place.
To answer your question more directly, about how htdig knows that
pdftotext makes a .txt file, the answer is it doesn't. htdig doesn't
call pdftotext directly, but instead calls a script that handles all the
details of calling the actual conversion program and getting the results
just the way htdig expects. In the case of a PDF file, conv_doc.pl and
doc2html.pl both call pdfinfo first to get the PDF title, then they call
pdftotext with a "-" as the output file name, so they can read the output
from the standard output, rather than from a generated .txt file, again
reading the output through another pipe. Both scripts do some simple
post-processing, like dehyphenation, and spit out the text as very
simple, minimally formatted HTML, so that the internal text/html
parser can pick out the title from the text. The only temporary file
involved is the one in which the PDF file is stored after fetching, and
htdig knows to remove it after the external converter (or parser) is
done with it. Everything else is handled through pipes.
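To give you a feel for what such a wrapper does, here is a stripped-down
sketch along the lines of what conv_doc.pl does for PDFs. This is only an
illustration, not the real script; it leaves out the option handling,
error checking, HTML escaping and dehyphenation that the real scripts
take care of, and it assumes pdfinfo and pdftotext are on the PATH:

  #!/usr/bin/perl -w
  use strict;

  # htdig passes: temp file name, content type, URL, config file name.
  my ($tmpfile, $ctype, $url, $config) = @ARGV;

  # Pull the document title out of pdfinfo's output.
  my $title = "";
  open(INFO, "-|", "pdfinfo", $tmpfile) or die "pdfinfo: $!\n";
  while (<INFO>) {
      $title = $1 if /^Title:\s*(.*\S)/;
  }
  close(INFO);

  # Ask pdftotext to write the extracted text to its standard output
  # ("-"), and read it back through a pipe rather than from a .txt file.
  open(TEXT, "-|", "pdftotext", $tmpfile, "-") or die "pdftotext: $!\n";

  # Wrap the text in very simple HTML so htdig's internal text/html
  # parser can pick out the title and index the body.
  print "<html><head><title>$title</title></head><body>\n";
  print while <TEXT>;
  close(TEXT);
  print "</body></html>\n";
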
It would be possible for htdig to call a conversion program directly,
without a wrapper script of this sort, but only if the conversion
program correctly handled the arguments htdig passes (or safely ignored
those it doesn't need) and wrote its output to the standard output just
as htdig expects.
I hope this rather lengthy explanation helps clarify matters.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930