According to Richard Burns:
> Hi everyone,
> 
> I'm an Ht://dig newbie who has been thru the FAQ several times on the role of the
> external parsers,
> and find myself with a couple of questions that I hope can be answered (or at least
> some guidance or opinion offered) here.
> 
> The FAQ is a bit ambiguous about which parsers to use.
>            - On the one hand it recommends using those on the "contributions" area of
> the htdig site ( which wasn't working btw, but a mirror was) ;
>            - and on the other hand seems to more strongly suggest the doc2html.pl and
> related programs (written by David Adams (University of Southampton), and based on the
> conv_doc.pl script by Gilles Detillieux) as a more "complete" solution. However, on
> inspection of this latter solution, it relies on a number of other items that are, in
> turn, somewhat scattered across the web.

doc2html is one of the 3 parsers or converters available in the contrib
area, and it is the best of the 3, with conv_doc.pl coming in 2nd.  The
parse_doc.pl script should really only be used if you're stuck with a
version of htdig below 3.1.4.  The FAQ entries on external parsers haven't
been updated much since 3.1.4 came out, and they do need to be updated.

All 3 of these scripts rely on a number of other conversion filters to
handle the various document types.  doc2html handles the largest selection
of types.  The documentation (or program comments) for all 3 scripts tell
you where you can find the various conversion filters.  You only need to
install filters for the document types you need to convert.  I use xpdf
for PDF files.  Some people also use catdoc for Word files, while others
need a more extensive set of filters because they're indexing all sorts
of things.
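
As a quick way to see which filters you still need to install, here is
a minimal Python sketch.  The FILTERS table and the missing_filters()
helper are just names made up for illustration, not part of doc2html
or htdig.  It assumes pdftotext (from xpdf) and catdoc as the programs
for PDF and Word files; adjust it for the types you care about.

```python
import shutil

# Hypothetical table: document type -> filter program expected on PATH.
# pdftotext comes with xpdf; catdoc is the Word filter mentioned above.
FILTERS = {
    "application/pdf": "pdftotext",
    "application/msword": "catdoc",
}


def missing_filters(filters, which=shutil.which):
    """Return the document types whose filter program is not on PATH."""
    return sorted(mime for mime, prog in filters.items() if which(prog) is None)


if __name__ == "__main__":
    for mime in missing_filters(FILTERS):
        print(f"no filter installed for {mime}: {FILTERS[mime]}")
```

Document types you never index can simply be left out of the table.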

> This leads me to want to ask the community that "knows": what is best to do in a
> practical sense? (I need to be concerned about Word 97 docs, PDFs (various origins)
> and ppt presentations, as well as some Visio and other docs. Basically I am
> experimenting with using ht://dig as the search engine across some intranet-accessible
> drives that contain 8000 documents devoted to internal business systems projects.
> Apache is being used to html'ize the directory structure on these drives. Probably 30%
> of the documents are already saved in html beside their .doc / .pdf / .ppt ...
> originals.)

You definitely want to go with doc2html.  It has code to handle ppt, given
the proper filter for it, and it's most easily extended for handling other
document types, by adding the necessary entries for other filters.

> Is there a particular mirror, sourceforge project,  or other site, that has it ( the
> external parsers) all in one place?

Not that I know of.  If you're running Red Hat Linux, you may find some or
all the filters you need in RPM form, either right in the distribution, in
the Powertools, or in their contrib files, on your favorite mirror site.
Some other distributions may also have collections of precompiled 3rd
party code.  Other than that, just follow the URLs in the DETAILS file
that comes with doc2html.

> Is there some other question I really should be asking myself before going down this
> external parser road?
> 
> The "convert document of type "yyy" to html" seems a pretty generic need; and although
> many vendors offer ways inside their products to do this a common open source tool
> doing this from the outside seems like a good idea. Is there any project for this
> beyond the doc2html.pl one noted above?

Again, not that I know of.  The thing is, while HTML is a pretty generic
format, most of the other ones are more specialised and/or proprietary,
so I wouldn't expect one single author or group of people to do a good
job converting all these formats to HTML.  So, it seems to me that the
doc2html approach is a sound one: leave the fiddly details of reading
specific formats to the experts in those formats, and just provide the
glue to put it all together.
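
To make the "glue" idea concrete, here is a rough Python sketch of the
general pattern.  It is not doc2html's actual code (that is a Perl
script), and the CONVERTERS table, build_command() and convert() are
hypothetical names.  It assumes pdftotext and catdoc are installed and
write their output to standard output when called as shown.

```python
import subprocess

# Hypothetical table: document type -> command template.
# "{path}" is replaced by the file to convert.
CONVERTERS = {
    "application/pdf": ["pdftotext", "{path}", "-"],  # "-" sends text to stdout
    "application/msword": ["catdoc", "{path}"],
}


def build_command(mime_type, path, converters=CONVERTERS):
    """Look up the external filter for a document type and fill in the path."""
    try:
        template = converters[mime_type]
    except KeyError:
        raise ValueError(f"no converter configured for {mime_type}") from None
    return [path if arg == "{path}" else arg for arg in template]


def convert(mime_type, path):
    """Run the external filter and return whatever text it produced."""
    result = subprocess.run(
        build_command(mime_type, path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Supporting another document type then only means adding one entry to
the table, with the format-specific work left to the external filter.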

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
