According to Wim Kosten:
> Using the external converter switch (application/pdf->text/html) I index 
> PDF files which works perfectly
> however the PDF files (about 8000) all have undescriptive titles like 
> "Word doc 2" instead of "Proposal for the yearly members meeting".
> 
> In order to properly show the titles I use a small script which takes the 
> ASCII dump (-t switch), rewrites it with correct titles,
> and loads it into the DB2 database with htload. After that I run 
> htmerge and htfuzzy and it all seems to work ...
> 
> But for a weird reason I can't search for words in those patched titles. 
> If I wanted to search for proposal my example patched
> document would not be found.
> 
> I started digging deeper and I (just for the testcase) used a small 
> script which returns "nanananana" instead of the PDF title returned by 
> pdfinfo
> and guess what: the "nanananana" will be found. So that seemed to be the 
> solution, but .... when htdig calls the converter it gives the
> tmp (downloaded PDF) name instead of the actual document name. In this 
> way it's pretty hard to set the correct title if I only know the tmp name
> and not the PDF file or the location (URL) of the document.
> 
> Is it a (good) idea to do rewrites straight into the DB2 database and do 
> I have to reindex etc or are there better options. Is it an idea to use the
> external parser instead of the external converter and write my own htdig 
> records ?

I would see making changes to the databases after the fact as an absolute
last recourse.  I'm a big believer in external converters, rather than
external parsers, because they still give you pretty much all the control
you need but without a lot of the added complexities.  External parsers
are very hard to get right, but it's relatively easy to generate HTML
code that will get parsed the way you want.

If you have a technique for getting the titles you want for your PDF files,
then it should be fairly easy to fit this into the pdf2html.pl script.
External converter (or parser) scripts get the full URL to the document
as their third argument, after the temporary file name and the MIME type.
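
A minimal sketch of how a converter could use that third argument, written
in Python rather than the Perl of pdf2html.pl, and not taken from that
script. The TITLES table, the example URL, and the function names are
hypothetical placeholders. It also assumes pdftotext is installed and on
the PATH, and that the script is configured as the application/pdf->text/html
converter.

```python
#!/usr/bin/env python3
"""Sketch of an ht://Dig external converter (application/pdf -> text/html)."""
import html
import subprocess
import sys

# Hypothetical URL-to-title table; in practice this would be loaded from
# wherever the descriptive titles are kept.
TITLES = {
    "http://www.example.org/docs/proposal.pdf":
        "Proposal for the yearly members meeting",
}


def lookup_title(url, titles, fallback):
    """Return the descriptive title for url, or fallback if none is known."""
    return titles.get(url, fallback)


def build_html(title, body_text):
    """Wrap the extracted text in minimal HTML carrying the chosen title."""
    return (
        "<html><head><title>{}</title></head>\n"
        "<body><pre>\n{}\n</pre></body></html>\n"
    ).format(html.escape(title), html.escape(body_text))


def main(tmp_file, mime_type, url):
    # mime_type is accepted for completeness but unused in this sketch.
    # Assumes pdftotext is available; "-" sends the text to stdout.
    text = subprocess.run(
        ["pdftotext", tmp_file, "-"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Key the lookup on the URL (third argument), not the temporary file name.
    title = lookup_title(url, TITLES, fallback=url)
    sys.stdout.write(build_html(title, text))


if __name__ == "__main__":
    if len(sys.argv) >= 4:
        main(sys.argv[1], sys.argv[2], sys.argv[3])
    else:
        print("usage: converter TMPFILE MIMETYPE URL", file=sys.stderr)
```

Because the title ends up in the generated HTML, htdig indexes it along
with the rest of the document, so no changes to the databases after the
fact should be needed.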

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
