[htdig-dev] Updating document titles

Wim Kosten Wed, 19 Nov 2003 02:47:05 -0800

I've got the following challenge:

Using the external converter switch (application/pdf->text/html), I index PDF files, which works perfectly. However, the PDF files (about 8000) all have undescriptive titles like "Word doc 2" instead of "Proposal for the yearly members meeting".

To show proper titles, I use a small script that takes the ASCII dump (-t switch) and rewrites it with the correct titles; I then load the result into the DB2 database with htload. After that I run htmerge and htfuzzy, and it all seems to work ...
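
The title-patching step can be sketched roughly as below. This is a minimal illustration, not the actual script. It assumes each line of the ASCII dump is a tab-separated list of "key:value" fields, with "u:" holding the URL and "t:" holding the title; that layout should be checked against a real dump before relying on it. The function name patch_title and the url-to-title mapping are hypothetical.

```python
def patch_title(line, titles):
    """Return one dump line with its title field replaced when a better title is known.

    Assumption: fields are tab-separated "key:value" pairs, "u:" is the URL
    and "t:" is the title. `titles` maps URL -> corrected title.
    """
    fields = line.rstrip("\n").split("\t")

    # Look up the document URL from the (assumed) "u:" field.
    url = next((f[2:] for f in fields if f.startswith("u:")), None)
    new_title = titles.get(url)
    if new_title is None:
        # No replacement known for this document: leave the line untouched.
        return "\t".join(fields)

    # Swap only the (assumed) "t:" field, keeping every other field as-is.
    return "\t".join(
        "t:" + new_title if f.startswith("t:") else f
        for f in fields
    )
```

A corrected title must not itself contain tabs or newlines, since those would break the assumed field layout.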

But for some odd reason I can't search for words in those patched titles. If I search for "proposal", my example patched document is not found.

I started digging deeper and, just as a test case, used a small script that returns "nanananana" instead of the PDF title returned by pdfinfo. Guess what: "nanananana" is found. So that seemed to be the solution, but when htdig calls the converter it passes the temporary name of the downloaded PDF instead of the actual document name. That makes it pretty hard to set the correct title, since I only know the temporary name and not the PDF file name or the location (URL) of the document.

Is it a good idea to do the rewrites straight in the DB2 database, and would I then have to reindex, or are there better options? Would it be better to use an external parser instead of the external converter and write my own htdig records?

Cheers,

Wim Kosten

