Re: [htdig-dev] Updating document titles

Wim Kosten Wed, 19 Nov 2003 11:55:23 -0800

Hi Gilles,

As my external converter I use conv_doc.pl, which handles PDF, Word, RTF and Excel, and when I tested it, it of course only passed the temporary file name on to pdfinfo / pdftotext.

I did some small tests with my own script, which still calls pdfinfo / pdftotext to generate an HTML document but can now use an Oracle DB as a fallback for titles and/or meta info.

Thanks !!

Wim

Gilles Detillieux wrote:

According to Wim Kosten:

Using the external converter switch (application/pdf->text/html) I index PDF files, which works perfectly. However, the PDF files (about 8000) all have undescriptive titles like "Word doc 2" instead of "Proposal for the yearly members meeting".

In order to show proper titles I use a small script which takes the ASCII dump (-t switch), rewrites it with the correct titles, and loads it into the DB2 database with htload. After that I run htmerge and htfuzzy and it all seems to work ...

But for some odd reason I can't search for words in those patched titles. If I search for "proposal", my example patched document is not found.

I started digging deeper and (just as a test case) used a small script which returns "nanananana" instead of the PDF title returned by pdfinfo, and guess what: "nanananana" is found. So that seemed to be the solution, but when htdig calls the converter it passes the tmp name (of the downloaded PDF) instead of the actual document name. That makes it pretty hard to set the correct title, since I only know the tmp name and not the PDF file name or the location (URL) of the document.

Is it a (good) idea to do rewrites straight into the DB2 database, and do I have to reindex etc., or are there better options? Is it an idea to use the external parser instead of the external converter and write my own htdig records?

I would see making changes to the databases after the fact as an absolute
last recourse.  I'm a big believer in external converters, rather than
external parsers, because they still give you pretty much all the control
you need but without a lot of the added complexities.  External parsers
are very hard to get right, but it's relatively easy to generate HTML
code that will get parsed the way you want.

If you have a technique for getting the titles you want for your PDF files,
then it should be fairly easy to fit this into the pdf2html.pl script.
External converter (or parser) scripts get the full URL to the document
as their third argument, after the temporary file name and the MIME type.
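
To illustrate the argument order described above, here is a minimal sketch of a converter in Python. It is not pdf2html.pl or conv_doc.pl. The TITLE_OVERRIDES table is a hypothetical stand-in for whatever lookup (such as the Oracle DB mentioned in this thread) maps a URL to a better title, and the example URL is made up. The sketch assumes pdfinfo and pdftotext are on the PATH.

```python
#!/usr/bin/env python3
"""Sketch of an external converter (application/pdf -> text/html)."""
import html
import subprocess
import sys

# Hypothetical stand-in for a database lookup: document URL -> title.
TITLE_OVERRIDES = {
    "http://www.example.com/docs/proposal.pdf":
        "Proposal for the yearly members meeting",
}


def pdfinfo_title(pdf_path):
    """Return the Title field reported by pdfinfo, or '' if there is none."""
    result = subprocess.run(["pdfinfo", pdf_path],
                            capture_output=True, text=True, check=False)
    for line in result.stdout.splitlines():
        if line.startswith("Title:"):
            return line[len("Title:"):].strip()
    return ""


def choose_title(url, embedded_title, overrides=TITLE_OVERRIDES):
    """Prefer a title keyed by URL, then the embedded one, then the URL."""
    return overrides.get(url) or embedded_title or url


def build_html(title, body_text):
    """Wrap the extracted text in a small HTML page with an escaped title."""
    return (
        "<html><head><title>{}</title></head>\n"
        "<body><pre>\n{}\n</pre></body></html>\n"
    ).format(html.escape(title), html.escape(body_text))


def main(argv):
    # Argument order as described above: temp file, MIME type, full URL.
    tmp_file, mime_type, url = argv[1:4]
    text = subprocess.run(["pdftotext", tmp_file, "-"],
                          capture_output=True, text=True, check=False).stdout
    title = choose_title(url, pdfinfo_title(tmp_file))
    sys.stdout.write(build_html(title, text))


# Only run the conversion when called with exactly the three arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

Because the title is chosen from the URL rather than the temporary file name, the replacement title ends up in the generated HTML and is indexed like any other title, so no changes to the databases are needed afterwards.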

