Hi Gilles,

I use conv_doc.pl as the external converter, which handles PDF, Word, RTF and Excel, and
when I tested it, it of course only passed the temporary file to pdfinfo / pdftotext.


I did some small tests with my own script, which still calls pdfinfo / pdftotext to generate
an HTML document, but it can now use an Oracle db as a fallback for titles and/or meta info.


Thanks !!

Wim

Gilles Detillieux wrote:

According to Wim Kosten:


Using the external converter switch (application/pdf->text/html) I index PDF files, which works perfectly.
However, the PDF files (about 8000) all have undescriptive titles like "Word doc 2" instead of "Proposal for the yearly members meeting".


In order to properly show the titles I use a small script which takes the ASCII dump (-t switch), rewrites it with correct titles,
and loads it into the DB2 database with htload. After that I run htmerge and htfuzzy and it all seems to work ...


But for some weird reason I can't search for words in those patched titles. If I search for "proposal", my example patched
document is not found.


I started digging deeper and (just as a test case) used a small script which returns "nanananana" instead of the PDF title returned by pdfinfo,
and guess what: "nanananana" is found. So that seemed to be the solution, but when htdig calls the converter it gives the
tmp (downloaded PDF) name instead of the actual document name. This makes it pretty hard to set the correct title if I only know the tmp name
and not the PDF file or the location (URL) of the document.


Is it a (good) idea to do rewrites straight into the DB2 database, and do I have to reindex etc., or are there better options? Is it an idea to use the
external parser instead of the external converter and write my own htdig records?



I would see making changes to the databases after the fact as an absolute last recourse. I'm a big believer in external converters, rather than external parsers, because they still give you pretty much all the control you need but without a lot of the added complexities. External parsers are very hard to get right, but it's relatively easy to generate HTML code that will get parsed the way you want.

If you have a technique for getting the titles you want for your PDF files,
then it should be fairly easy to fit this into the pdf2html.pl script.
External converter (or parser) scripts get the full URL to the document
as their third argument, after the temporary file name and the MIME type.
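
As a rough illustration of that approach, a converter could key its title lookup on the third argument. The sketch below is written in Python rather than Perl and is not the pdf2html.pl script itself. The TITLE_OVERRIDES table, the example URL, and the function names are all hypothetical; the table stands in for whatever source of correct titles is available (a flat file or a database, for instance). It assumes pdfinfo and pdftotext are on the PATH.

```python
#!/usr/bin/env python3
"""Sketch of an external converter that overrides PDF titles by URL."""
import html
import subprocess
import sys

# Hypothetical lookup table keyed on document URL; a real setup might
# query a database here instead.
TITLE_OVERRIDES = {
    "http://www.example.com/docs/proposal.pdf":
        "Proposal for the yearly members meeting",
}


def pick_title(url, overrides, fallback):
    """Prefer a title keyed on the URL, then the fallback, then the URL."""
    return overrides.get(url) or fallback or url


def build_html(title, body_text):
    """Wrap the extracted text in minimal HTML with the chosen title."""
    return (
        "<html><head><title>{}</title></head>\n"
        "<body><pre>\n{}\n</pre></body></html>\n"
    ).format(html.escape(title), html.escape(body_text))


def pdfinfo_title(pdf_path):
    """Return the Title field reported by pdfinfo, or '' if there is none."""
    out = subprocess.run(["pdfinfo", pdf_path],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("Title:"):
            return line[len("Title:"):].strip()
    return ""


def main(argv):
    # Argument order as described above: temporary file, MIME type, URL.
    tmp_file, _mime_type, url = argv[1:4]
    # "-" asks pdftotext to write the extracted text to standard output.
    text = subprocess.run(["pdftotext", tmp_file, "-"],
                          capture_output=True, text=True).stdout
    title = pick_title(url, TITLE_OVERRIDES, pdfinfo_title(tmp_file))
    sys.stdout.write(build_html(title, text))


# Only run as a converter when all three arguments were supplied.
if __name__ == "__main__" and len(sys.argv) >= 4:
    main(sys.argv)
```

Because the title ends up in the generated HTML, it goes through the normal parsing step along with the rest of the document, with no changes to the databases after the fact.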







