[htdig] Problems setting up PDF support

Alain DESEINE Mon, 27 Jan 2003 08:59:52 -0800

Hi,

I am having problems setting up PDF support for htdig.

Here is some information about my setup:

Linux version: kernel 2.4.10-4GB
htdig version: 3.1.6 (compiled from source)
install dir: /opt/www/htdig
French support from Dider Lebrun installed

htdig works well with HTML files.

I've installed xpdf.
I've installed pdf2html, which works well from the Linux prompt.
I've installed doc2html, which works well from the Linux prompt.
I've modified the htdig.conf file to call the doc2html converter for application/pdf files.

When I run rundig, the PDF files are not inserted into the database, and I get these messages in the log:

Deleted, no excerpt: 44/http://www.cabinfo.com/documents/pdf/adsl.pdf
Deleted, no excerpt: 45/http://www.cabinfo.com/documents/pdf/gprs.pdf
Deleted, no excerpt: 43/http://www.cabinfo.com/documents/pdf/wap.pdf

I ran rundig with the -vvvv flag and got something like this in the log:

Header line: HTTP/1.1 200 OK
Header line: Date: Mon, 27 Jan 2003 15:17:11 GMT
Header line: Server: Apache
Header line: Last-Modified: Mon, 20 Jan 2003 17:22:48 GMT
Converted Mon, 20 Jan 2003 17:22:48 GMT to Mon, 20 Jan 2003 17:22:48
Header line: ETag: "13b7ac-50cf6-3e2c3068"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 330998
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
...
...
Read 8192 from document
Read 3318 from document
Read a total of 330998 bytes
PDF::setContents(330998 bytes)
PDF::parse(http://www.cabinfo.com/documents/pdf/wap.pdf)
PDF::parseNonTextLine: title is "��"

title: ��
PDF::parseNonTextLine: total pages is 49
PDF::parseNonTextLine: start page 1
PDF::parseNonTextLine: begin text block
PDF::parseTextLine("70.5 40.5 TD") cmd=TD
PDF::parseTextLine("0 0 0 rg") cmd=rg
PDF::parseTextLine("/N6 9.75 Tf") cmd=Tf
PDF::parseTextLine("0.08999 Tc") cmd=Tc
PDF::parseTextLine("0 Tw") cmd=Tw
PDF::parseTextLine("(\251)Tj ") cmd=
PDF::parseTextLine("7.5 0 TD") cmd=TD
PDF::parseTextLine("/N8 9.75 Tf") cmd=Tf
PDF::parseTextLine("0.11048 Tc") cmd=Tc
PDF::parseTextLine("0.17898 Tw") cmd=Tw
PDF::parseTextLine("( Alain DESEINE, 1999)Tj ") cmd=
PDF::parseTextLine("375.75 693 TD") cmd=TD
PDF::parseTextLine("/N10 14.25 Tf") cmd=Tf
PDF::parseTextLine("-0.33178 Tc") cmd=Tc
...

and so on for the entire content of the PDF ...

This output suggests to me that the internal parser is being used to parse the PDF, and not the doc2html.pl script, but I'm not sure. I've browsed and searched the list archive but didn't find anything like this, so if you can help me ...
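
One way to check this from the log, as a rough sketch: the Python below scans verbose rundig output for the two line shapes quoted above ("PDF::parse(<url>)" and "Deleted, no excerpt: <id>/<url>") and lists the URLs found for each. The function name summarize_log and the sample list are made up for illustration, and the idea that a "PDF::parse(" line means the built-in parser handled the document is only my reading of the log, not something I have confirmed in the htdig documentation.

```python
import re

# Line shapes copied from the log excerpts in this message.
PARSE_RE = re.compile(r"^PDF::parse\((?P<url>[^)]+)\)")
DELETED_RE = re.compile(r"^Deleted, no excerpt: \d+/(?P<url>\S+)")


def summarize_log(lines):
    """Return (urls seen in PDF::parse lines, urls deleted for lack of an excerpt)."""
    internal, deleted = [], []
    for line in lines:
        line = line.strip()
        match = PARSE_RE.match(line)
        if match:
            internal.append(match.group("url"))
            continue
        match = DELETED_RE.match(line)
        if match:
            deleted.append(match.group("url"))
    return internal, deleted


# Hypothetical sample built from lines that appear in the log above.
sample = [
    "PDF::setContents(330998 bytes)",
    "PDF::parse(http://www.cabinfo.com/documents/pdf/wap.pdf)",
    'PDF::parseTextLine("0 Tw") cmd=Tw',
    "Deleted, no excerpt: 43/http://www.cabinfo.com/documents/pdf/wap.pdf",
]

internal, deleted = summarize_log(sample)
for url in deleted:
    # Assumption: a PDF::parse line implies the built-in parser was used.
    handler = "built-in PDF parser" if url in internal else "unknown"
    print(url, "->", handler)
```

Run against the full -vvvv log instead of the sample, this would show whether every deleted PDF also went through PDF::parse.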

Here is the htdig.conf file:

database_dir: /home/info/www/htdig/db
start_url: http://www.cxabinfo.com/ \
http://www.cxabinfo.com/index2.html
limit_urls_to: ${start_url}
exclude_urls: /cgi-bin/ .cgi
#bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
# .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
#maintainer: [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size: 2000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
# template_map: cxabinfo cxabinfo /opt/www/htdig/common/htdig_template.html
template_map: cxabinfo cxabinfo ${common_dir}/htdig_template.html
# template_name: cxabinfo
search_results_header: /opt/www/htdig/common/htdig_header.html
search_results_footer:
nothing_found_file: /opt/www/htdig/common/htdig_nomatch.html
syntax_error_file: /opt/www/htdig/common/htdig_syntaxerror.html
next_page_text: <img src="/htdig/buttonr.gif" border="0" align="middle" width="30" height="30" alt="next">
no_next_page_text:
prev_page_text: <img src="/htdig/buttonl.gif" border="0" align="middle" width="30" height="30" alt="prev">
no_prev_page_text:
external_parsers: application/rtf->text/html /opt/www/htdig/bin/doc2html.pl \
text/rtf->text/html /opt/www/htdig/bin/doc2html.pl \
application/pdf->text/html /opt/www/htdig/bin/doc2html.pl \
application/postscript->text/html /opt/www/htdig/bin/doc2html.pl \
application/msword->text/html /opt/www/htdig/bin/doc2html.pl \
application/wordperfect5.1->text/html /opt/www/htdig/bin/doc2html.pl \
application/msexcel->text/html /opt/www/htdig/bin/doc2html.pl \
application/vnd.ms-excel->text/html /opt/www/htdig/bin/doc2html.pl \
application/vnd.ms-powerpoint->text/html /opt/www/htdig/bin/doc2html.pl
application/x-shockwave-flash->text/html /opt/www/htdig/bin/doc2html.pl \
application/x-shockwave-flash2-preview->text/html /opt/www/htdig/bin/doc2html.pl

# local variables:
# ----- start of French localization -----
locale: fr_FR
valid_punctuation: ._/!#$%^&

# Search options names:
method_names: and 'Tous les mots' or 'Un des mots' boolean Booléen
sort_names: score Score time Date title Titre revscore 'Score inverse' revtime 'Date inverse' revtitle 'Titre inverse'

# language files:
endings_dictionary: ${common_dir}/francais.0
endings_affix_file: ${common_dir}/francais.aff
bad_word_list: ${common_dir}/bad_words.fr
synonym_dictionary: ${common_dir}/synonyms.fr
# ----- end of French localization -----

# mode: text
# eval: (if (eq window-system 'x) (progn (setq font-lock-keywords (list '("^#.*" . font-lock-keyword-face) '("^[a-zA-Z][^ :]+" . font-lock-function-name-face) '("[+$]*:" . font-lock-comment-face) )) (font-lock-mode)))
# end:
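
To see which converter entries actually end up inside the external_parsers value, a small sketch like the following could help. It is not htdig's own parser: it assumes that a trailing backslash is the only line-continuation mechanism, that lines starting with "#" are comments, and it ignores ${variable} expansion. The names read_attributes, parser_pairs and SAMPLE are hypothetical, and SAMPLE is a shortened excerpt of the file above. Under those assumptions, a line that follows an entry without a trailing backslash is no longer part of the attribute and is reported as an orphan.

```python
import re

# Assumption: attribute names look like identifiers (letters, digits, underscore).
NAME_RE = re.compile(r"[A-Za-z_]\w*")


def read_attributes(text):
    """Join backslash-continued lines; return ({name: value}, orphan_lines)."""
    attrs, orphans, pending = {}, [], ""
    # The extra empty line flushes a continuation left open at end of input.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if not pending and (not line or line.startswith("#")):
            continue  # blank line or comment outside a continuation
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        logical = (pending + line).strip()
        pending = ""
        name, sep, value = logical.partition(":")
        if sep and NAME_RE.fullmatch(name.strip()):
            attrs[name.strip()] = value.strip()
        else:
            orphans.append(logical)  # not recognised as "name: value"
    return attrs, orphans


def parser_pairs(value):
    """Split an external_parsers value into (type spec, program) pairs."""
    tokens = value.split()
    return list(zip(tokens[0::2], tokens[1::2]))


# Shortened excerpt of the configuration shown above.
SAMPLE = r"""
external_parsers: application/pdf->text/html /opt/www/htdig/bin/doc2html.pl \
application/vnd.ms-powerpoint->text/html /opt/www/htdig/bin/doc2html.pl
application/x-shockwave-flash->text/html /opt/www/htdig/bin/doc2html.pl
"""

attrs, orphans = read_attributes(SAMPLE)
for spec, program in parser_pairs(attrs["external_parsers"]):
    print("entry: ", spec, program)
for line in orphans:
    print("orphan:", line)
```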

Many thanks for any responses.

Alain DESEINE.
