[htdig] pdf indexing problems

Jon Sorensen Thu, 16 Dec 2004 17:52:39 -0800

I posted a question recently about indexing PDFs with doc2html, but I can't figure out what the problem is. I believe that the config is correct, but there may be a problem there. When I dig a number of PDFs, the files are read but the words indexed are not correct:

word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]

Does anyone know what this indicates?

From looking at the message archives, it seems that others have had this problem, but there weren't any solutions posted in the messages.

My config and output follow. Thanks in advance for any help; I appreciate it.

in doc2html.pl:

$ENV{DOC2HTML_LOG} = '/www/htdig/bin/doc2html/DOC2HTML_LOG';

my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';

in pdf2html.pl:

my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";
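
One way to rule out a simple path problem is to confirm that each helper named in the settings above is a real, executable file. The following is a minimal sketch, assuming a POSIX shell; `check_tool` is a hypothetical helper name and is not part of doc2html.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of doc2html):
# it reports whether the given path is a regular, executable file.
check_tool() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1"
        return 1
    fi
}

# Paths copied from the doc2html.pl and pdf2html.pl settings quoted above.
for tool in /usr/bin/pdftotext /usr/bin/pdfinfo \
            /www/htdig/bin/doc2html/pdf2html.pl \
            /www/htdig/bin/doc2html/doc2html.pl; do
    check_tool "$tool" || true   # keep going so every path is reported
done
```

Any line reporting "missing or not executable" would point to a path setting to revisit before looking further at the ht://Dig config.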

rundig output:

Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 907 from document
Read a total of 361355 bytes
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
size = 361355
pick: www.flexco.com, # servers = 1
80:358:0:http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: Retrieval command for http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: GET /prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf HTTP/1.0
Cookie: authorized=true
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: www.flexco.com

config file:

database_dir: /www/htdig/db_flexco_new

start_url: http://www.flexco.com/index.cfm

limit_urls_to: http://www.flexco.com/

exclude_urls: /cgi-bin/ .cgi /prod_info/safety.cfm /landing.cfm

bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css #.pdf

maintainer: [EMAIL PROTECTED]

max_head_length: 10000

max_doc_size: 5000000

no_excerpt_show_top: true

search_algorithm: exact:1 synonyms:0.5 endings:0.1

template_map: Long long ${common_dir}/flexco/long.html \
Short short ${common_dir}/flexco/short.html
template_name: long
search_results_header: ${common_dir}/flexco/header.html
search_results_footer: ${common_dir}/flexco/footer.html
#search_results_wrapper: ${common_dir}/flexco/wrapper.html
nothing_found_file: ${common_dir}/flexco/nomatch.html
syntax_error_file: ${common_dir}/flexco/syntax.html

cookie: authorized=true

maximum_pages: 20

external_parsers: application/pdf->text/html /www/htdig/bin/doc2html/doc2html.pl
wordlist_compress: false
wordlist_compress_zlib: false

minimum_word_length: 2

bad_word_list: ${common_dir}/badwords.txt
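
Since the `external_parsers` line above hands PDFs to doc2html.pl, it may help to run that converter by hand on a saved copy of one PDF and look at what it prints. The sketch below assumes the converter is called with the input file, the MIME type and the URL, in that order; that order is an assumption to check against the ht://Dig `external_parsers` documentation. `run_converter` is a hypothetical helper name, and `/tmp/sample.pdf` is a placeholder path.

```shell
#!/bin/sh
# run_converter is a hypothetical helper that calls an external converter by hand.
# Assumed argument order: input file, MIME type, URL. Verify this against the
# ht://Dig external_parsers documentation before relying on it.
run_converter() {
    parser=$1
    file=$2
    url=$3
    "$parser" "$file" application/pdf "$url"
}

# Example call (the PDF path is a placeholder for a locally saved copy):
# run_converter /www/htdig/bin/doc2html/doc2html.pl /tmp/sample.pdf \
#     http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf | head -20
```

If the output is readable HTML, the converter itself is probably working and the problem is more likely in how htdig invokes it; if it is empty or garbage, the converter or its helpers would be the place to look.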
