I recently posted a question about indexing PDFs with doc2html, but I still can't figure out what the problem is. I believe the config is correct, but there may be a problem there. When I dig a number of PDFs, the files are read, but the words indexed are not correct (see the "word:" lines in the rundig output below). Does anyone know what this indicates?

From the message archives it seems that others have had this problem, but no solutions were posted in those messages.

My config and output follow. Thanks in advance for any help; I appreciate it.

In doc2html.pl:

$ENV{DOC2HTML_LOG} = '/www/htdig/bin/doc2html/DOC2HTML_LOG';
my $PDF2HTML = '/www/htdig/bin/doc2html/pdf2html.pl';

In pdf2html.pl:

my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";

rundig output:

Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 907 from document
Read a total of 361355 bytes
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
size = 361355
pick: www.flexco.com, # servers = 1
80:358:0:http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf:
Retrieval command for http://www.flexco.com/prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf: GET /prod_info/installation_instruct/AR301_Alligator_Rivet_Gauge.pdf HTTP/1.0
Cookie: authorized=true
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Host: www.flexco.com

config file:

database_dir: /www/htdig/db_flexco_new
start_url: http://www.flexco.com/index.cfm
limit_urls_to: http://www.flexco.com/
exclude_urls: /cgi-bin/ .cgi /prod_info/safety.cfm /landing.cfm
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
    .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css #.pdf
maintainer: [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size: 5000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
template_map: Long long ${common_dir}/flexco/long.html \
    Short short ${common_dir}/flexco/short.html
template_name: long
search_results_header: ${common_dir}/flexco/header.html
search_results_footer: ${common_dir}/flexco/footer.html
#search_results_wrapper: ${common_dir}/flexco/wrapper.html
nothing_found_file: ${common_dir}/flexco/nomatch.html
syntax_error_file: ${common_dir}/flexco/syntax.html
cookie: authorized=true
maximum_pages: 20
external_parsers: application/pdf->text/html /www/htdig/bin/doc2html/doc2html.pl
wordlist_compress: false
wordlist_compress_zlib: false
minimum_word_length: 2
bad_word_list: ${common_dir}/badwords.txt
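
One way to narrow a problem like this down, offered only as a rough sketch: before suspecting the htdig config itself, confirm that every converter path quoted above exists and is executable on the indexing host. The check_tool helper below is a hypothetical name, not part of htdig or doc2html; the paths are the ones copied from this message.

```shell
#!/bin/sh
# Sketch only: check_tool is a hypothetical helper, not part of htdig or doc2html.
# It reports whether a path is a regular file that the current user can execute.
check_tool() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1"
        return 1
    fi
}

# Paths copied from the doc2html.pl, pdf2html.pl and config excerpts in this post.
for tool in \
    /usr/bin/pdftotext \
    /usr/bin/pdfinfo \
    /www/htdig/bin/doc2html/pdf2html.pl \
    /www/htdig/bin/doc2html/doc2html.pl
do
    check_tool "$tool" || true   # keep going so every path gets reported
done
```

If all four paths report ok, a reasonable next step may be to run pdftotext by hand on a saved copy of one of the PDFs and see whether it produces readable text, since unreadable converter output would be one possible explanation for bad "word:" entries.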