Firstly, you don't have the latest version of doc2html. However, what you
have should work.
Secondly, I don't believe that
    if (($MIME_type =~ m/$set->{'mime'}/i) and
        ($Magic =~ m/$set->{'magic'}/s)) { # found the method to use
is in error. I am sure that the changes you made are ill-advised, and are
very likely a prime cause of your problems.
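To illustrate the point about the hash key, here is a minimal sketch. It
assumes a method entry that carries only the keys 'mime' and 'magic' seen in
the test quoted above; the values in the entry and the sample inputs are made
up for illustration and are not taken from doc2html.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical method entry: only the key names 'mime' and 'magic' come
# from the quoted test. The values are illustrative assumptions.
my $set = {
    'mime'  => 'application/pdf',
    'magic' => '%PDF-',
};

# Sample inputs, also assumptions for this sketch.
my $MIME_type = 'application/pdf';
my $Magic     = '%PDF-1.3';

# The unmodified test: both keys exist, so both patterns are meaningful.
my $original = (($MIME_type =~ m/$set->{'mime'}/i) and
                ($Magic =~ m/$set->{'magic'}/s)) ? 1 : 0;

# The edited test looks up 'mime_type', which this entry does not have.
my $has_key = exists($set->{'mime_type'}) ? 1 : 0;

print "original test matches: $original\n";
print "entry has 'mime_type': $has_key\n";
```

If the key really is absent, $set->{'mime_type'} is undefined and
interpolates as an empty pattern. Per perlop, an empty pattern either matches
anything or reuses the last successfully matched pattern, so the MIME check
would no longer do what the original line intended.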
Thirdly, if you are indexing "Read 8192 from document", which is part of
the diagnostic output from htdig -vvv itself, then you must be doing
something very odd indeed.
Fourthly, try altering the first line of doc2html.pl to
    #!/usr/bin/perl
and in your configuration file have:
    external_parsers: application/pdf->text/html /mypath/doc2html/doc2html.pl
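One way to check the parser independently of htdig is to run it by hand with
the four arguments htdig passes to an external parser: input file, content
type, URL and configuration file. A minimal sketch, assuming a local copy of
the PDF at the hypothetical path /tmp/x.pdf (the other paths follow this
thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Command line as htdig would build it for an external parser.
my @cmd = (
    '/mypath/doc2html/doc2html.pl',  # parser, as named in external_parsers
    '/tmp/x.pdf',                    # hypothetical local copy of the PDF
    'application/pdf',               # content type
    'http://myurl/x.pdf',            # URL of the document
    '/mypath/htdig.conf',            # configuration file
);

if (-x $cmd[0] and -r $cmd[1]) {
    # List form of system() bypasses the shell, so no quoting is needed.
    my $status = system(@cmd);
    if ($status == -1) {
        print "could not start parser: $!\n";
    } else {
        print "parser exit status: ", $status >> 8, "\n";
    }
} else {
    print "not run (parser or input file missing): @cmd\n";
}
```

If HTML appears on standard output, that suggests the parser itself works and
the problem lies in how htdig invokes it.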
--
David Adams
Computing Services
Southampton University
----- Original Message -----
From: "David Oetiker" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 10, 2002 9:52 AM
Subject: Re: [htdig] doc2html.pl for 3.1.6 instead 3.1.4?
Quoting David Adams <[EMAIL PROTECTED]>:
> ... I am using it with version 3.1.6, and I have heard of no
> version-dependent problems.
OK, I will explain my problems.
I've got two versions of the htdig binaries. I used 3.1.5 for older
projects, but never did PDF indexing with it. Now I thought I would take
3.1.6 and experiment with the PDF indexing feature, and that's where the
problems started:
In the current configuration I use the same files for both binary versions
and get different results while indexing: 3.1.5 went through the PDF files
but 3.1.6 didn't (see configuration and -vvvvv console output below).
Afterwards I also played with some additional debug output and got the
impression that the 3.1.6 version tries to index the "Read 8192 from
document" output (is this from the xpdf package?) instead of the parsed
document contents.
Well, here is the (rather large) data:
------------------- Configuration / Path Information -------------------
in /mypath/htdig.conf I set:
database_dir: /mypath/db
start_url: http://myurl/x.pdf
locale: en_US
limit_urls_to: http://myurl/
exclude_urls: ""
maintainer: [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size: 1000000
search_algorithm: exact:1 substring:1 synonyms:0.5 endings:0.1
template_map: Raw raw /mypath/raw.html
template_name: raw
matches_per_page: 1000
valid_extensions: .html .htm .shtml .pdf .doc .swf
translate_amp: true
external_parsers: application/pdf->text/html "/usr/bin/perl /mypath/doc2html/doc2html.pl"
in /mypath/doc2html/doc2html.pl I set:
my $PDF2HTML = '/mypath/doc2html/pdf2html.pl';
and on lines 403 and 439 I corrected:
if (($MIME_type =~ m/$set->{'mime'}/i) and
($Magic =~ m/$set->{'magic'}/s)) { # found the method to use
to:
if (($MIME_type =~ m/$set->{'mime_type'}/i) and
($Magic =~ m/$set->{'magic'}/s)) { # found the method to use
I've downloaded Xpdf 1.01, so in /mypath/doc2html/pdf2html.pl I set:
my $PDFTOTEXT = "/mypath/xpdf-1.01-linux/pdftotext";
my $PDFINFO = "/mypath/xpdf-1.01-linux/pdfinfo";
--------------------------- console outputs ----------------------------
(I only tried to index a single PDF file for this test, and I deleted the
db files before every run of htdig.)
-->Using /my_3.1.6_binaries/htdig -vvvvv -c /mypath/htdig.conf -s I get:
...
0:0:0:http://myurl/x.pdf: Retrieval command for http://myurl/x.pdf: GET http://myurl/x.pdf HTTP/1.0
User-Agent: htdig/3.1.6 (...)
Host: myurl
Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 10 Sep 2002 08:31:24 GMT
Header line: Server: Apache/1.3.1 (Unix)
Header line: Content-Disposition: filename=x.pdf; size=82151
Header line: Generator: websh 2.1 build 2 (c) Netcetera AG
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 231 from document
Read a total of 82151 bytes
size = 82151
pick: satdevl, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: satdevl:8224 1 document
-->THEN Using /my_3.1.6_binaries/htmerge -vvvvv -c ./htdig.conf -s I get:
htmerge: Sorting...
DB2 problem...: missing or empty key value specified
htmerge: Total word count: 0
Deleted, no excerpt: 0/http://myurl/x.pdf
htmerge: Total documents: 0
htmerge: Total size of documents (in K): 0
----------------
-->Using /my_3.1.5_binaries/htdig -vvvvv -c /mypath/htdig.conf -s I get:
0:0:0:http://myurl/x.pdf: Retrieval command for http://myurl/x.pdf: GET /x.pdf HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: satdevl
Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 10 Sep 2002 08:33:49 GMT
Header line: Server: Apache/1.3.1 (Unix)
Header line: Content-Disposition: filename=x.pdf; size=82151
Header line: Generator: websh 2.1 build 2 (c) Netcetera AG
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 231 from document
Read a total of 82151 bytes
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "iso_8859_1",
LANG = "en_US"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
!! perl: warning: Setting locale failed.
!! perl: warning: Please check that your locale settings:
!! LANGUAGE = (unset),
!! LC_ALL = (unset),
!! LC_CTYPE = "iso_8859_1",
!! LANG = "en_US"
!! are supported and installed on your system.
!! perl: warning: Falling back to the standard locale ("C").
Tag: HTML>, matched -1
Tag: HEAD>, matched -1
Tag: TITLE>, matched 0
word: Microsoft@6
word: Word@9
word: doc2pdf-141-tmp-28123.htm@11
....
word: Technologie@979
word: 7.9%@983
Tag: br>, matched -1
word: Telekommunikation@986
word: 8.7%@992
Tag: /BODY>, matched -1
Tag: /HTML>, matched -1
head: ... many words ...
size = 82151
pick: satdevl, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: satdevl:8224 1 document
-->THEN Using /my_3.1.6_binaries/htmerge -vvvvv -c ./htdig.conf -s
htmerge: Sorting...
htmerge: Merging...
htmerge: 100:konsequent
htmerge: 200:übertragen
htmerge: Total word count: 203
0/http://myurl/x.pdf
htmerge: Total documents: 1
htmerge: Total doc db size (in K): 80
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to
<[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html