Firstly, you don't have the latest version of doc2html.  However, what you
have should work.

Secondly,  I don't believe that

    if (($MIME_type =~ m/$set->{'mime'}/i) and
        ($Magic =~ m/$set->{'magic'}/s))     { # found the method to use

is in error.  I am sure that the changes you made are ill-advised, and are
very likely a prime cause of your problems.
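
To illustrate the point, here is a minimal sketch of how a test of that
form selects a conversion method. It assumes a table of entries keyed by
'mime' and 'magic'. The table contents and the names @parsers and
find_method are invented for illustration and are not copied from
doc2html.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser table; the real entries in doc2html.pl may differ.
my @parsers = (
    { mime => 'application/msword', magic => '^\xd0\xcf\x11\xe0', name => 'word' },
    { mime => 'application/pdf',    magic => '^%PDF-',            name => 'pdf'  },
);

# Return the name of the first entry whose MIME pattern and magic
# pattern both match, or undef if no entry matches.
sub find_method {
    my ($mime_type, $magic) = @_;
    foreach my $set (@parsers) {
        if (($mime_type =~ m/$set->{'mime'}/i) and
            ($magic =~ m/$set->{'magic'}/s)) {
            return $set->{'name'};
        }
    }
    return;
}

my $method = find_method('application/pdf', "%PDF-1.3\n");
print defined $method ? $method : 'none', "\n";

# This table has no 'mime_type' key, so a lookup with that key gives undef.
my $has_key = exists $parsers[1]{'mime_type'} ? 'yes' : 'no';
print "mime_type key present: $has_key\n";
```

If the table in use has no 'mime_type' key, as in this sketch, the lookup
returns undef, which interpolates as an empty pattern, so the MIME type is
no longer checked as intended.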

Thirdly,  if you are indexing "Read 8192 from document", which is part of
the diagnostic output from htdig -vvv itself, then you must be doing
something very odd indeed.

Fourthly, try altering the first line of doc2html.pl to

#!/usr/bin/perl

and in your configuration file have:

external_parsers:        application/pdf->text/html /mypath/doc2html/doc2html.pl
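
For htdig to run the script directly in this way, the file has to be
executable and its first line has to name an interpreter that exists. A
minimal sketch of such a check follows. The helper name check_script is
hypothetical, and the default path is simply the one used in this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report problems that would stop a script from being run directly,
# that is, without naming the interpreter on the command line.
sub check_script {
    my ($path) = @_;
    my @problems;
    open(my $fh, '<', $path) or return ("cannot open $path: $!");
    my $first = <$fh>;
    close($fh);
    if (defined $first and $first =~ m/^#!\s*(\S+)/) {
        push @problems, "interpreter $1 is not executable" unless -x $1;
    } else {
        push @problems, "first line is not a #! line";
    }
    push @problems, "$path is not executable" unless -x $path;
    return @problems;
}

# Default path taken from this message; pass another as the first argument.
my $script = @ARGV ? $ARGV[0] : '/mypath/doc2html/doc2html.pl';
my @problems = check_script($script);
print @problems ? join("\n", @problems) . "\n" : "$script looks runnable\n";
```

An empty list of problems only shows that the script can be started; it
says nothing about whether the conversion itself succeeds.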


--
David Adams
Computing Services
Southampton University


----- Original Message -----
From: "David Oetiker" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 10, 2002 9:52 AM
Subject: Re: [htdig] doc2html.pl for 3.1.6 instead 3.1.4?



Quoting David Adams <[EMAIL PROTECTED]>:
> ... I am using it with version 3.1.6, and I have heard of no
> version-dependent problems.

OK, I will explain my problems:

I've got two versions of the htdig binaries. I used 3.1.5 for older
projects, but never did PDF indexing with it. Now I thought I would take
3.1.6 and experiment with the PDF indexing feature, and that's where the
problems started. In the current configuration I use the same files for
both binary versions and get different results while indexing: 3.1.5 went
through the PDF files but 3.1.6 didn't (see configuration and -vvvvv
console output below).

Afterwards I also played with some additional debug output and got the
impression that version 3.1.6 tries to index the "Read 8192 from document"
output (is this from the xpdf package?) instead of the parsed document
contents.

Well, here is the (rather large) data:

------------------------------- Configuration / Path Information --------------

in /mypath/htdig.conf I set:
database_dir: /mypath/db
start_url:              http://myurl/x.pdf
locale:                 en_US
limit_urls_to:          http://myurl/
exclude_urls:           ""
maintainer: [EMAIL PROTECTED]
max_head_length: 10000
max_doc_size: 1000000
search_algorithm: exact:1 substring:1 synonyms:0.5 endings:0.1
template_map:           Raw raw /mypath/raw.html
template_name:          raw
matches_per_page:        1000
valid_extensions:        .html .htm .shtml .pdf .doc .swf
translate_amp:           true
external_parsers:        application/pdf->text/html "/usr/bin/perl /mypath/doc2html/doc2html.pl"



in /mypath/doc2html/doc2html.pl I set:
my $PDF2HTML = '/mypath/doc2html/pdf2html.pl';

and on lines 403 and 439 I corrected:
    if (($MIME_type =~ m/$set->{'mime'}/i) and
        ($Magic =~ m/$set->{'magic'}/s))     { # found the method to use
to:
    if (($MIME_type =~ m/$set->{'mime_type'}/i) and
        ($Magic =~ m/$set->{'magic'}/s))     { # found the method to use

I've downloaded XPDF 1.01, so in /mypath/doc2html/pdf2html.pl I set:
my $PDFTOTEXT = "/mypath/xpdf-1.01-linux/pdftotext";
my $PDFINFO = "/mypath/xpdf-1.01-linux/pdfinfo";


-------------------------- console outputs ----------------------------------
(I only tried to index a single PDF file for this test, and I deleted the
db files before every run of htdig)

-->Using /my_3.1.6_binaries/htdig -vvvvv -c /mypath/htdig.conf -s I get:
...
0:0:0:http://myurl/x.pdf: Retrieval command for http://myurl/x.pdf: GET http://myurl/x.pdf HTTP/1.0
User-Agent: htdig/3.1.6 (...)
Host: myurl

Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 10 Sep 2002 08:31:24 GMT
Header line: Server: Apache/1.3.1 (Unix)
Header line: Content-Disposition: filename=x.pdf; size=82151
Header line: Generator: websh 2.1 build 2 (c) Netcetera AG
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 231 from document
Read a total of 82151 bytes
 size = 82151
pick: satdevl, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig:     satdevl:8224 1 document

-->THEN Using /my_3.1.6_binaries/htmerge -vvvvv -c ./htdig.conf -s I get:
htmerge: Sorting...
DB2 problem...: missing or empty key value specified

htmerge: Total word count: 0
Deleted, no excerpt: 0/http://myurl/x.pdf

htmerge: Total documents: 0
htmerge: Total size of documents (in K): 0

----------------
-->Using /my_3.1.5_binaries/htdig -vvvvv -c /mypath/htdig.conf -s I get:
0:0:0:http://myurl/x.pdf: Retrieval command for http://myurl/x.pdf: GET /x.pdf HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: satdevl

Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 10 Sep 2002 08:33:49 GMT
Header line: Server: Apache/1.3.1 (Unix)
Header line: Content-Disposition: filename=x.pdf; size=82151
Header line: Generator: websh 2.1 build 2 (c) Netcetera AG
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 231 from document
Read a total of 82151 bytes
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_CTYPE = "iso_8859_1",
        LANG = "en_US"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
!!      perl: warning: Setting locale failed.
!!      perl: warning: Please check that your locale settings:
!!              LANGUAGE = (unset),
!!              LC_ALL = (unset),
!!              LC_CTYPE = "iso_8859_1",
!!              LANG = "en_US"
!!          are supported and installed on your system.
!!      perl: warning: Falling back to the standard locale ("C").
Tag: HTML>, matched -1
Tag: HEAD>, matched -1
Tag: TITLE>, matched 0
word: Microsoft@6
word: Word@9
word: doc2pdf-141-tmp-28123.htm@11
....
word: Technologie@979
word: 7.9%@983
Tag: br>, matched -1
word: Telekommunikation@986
word: 8.7%@992
Tag: /BODY>, matched -1
Tag: /HTML>, matched -1
head:  ... many words ...
 size = 82151
pick: satdevl, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig:     satdevl:8224 1 document

-->THEN Using /my_3.1.6_binaries/htmerge -vvvvv -c ./htdig.conf -s
htmerge: Sorting...
htmerge: Merging...
htmerge: 100:konsequent
htmerge: 200:übertragen

htmerge: Total word count: 203
0/http://myurl/x.pdf

htmerge: Total documents: 1
htmerge: Total doc db size (in K): 80



-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone?  Get a new here for FREE!
https://www.inphonic.com/r.asp?r=urceforge1&refcode1=3390
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to
<[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html



