Quoting David Adams <[EMAIL PROTECTED]>:

> ... I am using it with version 3.1.6, and I have heard of no
> version-dependant problems.

OK, I will explain my problems. I have two versions of the htdig binaries. I used 3.1.5 for older projects, but never did PDF indexing with it. Now I thought I would take 3.1.6 and experiment with the PDF indexing feature, and that is where the problems started. In the current configuration I use the same files for both binary versions and get different results while indexing: 3.1.5 went through the PDF files but 3.1.6 didn't (see configuration and -vvvvv console output below).

Afterwards I also played with some additional debug output and got the impression that the 3.1.6 version tries to index the "Read 8192 from document" output (is this from the xpdf package?) instead of the parsed document contents.

Well, here is the (rather large) data:

------------------------------- Configuration / Path Information --------------

In /mypath/htdig.conf I set:

    database_dir: /mypath/db
    start_url: http://myurl/x.pdf
    locale: en_US
    limit_urls_to: http://myurl/
    exclude_urls: ""
    maintainer: [EMAIL PROTECTED]
    max_head_length: 10000
    max_doc_size: 1000000
    search_algorithm: exact:1 substring:1 synonyms:0.5 endings:0.1
    template_map: Raw raw /mypath/raw.html
    template_name: raw
    matches_per_page: 1000
    valid_extensions: .html .htm .shtml .pdf .doc .swf
    translate_amp: true
    external_parsers: application/pdf->text/html "/usr/bin/perl /mypath/doc2html/doc2html.pl"

In /mypath/doc2html/doc2html.pl I set:

    my $PDF2HTML = '/mypath/doc2html/pdf2html.pl';

and on lines 403 and 439 I corrected:

    if (($MIME_type =~ m/$set->{'mime'}/i) and ($Magic =~ m/$set->{'magic'}/s)) { # found the method to use

to:

    if (($MIME_type =~ m/$set->{'mime_type'}/i) and ($Magic =~ m/$set->{'magic'}/s)) { # found the method to use

I've downloaded XPDF 1.01, so in /mypath/doc2html/pdf2html.pl I set:

    my $PDFTOTEXT = "/mypath/xpdf-1.01-linux/pdftotext";
    my $PDFINFO = "/mypath/xpdf-1.01-linux/pdfinfo";

-------------------------- console outputs ----------------------------------

(I only tried to index a single PDF file
for this test, and I deleted the db files before every run of htdig.)

-->Using /my_3.1.6_binaries/htdig -vvvvv -c /mypath/htdig.conf -s I get:

    ...
    0:0:0:http://myurl/x.pdf: Retrieval command for http://myurl/x.pdf: GET http://myurl/x.pdf HTTP/1.0
    User-Agent: htdig/3.1.6 (...)
    Host: myurl

    Header line: HTTP/1.1 200 OK
    Header line: Date: Tue, 10 Sep 2002 08:31:24 GMT
    Header line: Server: Apache/1.3.1 (Unix)
    Header line: Content-Disposition: filename=x.pdf; size=82151
    Header line: Generator: websh 2.1 build 2 (c) Netcetera AG
    Header line: Connection: close
    Header line: Content-Type: application/pdf
    Header line:
    returnStatus = 0
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 231 from document
    Read a total of 82151 bytes
    size = 82151
    pick: satdevl, # servers = 1
    htdig: Run complete
    htdig: 1 server seen:
    htdig: satdevl:8224 1 document

-->THEN Using /my_3.1.6_binaries/htmerge -vvvvv -c ./htdig.conf -s I get:

    htmerge: Sorting...
    DB2 problem...: missing or empty key value specified
    htmerge: Total word count: 0
    Deleted, no excerpt: 0/http://myurl/x.pdf
    htmerge: Total documents: 0
    htmerge: Total size of documents (in K): 0

----------------

-->Using /my_3.1.5_binaries/htdig -vvvvv -c /mypath/htdig.conf -s I get:

    0:0:0:http://myurl/x.pdf: Retrieval command for http://myurl/x.pdf: GET /x.pdf HTTP/1.0
    User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
    Host: satdevl

    Header line: HTTP/1.1 200 OK
    Header line: Date: Tue, 10 Sep 2002 08:33:49 GMT
    Header line: Server: Apache/1.3.1 (Unix)
    Header line: Content-Disposition: filename=x.pdf; size=82151
    Header line: Generator: websh 2.1 build 2 (c) Netcetera AG
    Header line: Connection: close
    Header line: Content-Type: application/pdf
    Header line:
    returnStatus = 0
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 231 from document
    Read a total of 82151 bytes
    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_CTYPE = "iso_8859_1",
        LANG = "en_US"
        are supported and installed on your system.
    perl: warning: Falling back to the standard locale ("C").
    !! perl: warning: Setting locale failed.
    !! perl: warning: Please check that your locale settings:
    !!     LANGUAGE = (unset),
    !!     LC_ALL = (unset),
    !!     LC_CTYPE = "iso_8859_1",
    !!     LANG = "en_US"
    !! are supported and installed on your system.
    !! perl: warning: Falling back to the standard locale ("C").
    Tag: HTML>, matched -1
    Tag: HEAD>, matched -1
    Tag: TITLE>, matched 0
    word: Microsoft@6
    word: Word@9
    word: doc2pdf-141-tmp-28123.htm@11
    ....
    word: Technologie@979
    word: 7.9%@983
    Tag: br>, matched -1
    word: Telekommunikation@986
    word: 8.7%@992
    Tag: /BODY>, matched -1
    Tag: /HTML>, matched -1
    head: ... many words ...
    size = 82151
    pick: satdevl, # servers = 1
    htdig: Run complete
    htdig: 1 server seen:
    htdig: satdevl:8224 1 document

-->THEN Using /my_3.1.6_binaries/htmerge -vvvvv -c ./htdig.conf -s

    htmerge: Sorting...
    htmerge: Merging...
    htmerge: 100:konsequent
    htmerge: 200:übertragen
    htmerge: Total word count: 203
    0/http://myurl/x.pdf
    htmerge: Total documents: 1
    htmerge: Total doc db size (in K): 80
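
In the output above, the visible difference between the two runs is that the 3.1.5 log contains "Tag:" and "word:" lines after "Read a total of ... bytes", while the 3.1.6 log has none. A rough way to check a saved -vvvvv log for that is sketched below in Python. This is not part of htdig or doc2html; the function names are made up for illustration, and the only assumption is that "word: " and "Tag: " appear in the log exactly as they do in the 3.1.5 output above.

```python
import re

# Assumption: these markers only show up once parsed HTML reaches htdig,
# as seen in the 3.1.5 output ("Tag: HTML>, matched -1", "word: Microsoft@6").
PARSED_MARKERS = re.compile(r"\b(?:word|Tag): ")

# The byte total htdig reports after fetching the document.
TOTAL_BYTES = re.compile(r"Read a total of (\d+) bytes")


def parser_output_seen(log_text):
    """Return True if the log contains any 'word: ' or 'Tag: ' entry."""
    return PARSED_MARKERS.search(log_text) is not None


def bytes_read(log_text):
    """Return the reported byte total as an int, or None if it is absent."""
    match = TOTAL_BYTES.search(log_text)
    return int(match.group(1)) if match else None
```

If the document was fetched (bytes_read gives a number) but parser_output_seen is False, the external parser apparently produced nothing that htdig indexed, which is what the 3.1.6 run looks like.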

