For the PDF error, please make sure that the PDF document in question can have text extracted from it. Download the file and run pdf2html.pl on it.


If that works, make sure that max_doc_size is larger than the PDF document (it looks like it is).
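The size check can be sketched in a few lines of Python. This is only an illustration, not part of ht://Dig: the helper names are made up, the parser handles only plain `name: value` lines (no continuation lines), and the 200000 value in the sample config is hypothetical. The 80863 figure is the Content-Length from the log quoted below.

```python
import re

def read_attribute(conf_text, name):
    """Return the last value set for `name` in htdig.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#"):  # skip comment lines
            continue
        match = re.match(r"([A-Za-z0-9_]+)\s*:\s*(.*)$", line)
        if match and match.group(1) == name:
            value = match.group(2).strip()
    return value

def fits_max_doc_size(conf_text, doc_bytes):
    """True if doc_bytes does not exceed max_doc_size.

    Returns None when max_doc_size is not set in this text
    (htdig then falls back to its built-in default).
    """
    raw = read_attribute(conf_text, "max_doc_size")
    if raw is None:
        return None
    return doc_bytes <= int(raw)

# Hypothetical excerpt from htdig.conf; the value is an assumption.
sample_conf = """
# maximum number of bytes read from one document
max_doc_size: 200000
"""

print(fits_max_doc_size(sample_conf, 80863))
```

If this reports False for a given document, htdig would read only the first max_doc_size bytes of it, which can leave a PDF truncated before it reaches the converter.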

Not all PDFs have text in them. Some PDFs are images of text. You can read them on the screen, but there are no characters/words in the PDF to extract and index.
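One rough way to tell a text PDF from an image-only PDF is to run xpdf's pdftotext on the downloaded file and count the words that come out. The sketch below assumes pdftotext is on the PATH; the function names and the 20-word threshold are arbitrary choices for illustration, not anything ht://Dig or pdf2html.pl uses.

```python
import subprocess

def looks_like_text(extracted, min_words=20):
    """Heuristic: True if the extracted output contains at least min_words words."""
    return len(extracted.split()) >= min_words

def pdf_has_text(pdf_path, min_words=20):
    """Run pdftotext on pdf_path and apply the word-count heuristic."""
    # Passing "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # latin-1 decoding never fails; exact characters do not matter for a word count.
    text = result.stdout.decode("latin-1", errors="replace")
    return looks_like_text(text, min_words)
```

After downloading the document, calling `pdf_has_text("OrientationsComInterne.pdf")` would return False for a scanned, image-only file, in which case there is nothing for htdig to index.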

As for the second error, try adding these lines to your htdig.conf file:

 wordlist_compress: false
 wordlist_compress_zlib: false

We have a mystery bug in the BDB code.

Thanks

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485



On Thu, 26 Aug 2004 [EMAIL PROTECTED] wrote:

Hi,

On an HP-UX D270 with htdig 3.1.6, I have no problem indexing our intranet
site.
Now I would like to index PDF files with xpdf.
I changed the parameters in htdig.conf and in pdf2html.pl.
When I index the site (or a small part of the site), I get this message:

DB2 problem ....PANIC : Invalid argument
/opt/www/htdig/bin/rundig[36] : 23928 Memory fault (coredump)

and later several of these:
DB2 problem... : missing or empty key value specified.

When I use -vvv, I see that a PDF file is being retrieved:

-----
pick: interligne.xxx.fr, # servers = 1
4:4:1:http://interligne.xxx.fr/5/5.1/OrientationsComInterne.pdf: Retrieval
command for http://interligne.xxx.fr/5/5.1/OrientationsComInterne.pdf:
GET /5/5.1/OrientationsComInterne.pdf HTTP/1.0
User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
Referer: http://interligne.xxx.fr/5/5.1/
Host: interligne.xxx.fr

Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 26 Aug 2004 07:51:43 GMT
Header line: Server: Apache/1.3.19 (Unix) PHP/4.0.5
Header line: Last-Modified: Tue, 15 Jun 2004 06:31:01 GMT
Converted Tue, 15 Jun 2004 06:31:01 GMT to Tue, 15 Jun 2004 06:31:01
Header line: ETag: "4741-13bdf-40ce97a5"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 80863
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 7135 from document
Read a total of 80863 bytes
size = 80863
pick: interligne.xxx.fr, # servers = 1

-----------

At some point the output changes to this, and no further documents are retrieved:
Deleted, no excerpt: 0/http://interligne.xxx.fr/5/5.1
1/http://interligne.xxx.fr/5/5.1/
Deleted, no excerpt: 5/http://interligne.xxx.fr/5/5.1/Colloque.cfm
Deleted, no excerpt: 7/http://interligne.xxx.fr/5/5.1/DG916.PDF
........


Does anyone have an idea what the problem might be? Thanks!








-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=5047&alloc_id=10808&op=click
_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
