I see both these messages frequently.  I believe the explanation is that
these are files larger than the limit set in the configuration file by the
max_doc_size attribute.  Try setting this very large, say

max_doc_size:    50000000

and see if the problem goes away.

I find choosing the value of max_doc_size difficult: set it small and too
many .PPT and .PDF files don't get indexed at all; set it large and there are
always some documents larger still, which are fetched on every run of htdig
but never indexed.  Set it too large and it provides no protection.
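
To make the trade-off concrete, here is a rough sketch (not part of htdig)
that counts how many documents would fit under a few candidate limits.  The
sizes listed are made up for illustration, and I am assuming a document is
only usable if its whole size is within max_doc_size.

```python
def coverage(sizes, limit):
    """Return (complete, truncated) counts for a candidate max_doc_size.

    Assumption: a document no larger than `limit` is fetched whole, while
    anything larger is cut off and so is useless to the converters.
    """
    complete = sum(1 for size in sizes if size <= limit)
    return complete, len(sizes) - complete


# Hypothetical document sizes in bytes; substitute figures from your own site.
sizes = [120_000, 900_000, 4_500_000, 12_000_000, 75_000_000]

for limit in (1_000_000, 10_000_000, 50_000_000):
    complete, truncated = coverage(sizes, limit)
    print(f"max_doc_size {limit}: {complete} complete, {truncated} truncated")
```

Running something like this over the real file sizes should at least show
where raising the limit stops buying much.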

I suggest a modification to the way htdig handles max_doc_size.  At
present (3.1.5 and 3.* I think) htdig fetches up to max_doc_size bytes and no
more.  I suggest that it stop fetching as soon as it establishes that the
document is larger than max_doc_size.  As the size is often given in the
HTTP header, this could prevent it fetching megabytes of .PPT and .PDF only
for the conversion utilities to fail because they are given incomplete
files.  Does that sound reasonable?
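
The check I have in mind might look something like the sketch below.  This
is only an illustration in Python, not htdig code: the function name
should_fetch_body and the dictionary of headers are my own invention, and
htdig's real retrieval code is organised differently.

```python
def should_fetch_body(headers, max_doc_size):
    """Decide from the HTTP response headers whether to fetch the body.

    Hypothetical helper, not htdig's API.  Returns False only when a valid
    Content-Length header says the document is larger than max_doc_size.
    """
    value = None
    for name, header_value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-length":
            value = header_value
            break
    if value is None:
        # Size not announced: fall back to fetching up to the limit.
        return True
    try:
        length = int(str(value).strip())
    except ValueError:
        # Malformed header: treat the size as unknown.
        return True
    if length < 0:
        return True
    return length <= max_doc_size


print(should_fetch_body({"Content-Length": "60000000"}, 50000000))
print(should_fetch_body({"content-length": "1024"}, 50000000))
```

Servers that don't send Content-Length (chunked responses, for instance)
would still need the present behaviour of reading up to the limit, so this
would only avoid the wasted fetch when the size is announced.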

--
David Adams
Computing Services
Southampton University


----- Original Message -----
From: "Hayes, Jason" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, June 13, 2001 2:41 PM
Subject: [htdig] Help with indexing powerpoint and excel files


> I am currently trying to index a large set of powerpoint and some excel
> files.  With the help of ppthtml and xlhtml, I have been able to index about
> 25% of them.  Yet, I get the following error message on most of the
> documents:
>
> pptHtml: oledecod.c:341: __OLEdecode: Assertion `i == 0' failed.
> CANNOT DO: (application/vnd.ms-powerpoint) is apparently binary  ***CORE
> DUMPED***
>
> or this error message:
>
> pptHtml: Cannot allocate memory
> CANNOT DO: (application/vnd.ms-powerpoint) is apparently binary
>
> I am using htdig version 3.2.0b3 and xlhtml version 0.2.9.7
> I know these are both classified as unstable, but I get the same results
> either way.
> xlhtml gives me the same sort of errors.
> Could this just be a memory problem?
> The files that it does index are sometimes larger than the files that it
> doesn't index.
>
> I would appreciate any help that is available.
>
> Thank you
>
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of
> unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>

