At 16:16 28/11/01 -0600, Gilles Detillieux wrote:
>According to Marcus Valentine:
>> At 12:32 23/11/01 -0600, Gilles Detillieux wrote:
>>>According to Marcus Valentine:
>>>> At 11:04 23/11/01 -0600, Gilles Detillieux wrote:
>>>>>According to Marcus Valentine:
>>>>>> On my intranet, there is an unfortunate xls file. Although the xls
>>>>>> file is only 266 kb big, converting it with xlhtml 0.3
>>>>>> at the command line results in a 37 Mb html file.
>>>>>> (Running with the -a option [aggressive html
>>>>>> optimization] reduces the file size to 23 Mb).
>>>>>> 
>>>>>> Running htdig 3.1.5 with doc2html.pl version 3 calling xlhtml 0.3
>>>>>> results in an htdig core dump when it gets to this document. 
>>>>>> Htdig runs on Linux Redhat 6.2
>...
>> Here's another back trace. This one was generated when htdig encountered a
>> file that began:
>> 
>> -0.348096
>> -0.070797
>> 0.204147
>> 0.393852
>> 0.449417
>> 
>> and then continued in a similar vein for 921595 lines (file size was 8.3M).
>> This time no external convertors were involved. There seems to be a problem
>> when htdig encounters big files. I've got max_doc_size set big (20 000 000)
>> as I've got some sizable pdfs on my system. I will exclude the files just
>> containing numbers (I missed them previously), but I still have the
>> previous problem with xlhtml.
>...
>> #0  0x400a6d21 in __kill () from /lib/libc.so.6
>> (gdb) bt
>> #0  0x400a6d21 in __kill () from /lib/libc.so.6
>> #1  0x400a6996 in raise (sig=6) at ../sysdeps/posix/raise.c:27
>> #2  0x400a80b8 in abort () at ../sysdeps/generic/abort.c:88
>> #3  0x40057e55 in __default_terminate () from /usr/lib/libstdc++-libc6.1-1.so.2
>> #4  0x40058c1a in terminate () from /usr/lib/libstdc++-libc6.1-1.so.2
>> #5  0x40058cf8 in __eh_alloc (size=36) from /usr/lib/libstdc++-libc6.1-1.so.2
>> #6  0x40058d88 in __cp_push_exception (value=0xc1d9fd0, type=0x4006af84,
>>     cleanup=0x4005b604 <bad_alloc::~bad_alloc(void)>) from /usr/lib/libstdc++-libc6.1-1.so.2
>> #7  0x4005a252 in __builtin_new (sz=40) from /usr/lib/libstdc++-libc6.1-1.so.2
>> #8  0x805a86b in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #9  0x80521db in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #10 0x804f531 in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #11 0x8050d25 in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #12 0x805099a in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #13 0x805036d in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #14 0x8054b60 in strcpy () at ../sysdeps/generic/strcpy.c:30
>> #15 0x400a09cb in __libc_start_main (main=0x80543f0 <strcpy+40380>, argc=7, argv=0xbffffb64,
>>     init=0x8049da4 <_init>, fini=0x8090eac <_fini>, rtld_fini=0x4000aea0 <_dl_fini>, stack_end=0xbffffb5c)
>>     at ../sysdeps/generic/libc-start.c:92
>> (gdb)
>
>Well, both backtraces you sent me don't seem to point to any part of
>the htdig code, so it's pretty hard to make sense of them.  I'd guess
>that the stack is getting messed up somewhere, causing the program to
>run amuck.  So, we don't have anything particularly conclusive yet,
>but it is interesting to know that the problem happens even for files
>that don't use external parsers or converters.
>
>However, I can't reproduce the problem on my Red Hat 6.2 system.
>I tried with a 56 MB HTML file, with max_doc_size set to 40000000, and
>it ran fine on this file.  Can you get htdig to crash on just one file,
>or does it only happen after indexing many files?  If it fails solidly
>on just one file, please let me know where I can pick up a copy of it
>(please don't e-mail the file to me!) and I'll see if I can reproduce
>the problem.  If it requires indexing many files, maybe I could try
>indexing your site from my system to see if my htdig crashes too.
>
>Have you ruled out the possibility of a hardware problem on your
>Linux box?  If you have a bad memory chip, it could lead to all sorts
>of weirdness.

Whoops!  Our Linux machine is an elderly Pentium I, with 48 MB RAM, and 72
MB of swap. We added another 128 MB of swap space, and the problem has gone
away.
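
In hindsight the back trace fits this: frames #5 to #7 show an allocation
in __builtin_new failing and bad_alloc being raised, which is what running
out of memory looks like. For anyone hitting the same crash, a rough way to
see how much RAM plus swap a box has is to total the MemTotal and SwapTotal
lines of /proc/meminfo. The sketch below is only an illustration; the
function names are made up for this example, and it assumes the kernel
reports those two fields in kB.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Assumption: lines of interest look like 'MemTotal:   49152 kB'.
    Lines whose first value is not a plain number are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def ram_plus_swap_mb(fields):
    """Return MemTotal + SwapTotal in MB (both assumed to be in kB)."""
    total_kb = fields.get("MemTotal", 0) + fields.get("SwapTotal", 0)
    return total_kb / 1024.0


def read_ram_plus_swap_mb(path="/proc/meminfo"):
    """Convenience wrapper: read the file and return RAM + swap in MB."""
    with open(path) as handle:
        return ram_plus_swap_mb(parse_meminfo(handle.read()))
```

Calling read_ram_plus_swap_mb() on the machine that runs the dig gives a
single figure to compare against the size of the largest documents allowed
by max_doc_size. It says nothing about how much memory htdig itself needs
per document, so treat it as a sanity check only.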

We've now indexed over 30,000 documents, the majority of them pdfs and
m*crosoft word docs. The databases occupy about 1.2 GB. The initial dig
took about 15 hours.

Thanks for your help in sorting this out.

Marcus Valentine

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
