Hi,

On Wed, 2006-08-30 at 13:25 +0200, Martin Soto wrote:
> The only message that looks suspicious is the one with the "WARN:
> DocumentSummaryInformationStream not found".

I've seen this on troublesome documents before as well, mostly in PowerPoint files. After talking to our resident OLE expert, though, my understanding is that not having that stream is allowed, although uncommon. Is there anything odd about that file? Password protected, perhaps? Created by something other than MS Word?

> There's also no normal shutdown messages, like in other log files.

Yeah, this is the #1 indication that something went wrong. Normally you would see "Exiting" if it shut down cleanly. A Mono app actually spews out a bunch of debug info when it crashes, but for some reason that doesn't seem to get redirected to the log files like other standard output does.

> Could it be that the MS Office parsing library crashed with the last
> document listed?

Yep, this is almost certainly the case.

> Is there a way to run the text extracting code on that document alone to
> see it if works?

Yep, you can confirm this by running the beagle-extract-content program on the file. It should crash in the same manner.

> Fair enough. On the other hand, if the problem is that the C libraries
> crash while parsing some file, one could think of an approach that
> reduces the risk of such an event actually corrupting the index.
> Wouldn't it be possible to parse the document first, storing the text
> somewhere, and only then open the index and write the text into it? It
> would certainly be slower and/or require more memory, but I'd gladly pay
> that price if it actually helps robustness. I think it is not that
> important if initial indexing takes somewhat longer, as long as you know
> you'll system will be reliably indexed in a few days time.

Yeah, you are right, and it'd be possible to do that, although there is another host of problems associated with doing the text extraction entirely up front and caching it. There are essentially no bounds on memory usage, for example, which a streaming setup like the one we have now avoids.

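For illustration only, here is a rough Python sketch of the "extract first, then write to the index" idea. It is not how Beagle works: the extractor command is passed in as a parameter, the `index` object is a plain list standing in for a real index, and the function names are made up for this example. The extractor runs in a child process and its output is spooled to a temporary file on disk, so a crash in the parser cannot touch the index and memory use stays bounded, at the cost of extra disk I/O.

```python
import subprocess
import sys
import tempfile


def extract_to_spool(cmd, path, timeout=60):
    """Run an external extractor on `path` in a child process.

    Returns an open temporary file holding the extractor's stdout,
    or None if the extractor crashed, failed, or timed out.
    """
    spool = tempfile.TemporaryFile()
    try:
        result = subprocess.run(
            cmd + [path],
            stdout=spool,            # text goes to disk, not to memory
            stderr=subprocess.PIPE,  # keep crash output out of the spool
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        spool.close()
        return None
    if result.returncode != 0:       # non-zero exit or killed by a signal
        spool.close()
        return None
    spool.seek(0)
    return spool


def index_file(index, cmd, path):
    """Touch the (hypothetical) index only after extraction succeeded."""
    spool = extract_to_spool(cmd, path)
    if spool is None:
        return False
    with spool:
        for line in spool:           # stream from disk line by line
            index.append(line.decode("utf-8", errors="replace"))
    return True
```

A caller would pass the extractor as a list, for example `index_file(index, ["some-extractor"], "/path/to/file.doc")`, and could log or skip any file for which it returns `False`.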
In my opinion, it would be a lot better to report this crash upstream and try to get it fixed in the Word parsing library rather than essentially create an entirely separate indexing codepath. Part of the beauty of open source is that we can get problems like this fixed at the cause (the wv library); I'm not diametrically opposed to adding a workaround, but I'd prefer it be a last resort.

We use the wv1 library for MS Word support. The project website is http://wvware.sourceforge.net/ and they use the AbiWord bug tracker: http://bugzilla.abisource.com/. If you could file a bug with that document, that would be great.

Joe