On 9/22/06, Trym B. Asserson <[EMAIL PROTECTED]> wrote:
Any other suggestions? Tomi, you said you'd had difficulties too with certain MS documents; did you manage to find a work-around, or did you just have to ignore these documents? So far we've only concentrated on using the plugins in Nutch 0.8 as they're provided, so we have no experience with OO/UNO. Given that POI seems to deliver reasonably good parsing features for MS formats, we're a bit reluctant to throw it out just yet.
No, I haven't found a work-around yet: it seemed like too much work at the moment. Right now I'm thinking it may not be necessary to dump POI in favour of UNO (although I believe UNO would be better in the long term): maybe it would be possible to work around the exceptions and still get (at least) most of the text content. I'll probably have a look at it one of these days, although I'm a bit sceptical: wouldn't the original plugin authors have already fixed it if they could?

t.n.a.