Re: excel text extraction

Nick Burch Sun, 03 Jan 2010 12:48:51 -0800

On Sun, 13 Dec 2009, Phil Varner wrote:

1) ExtractorFactory uses the ExcelExtractor rather than the EventBasedExcelExtractor, which causes it to OOM for very large workbooks. I was wondering why this was and if it would be reasonable to change it.

The default is to use the UserModel based ones, as they tend to be more accurate and more configurable. However, I don't see why we couldn't add a "boolean preferEventBased" flag to toggle this.

That said, iirc we only have an event based extractor for .xls, so it might not make all that much difference given that all other files you throw at it will take loads of memory again :/

2) Without an event-based extractor for OOXML workbooks, you can never extract text from very large workbooks. I implemented a hacky workaround to read only the shared strings xml doc, but I was wondering if there was a better way to do this or if there was any interest in polishing this into something that could be part of POI.


You could probably base something on XLSX2CSV which is largely event based
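
For anyone following along, here is a rough sketch of the kind of shared-strings-only workaround described in point 2. It is not Phil's code and not XLSX2CSV; it assumes only that an .xlsx file is a zip whose xl/sharedStrings.xml part holds the string table, and it uses nothing beyond the JDK's zip and SAX classes. The class and method names (SharedStringsText, extractText) are made up for illustration. It only sees string cells, and it returns them in string-table order rather than sheet order.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: streams the shared strings part
// of an .xlsx with SAX so the workbook is never loaded into memory.
final class SharedStringsText {

    /** Returns one line of text per <si> (shared string item) entry. */
    static String extractText(InputStream xlsx) throws Exception {
        final StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"xl/sharedStrings.xml".equals(entry.getName())) {
                    continue; // skip every other part of the package
                }
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true); // so localName is populated
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                factory.newSAXParser().parse(zip, new DefaultHandler() {
                    private boolean inText;     // inside a <t> element
                    private boolean inPhonetic; // inside an <rPh> phonetic run

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("t".equals(localName)) {
                            inText = true;
                        } else if ("rPh".equals(localName)) {
                            inPhonetic = true;
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName,
                                           String qName) {
                        if ("t".equals(localName)) {
                            inText = false;
                        } else if ("rPh".equals(localName)) {
                            inPhonetic = false;
                        } else if ("si".equals(localName)) {
                            out.append('\n'); // one line per shared string
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        // Rich text runs are concatenated; phonetic hints are skipped.
                        if (inText && !inPhonetic) {
                            out.append(ch, start, length);
                        }
                    }
                });
                return out.toString(); // only one shared strings part is expected
            }
        }
        return out.toString(); // no shared strings part found
    }
}
```

Calling SharedStringsText.extractText(new FileInputStream(path)) should keep memory use roughly proportional to the extracted text rather than to the whole workbook. Numbers, inline strings and formula results are not covered, which is why something event based along the lines of XLSX2CSV would be the more complete route.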

3) QuickButCruddyTextExtractor doesn't extend POIOLE2TextExtractor,
and I was wondering if there was a reason why.

It predates the extractor interface by quite a bit, so I'm guessing it was forgotten :/

If you do fancy knocking up some patches for any of this, that'd be very much appreciated :)


Nick

