On Sun, 13 Dec 2009, Phil Varner wrote:
1) ExtractorFactory uses the ExcelExtractor rather than the EventBasedExcelExtractor, which causes it to OOM for very large workbooks. I was wondering why this was and if it would be reasonable to change it.
The default is to use the UserModel based ones, as they tend to be more accurate and more configurable. However, I don't see why we couldn't add a "boolean preferEventBased" flag to toggle this.
That said, iirc we only have an event based extractor for .xls, so it might not make all that much difference given that all other files you throw at it will take loads of memory again :/
2) Without an event-based extractor for OOXML workbooks, you can never extract text from very large workbooks. I implemented a hacky workaround to read only the shared strings xml doc, but I was wondering if there was a better way to do this or if there was any interest in polishing this into something that could be part of POI.
You could probably base something on XLSX2CSV, which is largely event based.
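To make the event-based idea concrete, here is a rough sketch of streaming just the shared strings part of an .xlsx with a SAX parser, using only the JDK. It is not POI API and not how XLSX2CSV is written: the class and method names are made up for illustration, it assumes the part lives at the conventional xl/sharedStrings.xml path with unprefixed element names, and it will also pick up phonetic-run text if present.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not part of POI): streams text out of sharedStrings.xml. */
final class SharedStringsSketch {

    /** Hands one string per <si> element to the sink, joining all of its <t> runs. */
    static void extract(InputStream sharedStringsXml, Consumer<String> sink) {
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder current = new StringBuilder();
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // The default parser is not namespace-aware, so qName is the raw tag name.
                if (qName.equals("si")) {
                    current.setLength(0);
                } else if (qName.equals("t")) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("t")) {
                    inText = false;
                } else if (qName.equals("si")) {
                    sink.accept(current.toString());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(sharedStringsXml, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse shared strings", e);
        }
    }

    /** Opens the .xlsx as a zip and streams its shared strings part, if it has one. */
    static void extractFromXlsx(String path, Consumer<String> sink) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            // Assumption: conventional part name; strictly it should be resolved
            // through the package relationships.
            ZipEntry entry = zip.getEntry("xl/sharedStrings.xml");
            if (entry == null) {
                return; // workbook has no shared strings part
            }
            try (InputStream in = zip.getInputStream(entry)) {
                extract(in, sink);
            }
        }
    }
}
```

Something like SharedStringsSketch.extractFromXlsx("big.xlsx", System.out::println) would then print each string as it is parsed, so only one string is held in memory at a time. Note it only sees shared strings: numbers and inline strings in the sheets are skipped, which is where an XLSX2CSV-style approach would do better.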
3) QuickButCruddyTextExtractor doesn't extend POIOLE2TextExtractor, and I was wondering if there was a reason why.
It predates the extractor interface by quite a bit, so I'm guessing it was forgotten :/
If you do fancy knocking up some patches for any of this, that'd be very much appreciated :)
Nick
