thanks a lot to all comments. I updated the hadoopoffice library to include also Stax parsing. In a later stage I will also add reading comments.
On Fri, May 18, 2018 at 11:37 PM Jörn Franke <[email protected]> wrote: > Hi, > > I wonder if someone has already used successfully the StAX parser with > .xlsx files (ie instead of an event driven push model a pull model, cf. > https://docs.oracle.com/javase/tutorial/jaxp/stax/why.html). > > Reason that I ask is that on Big Data platforms (for which we implemented > the HadoopOffice library powered by Apache POI, cf. > https://github.com/ZuInnoTe/hadoopoffice) the event driven model causes a > lot of memory overhead, because virtually all Big Data platforms implement > a pull model, which means if I use the push model provided by the event API > then I need to load the full content in memory to make it available as a > pull model, since those platforms are not event driven. > > I found the StaxHelper class, but I have little idea how it can be used > within Apache POI: > https://poi.apache.org/apidocs/org/apache/poi/util/StaxHelper.html > > > The main goal is to have a light weight approach as proposed by the > current POI event push model for reading .xlsx files ( > https://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api) , but in > form of a pull model, e.g. as illustrated in the following pseudo code: > > XMLInputFactory xmlif; > > XSSFReader.SheetIterator iter; > InputStream currentInputStream; > > XMLStreamReader xmlr; > > > /* function called once at the start of processing */ > public void init() { > r = new XSSFReader( pkg ); > ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg); > // inputstream sheet iterator > iter = (XSSFReader.SheetIterator)r.getSheetsData(); > // XML factory to create Stax Parser > > xmlif = XMLInputFactory.newInstance(); > > } > > > /** the following method is called by the Big Data platform (Flink, Spark, > Hadoop, Hive etc.) do get the next row without reading the full file in > memory as in the DOM or SAX (Push) model > returns null if no further record exist, otherwise sets the current cell > content as String > **/ > public String getNextRow() { > // check if we have data to read > if ((xmlr==null) || xmlr.hasNext()==false) { > if (!iter.hasNext()) { > return null; > } else { > // read sheet into stax parser > > xmlr=xmlif.createXMLStreamReader(iter.next()); > > } > // read the data from stax > // > return xmlr.getText(); // just an example, in fact i need to check for a c > tag, check its type, if it is string then check the string table, otherwise > convert the numeric to a data or indeed a number > } > > Any pitfalls with this approach? I assume I need to link the sharedstring > table somehow to the cell tag (I can derive probably from the source code > of the event API how to do this). > Formula evaluation is not of so much importance for this currently (the > HadoopOffice library offers for formula evaluation to load the full file in > memory). > Of course still the full sharedstringtable needs to be loaded in memory, > but I expect for the data to have a very small sharedstring table and/or > use of numerics. > > Has anyone used the StAX API together with .xlsx files with POI? > > > Thank you. > > best regards >
