[
https://issues.apache.org/jira/browse/NIFI-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943521#comment-17943521
]
Piotr Zalas commented on NIFI-14426:
------------------------------------
I have created a
[commit|https://github.com/apache/nifi/commit/5cae9b89e9413815d49eb25a9b0e0171d3bae5ab]
with the implementation. I cannot open a PR against the NiFi repository: it
fails with the error "Pull request creation failed. Validation failed: must be a
collaborator".
The logic for detecting the file type is more complicated than I
expected. Unencrypted XLSX files are stored as OOXML. Legacy XLS files are
stored as OLE2 files. The hard part is encrypted XLSX files, because they
are wrapped in an OLE2 file that contains the encrypted OOXML.
Apache POI differentiates these files by the presence of specific root entries
(as visible in the WorkbookFactory#create(InputStream, String) implementation). It
seems that to open OLE2 files (encrypted XLSX files and legacy XLS files), the
whole content of the file must be loaded into memory by the POIFSFileSystem
class. This class is already used by StreamingReader to decrypt XLSX files (see
StreamingWorkbookReader#init(InputStream)).
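
To illustrate the first step of the detection described above, here is a minimal sketch that uses only the Java standard library to tell an OLE2 container from a ZIP (OOXML) container by its leading bytes. It is neither the code from the commit nor POI's implementation; the class name ContainerSniffer and its members are hypothetical. It also cannot tell a legacy XLS file from an encrypted XLSX file: both report OLE2 here, and separating them requires reading the OLE2 directory, which this sketch does not attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of NiFi or Apache POI.
public class ContainerSniffer {

    /** Outer container format, judged from the leading bytes only. */
    public enum Container { OLE2, ZIP, UNKNOWN }

    // Header signature of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK\003\004"); OOXML packages are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Peeks at the first bytes, then rewinds. The stream must support mark/reset. */
    public static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] head = in.readNBytes(OLE2_MAGIC.length); // requires Java 11+
        in.reset();
        if (startsWith(head, OLE2_MAGIC)) {
            return Container.OLE2; // legacy XLS or encrypted XLSX, undecided here
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Container.ZIP;  // candidate for unencrypted XLSX
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(new ByteArrayInputStream(zipLike)));
    }
}
```

Because the stream is rewound after the check, the same stream can afterwards be handed to whichever reader matches the detected container. Wrapping the input in a BufferedInputStream is one way to get mark/reset support.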
[~dstiegli1], I haven't yet tested the memory usage of the implementation. Can we
agree on some yes/no acceptance criteria? E.g. can we simply say that
reader performance is acceptable when "ExcelReader can load files of 20 MB
in size", or are additional conditions needed (e.g. measured memory usage,
and if so, how to measure it, or testing with a NiFi instance that has specific
settings, configuration, etc.)?
> Add support for HSSF format in ExcelReader processor
> ----------------------------------------------------
>
> Key: NIFI-14426
> URL: https://issues.apache.org/jira/browse/NIFI-14426
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Piotr Zalas
> Assignee: Piotr Zalas
> Priority: Major
>
> Currently the ExcelReader processor supports only files in the newer XSSF
> (.xlsx) format. Add support for the legacy HSSF (.xls) format.