[ 
https://issues.apache.org/jira/browse/NIFI-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955668#comment-17955668
 ] 

Piotr Zalas commented on NIFI-14579:
------------------------------------

To add my 2 cents, I think this is a larger problem with the 
RecordReaderFactory#createRecordReader method signature. The method takes an 
input stream. It would be better for it to take a Supplier that produces the 
stream. This way we could read the whole stream twice: once for schema 
inference and a second time to actually read the file. But such a change would 
be a breaking ABI change.
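
A minimal sketch of the Supplier idea, using plain JDK classes and simple 
comma-separated text instead of Excel. All names here (TwoPassReadSketch, 
inferColumnCount, readRecords) are hypothetical and are not part of the NiFi 
API; the only point is that each pass asks the Supplier for a fresh stream, so 
no mark/reset is needed:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class TwoPassReadSketch {

    // Hypothetical first pass ("schema inference"): count columns in the header line.
    static int inferColumnCount(Supplier<InputStream> source) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source.get(), StandardCharsets.UTF_8))) {
            String header = reader.readLine();
            return header == null ? 0 : header.split(",").length;
        }
    }

    // Hypothetical second pass: read every data line from a fresh stream.
    static List<String> readRecords(Supplier<InputStream> source) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source.get(), StandardCharsets.UTF_8))) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "id,name\n1,a\n2,b\n".getBytes(StandardCharsets.UTF_8);
        // Each call to get() returns a new stream positioned at the start.
        Supplier<InputStream> source = () -> new ByteArrayInputStream(content);
        System.out.println(inferColumnCount(source));
        System.out.println(readRecords(source));
    }
}
```

In this sketch the Supplier is backed by a byte array; in a real flow it would 
have to be backed by something that can reopen the content, which is an 
assumption this example does not address.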

The problem with the 1 MB mark on the input stream is that for encrypted Excel 
files the whole file must be read to obtain the decrypted stream, because it is 
wrapped in an OLE 2 compound document. Even when we use StreamingReader to 
minimise memory usage, its implementation uses POIFSFileSystem to read the 
whole encrypted content of the file before creating the decrypted stream. With 
the current implementation, inferring the schema from just the first 10 rows of 
an encrypted file causes the whole stream to be read. If the file is larger 
than 1 MB, this can later cause problems when going back to the beginning of 
the input stream to read the actual records. There is no workaround for this 
issue because OLE 2 isn't streamable.
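
The failure mode can be illustrated with a plain JDK BufferedInputStream and 
tiny sizes standing in for the 1 MB limit. This is only a sketch of the general 
mark/reset behaviour, not the actual NiFi code path; the class and method names 
are hypothetical:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkLimitSketch {

    // Marks the stream, reads bytesToRead bytes (standing in for the inference
    // pass), then tries to rewind. Returns true if reset() succeeded and false
    // if the mark had been invalidated by reading past the limit.
    static boolean canRewindAfterReading(int totalBytes, int bufferSize,
                                         int markLimit, int bytesToRead) throws IOException {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[totalBytes]), bufferSize);
        in.mark(markLimit);
        for (int i = 0; i < bytesToRead; i++) {
            if (in.read() == -1) {
                break; // end of stream
            }
        }
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Reading only a few bytes stays within the mark limit.
        System.out.println(canRewindAfterReading(100, 8, 4, 3));
        // Reading the whole stream (as the encrypted Excel case forces) does not.
        System.out.println(canRewindAfterReading(100, 8, 4, 100));
    }
}
```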

One solution could be to load the whole input stream into memory, without 
limiting the buffer to 1 MB, but that obviously causes a heavy memory footprint 
for larger files.

> Add parameter to configure number of rows used in schema inference with 
> header in ExcelReader service
> -----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14579
>                 URL: https://issues.apache.org/jira/browse/NIFI-14579
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Assignee: Daniel Stieglitz
>            Priority: Major
>
> Currently ExcelReader service allows to configure *Schema Access Strategy* 
> parameter as {*}Use Starting Row{*}. With this parameter, only 10 first rows 
> are used to infer schema of columns in the sheet, as opposed to *Infer 
> Schema* strategy.
> My user requests that all rows in the sheet are used to infer schema. 
> Moreover, the service is used in QueryRecord processor to limit number of 
> rows read (as described in NIFI-14427). The user wants to infer schema only 
> from rows they read, not all rows in the sheet.
> It would be great to add parameter to ExcelReader that allows to configure 
> number of rows read during schema inference (i.e. 
> {{ExcelHeaderSchemaStrategy#NUM_ROWS_TO_DETERMINE_TYPES}} variable). The 
> parameter could probably show conditionally based on value of {*}Schema 
> Access Strategy{*}. Value 0 could have special meaning that all rows in the 
> sheet should be read. The default value could be 10 to preserve existing 
> behavior. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
