[jira] [Commented] (NIFI-14579) Add parameter to configure number of rows used in schema inference with header in ExcelReader service

David Handermann (Jira) Sat, 07 Jun 2025 12:58:05 -0700


    [ https://issues.apache.org/jira/browse/NIFI-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956734#comment-17956734 ]


David Handermann commented on NIFI-14579:
-----------------------------------------

Thanks for the comments [~zhtk].

Changing the RecordReaderFactory interface to take a Supplier doesn't sound 
like a good option. A PushbackInputStream is one potential strategy for 
handling multiple passes.
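
To illustrate the multiple-pass idea, here is a minimal sketch using only the JDK (Java 11 or later). The class name, the peek helper, and the 8-byte limit are hypothetical and are not taken from the NiFi code base. The pushback buffer size caps how much the first pass can consume and later replay.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class TwoPassSketch {

    /**
     * First pass: read at most 'limit' bytes, then push them back so that
     * a later reader sees the stream from the beginning again.
     * 'limit' must not exceed the pushback buffer size of 'in'.
     */
    static byte[] peek(PushbackInputStream in, int limit) throws IOException {
        byte[] head = in.readNBytes(limit); // bounded read, may return fewer bytes
        in.unread(head);                    // replay the same bytes on the next read
        return head;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "header\nrow1\nrow2\n".getBytes(StandardCharsets.UTF_8);
        int limit = 8; // illustrative value only

        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), limit)) {
            byte[] head = peek(in, limit);  // e.g. input for schema inference
            byte[] all = in.readAllBytes(); // second pass: complete content
            System.out.println("first pass bytes: " + head.length);
            System.out.println("second pass bytes: " + all.length);
        }
    }
}
```

The trade-off in this sketch is that the first pass is bounded by a byte count rather than a row count, so any inference step would have to work with whatever fits in the buffer.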

You raise a valid point about the 1 MB limitation and encrypted Excel files, so 
some other limitation strategy sounds necessary. Taking a step back on the 
encrypted input case, it sounds like updating the SplitExcel Processor to 
handle encrypted inputs could be one path forward. Writing out decrypted Excel 
files would then enable more standardized processing through the Excel Reader 
with an infer schema strategy.

Back to the question of scoping the information read for schema inference, 
perhaps Standard, First Half, and All could provide a reasonable set of 
options. The All strategy would cover scenarios where the input could vary 
widely from start to finish. The First Half strategy would cover larger 
variations, and the Standard strategy would use a smaller limit, perhaps moving 
from 10 rows to 100.
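
As a rough sketch of how such options might map to row limits, assuming the total row count is known before inference starts (which may not hold for streamed input): the enum, its method, and the limit of 100 are hypothetical names and values for illustration, not an existing NiFi API.

```java
public class InferenceScopeSketch {

    /** Hypothetical set of scopes for schema inference. */
    enum InferenceScope {
        STANDARD, FIRST_HALF, ALL;

        // Assumed limit, following the "perhaps moving from 10 rows to 100" suggestion.
        static final int STANDARD_ROW_LIMIT = 100;

        /** Number of rows to inspect for a sheet with the given total row count. */
        int rowLimit(int totalRows) {
            switch (this) {
                case STANDARD:
                    return Math.min(STANDARD_ROW_LIMIT, totalRows);
                case FIRST_HALF:
                    return (totalRows + 1) / 2; // round up for odd counts
                default:
                    return totalRows;           // ALL: inspect every row
            }
        }
    }

    public static void main(String[] args) {
        int totalRows = 1001; // illustrative value only
        for (InferenceScope scope : InferenceScope.values()) {
            System.out.println(scope + " -> " + scope.rowLimit(totalRows));
        }
    }
}
```

With options like these, the user selects a named scope and never has to enter a specific row count.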

I'm open to additional variations, but something that avoids having to provide 
a specific number of rows seems the best approach.

> Add parameter to configure number of rows used in schema inference with 
> header in ExcelReader service
> -----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14579
>                 URL: https://issues.apache.org/jira/browse/NIFI-14579
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Assignee: Daniel Stieglitz
>            Priority: Major
>
> Currently the ExcelReader service allows configuring the *Schema Access 
> Strategy* parameter as {*}Use Starting Row{*}. With this setting, only the 
> first 10 rows are used to infer the schema of columns in the sheet, as 
> opposed to the *Infer Schema* strategy.
> My user requests that all rows in the sheet be used to infer the schema. 
> Moreover, the service is used in the QueryRecord processor to limit the 
> number of rows read (as described in NIFI-14427). The user wants to infer the 
> schema only from the rows they read, not from all rows in the sheet.
> It would be great to add a parameter to ExcelReader that allows configuring 
> the number of rows read during schema inference (i.e. the 
> {{ExcelHeaderSchemaStrategy#NUM_ROWS_TO_DETERMINE_TYPES}} variable). The 
> parameter could be shown conditionally based on the value of {*}Schema 
> Access Strategy{*}. A value of 0 could have the special meaning that all rows 
> in the sheet should be read. The default value could be 10 to preserve 
> existing behavior. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
