[jira] [Commented] (NIFI-14579) Add parameter to configure number of rows used in schema inference with header in ExcelReader service

Piotr Zalas (Jira) Thu, 22 May 2025 07:52:04 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953478#comment-17953478
 ]


Piotr Zalas commented on NIFI-14579:
------------------------------------

Thanks for the valuable remarks.

[~exceptionfactory], you are correct about the data format: the numbers of the 
first and last Excel rows to read are specified as attributes on the Excel 
file. Within the same flow, the rows to read can vary: for one file it might 
be rows 10 to 100, and for the next flowfile in the queue it might be rows 200 
to 2000. I don't have insight into the content of the files passed through the 
flow, so I can't say whether, for example, the user wants to read the first 
100 rows, which hold integers, and skip the remaining rows, which hold 
strings. When I asked the user for clarification, they said they prefer to 
have the schema inferred only from the rows that are read, to avoid potential 
issues with data types.
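
To illustrate the difference, here is a minimal Java sketch of inferring a 
column type only from the rows named in the attributes. It is not NiFi code: 
the class, the attribute names (excel.first.row, excel.last.row) and the 
two-value type enum are hypothetical, and header handling is left out.

```java
import java.util.List;
import java.util.Map;

final class RangeScopedInference {
    enum FieldType { INT, STRING }

    // Hypothetical attribute names, chosen only for this illustration.
    static final String FIRST_ROW_ATTR = "excel.first.row";
    static final String LAST_ROW_ATTR = "excel.last.row";

    /**
     * Infers the type of one column from the 1-based, inclusive row range
     * given in the attributes. Rows outside the range are never inspected.
     */
    static FieldType inferType(List<String> column, Map<String, String> attributes) {
        int first = Integer.parseInt(attributes.get(FIRST_ROW_ATTR));
        int last = Math.min(Integer.parseInt(attributes.get(LAST_ROW_ATTR)), column.size());
        if (first < 1) {
            throw new IllegalArgumentException("first row must be >= 1");
        }
        boolean sawValue = false;
        for (int row = first; row <= last; row++) {
            sawValue = true;
            // Any non-integer cell in the range widens the column to STRING.
            if (!column.get(row - 1).matches("-?\\d+")) {
                return FieldType.STRING;
            }
        }
        // With no rows in range there is nothing to infer from; fall back to STRING.
        return sawValue ? FieldType.INT : FieldType.STRING;
    }
}
```

For a column holding 1, 2, 3, x, y, the range 1 to 3 yields INT, while any 
range that reaches row 4 yields STRING.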

[~dstiegli1], I was thinking of using Infer Schema, but I have a strict 
requirement that the first row read contains the header with column names. I 
believe reading a header row is only supported with the Use Starting Row 
strategy.

> Add parameter to configure number of rows used in schema inference with 
> header in ExcelReader service
> -----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14579
>                 URL: https://issues.apache.org/jira/browse/NIFI-14579
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Priority: Major
>
> Currently the ExcelReader service allows setting the *Schema Access 
> Strategy* parameter to *Use Starting Row*. With this setting, only the first 
> 10 rows are used to infer the schema of the columns in the sheet, as opposed 
> to the *Infer Schema* strategy.
> My user requests that all rows in the sheet be used to infer the schema. 
> Moreover, the service is used in the QueryRecord processor to limit the 
> number of rows read (as described in NIFI-14427). The user wants to infer the 
> schema only from the rows they read, not from all rows in the sheet.
> It would be great to add a parameter to ExcelReader that allows configuring 
> the number of rows read during schema inference (i.e. the 
> {{ExcelHeaderSchemaStrategy#NUM_ROWS_TO_DETERMINE_TYPES}} variable). The 
> parameter could probably be shown conditionally, based on the value of 
> *Schema Access Strategy*. A value of 0 could have the special meaning that 
> all rows in the sheet should be read. The default value could be 10 to 
> preserve the existing behavior.
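
The proposed semantics of the parameter (0 means all rows, 10 by default) 
could be resolved roughly as follows. This is only a sketch under assumptions: 
the class and method names are hypothetical and do not come from the NiFi 
code base; only the default of 10 and the meaning of 0 are taken from the 
description above.

```java
final class InferenceRowLimit {
    // Mirrors the current fixed value mentioned in the issue description.
    static final int DEFAULT_LIMIT = 10;
    // Proposed special value: inspect every row in the sheet.
    static final int ALL_ROWS = 0;

    /**
     * Returns how many data rows schema inference should inspect.
     *
     * @param configured  the configured limit, or null when the property is unset
     * @param rowsInSheet the number of data rows actually available
     */
    static int resolve(Integer configured, int rowsInSheet) {
        int limit = configured == null ? DEFAULT_LIMIT : configured;
        if (limit < 0) {
            throw new IllegalArgumentException("row limit must be >= 0");
        }
        // Never report more rows than the sheet actually contains.
        return limit == ALL_ROWS ? rowsInSheet : Math.min(limit, rowsInSheet);
    }
}
```

Leaving the property unset keeps today's behavior of inspecting at most 10 
rows, so existing flows would not change unless the value is set explicitly.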



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
