[jira] [Comment Edited] (NIFI-14579) Add parameter to configure number of rows used in schema inference with header in ExcelReader service

Daniel Stieglitz (Jira) Thu, 22 May 2025 09:18:13 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953501#comment-17953501
 ]


Daniel Stieglitz edited comment on NIFI-14579 at 5/22/25 4:17 PM:
------------------------------------------------------------------

[~exceptionfactory] Would it make sense to add an infer schema strategy which 
reads the whole file but, like the Starting Row Strategy, uses the first row 
read (the one containing the column headers) for the record field names? The 
Starting Row Strategy would remain the efficient option, while users who want 
whole-file inference and still want the column names as the field names could 
use the new one. In short, there would be three strategies: an Infer Schema 
Strategy which reads the whole file and uses the column names as field names, 
the Starting Row Strategy which reads the first 10 rows and uses the column 
names as field names, and an Infer Schema Strategy which reads the whole file 
but does not take the field names from a header row.
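
To make the comparison concrete, here is a minimal Java sketch of the three 
options as two independent choices: how many rows are read, and whether the 
header row supplies the field names. The class, enum, and constant names are 
hypothetical and are not the actual ExcelReader allowable values; only the 
10-row limit comes from the discussion above.

```java
import java.util.OptionalInt;

public class SchemaStrategySketch {

    // Hypothetical names for the three options discussed above; these are
    // NOT the actual NiFi allowable values.
    enum Strategy {
        // Reads the whole file, field names taken from the header row.
        INFER_WITH_HEADER(true, OptionalInt.empty()),
        // Reads only the first 10 rows, field names taken from the header row.
        USE_STARTING_ROW(true, OptionalInt.of(10)),
        // Reads the whole file, field names not taken from a header row.
        INFER_WITHOUT_HEADER(false, OptionalInt.empty());

        final boolean headerAsFieldNames;
        final OptionalInt rowLimit; // empty means "read every row"

        Strategy(boolean headerAsFieldNames, OptionalInt rowLimit) {
            this.headerAsFieldNames = headerAsFieldNames;
            this.rowLimit = rowLimit;
        }

        String describe() {
            String rows = rowLimit.isPresent()
                    ? "first " + rowLimit.getAsInt() + " rows"
                    : "whole file";
            return name() + ": reads " + rows
                    + ", header row as field names = " + headerAsFieldNames;
        }
    }

    public static void main(String[] args) {
        for (Strategy s : Strategy.values()) {
            System.out.println(s.describe());
        }
    }
}
```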


was (Author: JIRAUSER294662):
[~exceptionfactory] Would it make sense to have an infer schema strategy which 
reads the whole file but like the Starting Row Strategy uses the first read row 
containing the column headers for the record field names? The Starting Row 
Strategy could still be for efficiency but for those users who still want the 
column names as the field names could use this. So in short there would an 
Infer Schema Strategy which reads the whole file and uses the column names as 
field names, Starting Row Strategy which reads the first 10 rows and uses the 
column names as field names and Infer schema which reads the whole file but 
does not use any previous column names for the field names.

> Add parameter to configure number of rows used in schema inference with 
> header in ExcelReader service
> -----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14579
>                 URL: https://issues.apache.org/jira/browse/NIFI-14579
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Priority: Major
>
> Currently the ExcelReader service allows the *Schema Access Strategy* 
> parameter to be set to {*}Use Starting Row{*}. With this setting, only the 
> first 10 rows are used to infer the schema of the columns in the sheet, as 
> opposed to the *Infer Schema* strategy.
> My user requests that all rows in the sheet be used to infer the schema. 
> Moreover, the service is used in the QueryRecord processor to limit the 
> number of rows read (as described in NIFI-14427). The user wants to infer 
> the schema only from the rows they read, not from all rows in the sheet.
> It would be great to add a parameter to ExcelReader that allows configuring 
> the number of rows read during schema inference (i.e. the 
> {{ExcelHeaderSchemaStrategy#NUM_ROWS_TO_DETERMINE_TYPES}} variable). The 
> parameter could probably be shown conditionally based on the value of 
> {*}Schema Access Strategy{*}. A value of 0 could have the special meaning 
> that all rows in the sheet should be read. The default value could be 10 to 
> preserve existing behavior. 
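
The proposed row-limit semantics (0 means all rows, default 10) can be 
illustrated with a small, self-contained Java sketch. It does not use the NiFi 
API: the class `RowLimitedInference`, the `FieldType` enum, and the 
string-based type detection are assumptions made only for illustration, and 
rows are modelled as lists of strings rather than Excel cells.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowLimitedInference {

    // Ordered from narrowest to widest so that widening is a simple max.
    enum FieldType { LONG, DOUBLE, STRING }

    static FieldType typeOf(String value) {
        try { Long.parseLong(value); return FieldType.LONG; } catch (NumberFormatException e) { /* try next */ }
        try { Double.parseDouble(value); return FieldType.DOUBLE; } catch (NumberFormatException e) { /* fall through */ }
        return FieldType.STRING;
    }

    static FieldType widen(FieldType a, FieldType b) {
        return a.ordinal() >= b.ordinal() ? a : b;
    }

    /**
     * rows.get(0) is the header row; the remaining rows are data.
     * maxRows == 0 means "use every data row" (the special value proposed above).
     */
    static Map<String, FieldType> inferSchema(List<List<String>> rows, int maxRows) {
        if (maxRows < 0) {
            throw new IllegalArgumentException("maxRows must be >= 0");
        }
        Map<String, FieldType> schema = new LinkedHashMap<>();
        if (rows.isEmpty()) {
            return schema;
        }
        List<String> header = rows.get(0);
        int dataRows = rows.size() - 1;
        int limit = maxRows == 0 ? dataRows : Math.min(maxRows, dataRows);
        for (int r = 1; r <= limit; r++) {
            List<String> row = rows.get(r);
            for (int c = 0; c < header.size() && c < row.size(); c++) {
                schema.merge(header.get(c), typeOf(row.get(c)), RowLimitedInference::widen);
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(List.of("id", "code"));
        for (int i = 1; i <= 11; i++) {
            rows.add(List.of(String.valueOf(i), String.valueOf(100 + i)));
        }
        rows.add(List.of("12", "N/A")); // 12th data row breaks the numeric pattern

        System.out.println(inferSchema(rows, 10)); // {id=LONG, code=LONG}
        System.out.println(inferSchema(rows, 0));  // {id=LONG, code=STRING}
    }
}
```

With the limit of 10, the non-numeric value in the 12th data row is never 
seen, so `code` is inferred as a number; with 0, every row is read and the 
column widens to a string. This is the kind of mismatch the requested 
parameter would let users avoid.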



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
