[jira] [Comment Edited] (NIFI-14579) Add parameter to configure number of rows used in schema inference with header in ExcelReader service

Daniel Stieglitz (Jira) Mon, 09 Jun 2025 08:53:09 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17957026#comment-17957026
 ]


Daniel Stieglitz edited comment on NIFI-14579 at 6/9/25 3:52 PM:
-----------------------------------------------------------------

[~exceptionfactory] Just as a reminder, NIFI-14538 already addresses updating
the SplitExcel Processor to handle encrypted inputs. It is waiting on the next
release of POI, which has a bug fix needed for SplitExcel to work properly.

I already have an implementation of your suggestions (from 5/26/2025), except
for Standard being 100 rows and the First Half strategy. It uses a Standard of
10 rows and an All strategy, in a class named ExcelHeaderSchemaInference which
implements the SchemaInferenceEngine<Row> interface. This class also
incorporates all the functionality ExcelHeaderSchemaStrategy had, including
the latest changes made in PR
[#9975|https://github.com/apache/nifi/pull/9975]. Would you like me to submit
a PR for what I already have, or should I make further changes so that
Standard is 100 rows and the First Half strategy is added?
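
For illustration only, the Standard and All strategies mentioned above could be
modelled roughly as below. This is a minimal sketch, not the code in
ExcelHeaderSchemaInference: the enum name InferenceRowStrategy and the method
includes are hypothetical, and the First Half strategy is left out because its
exact semantics are not spelled out in this thread.

```java
// Hypothetical sketch; the names below are assumptions, not NiFi API.
enum InferenceRowStrategy {
    // Inspect only the first 10 data rows (the current behavior).
    STANDARD(10),
    // Inspect every data row in the sheet.
    ALL(Integer.MAX_VALUE);

    private final int maxRows;

    InferenceRowStrategy(final int maxRows) {
        this.maxRows = maxRows;
    }

    // Returns true while the zero-based data row index is within the limit.
    boolean includes(final int dataRowIndex) {
        return dataRowIndex >= 0 && dataRowIndex < maxRows;
    }
}
```

With this shape, changing Standard to 100 rows would be a one-constant change,
and a row loop could stop as soon as includes returns false.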


was (Author: JIRAUSER294662):
[~exceptionfactory] Just as a reminder, NIFI-14538 already addresses updating
the SplitExcel Processor to handle encrypted inputs. It is waiting on the next
release of POI, which has a bug fix needed for SplitExcel to work properly.

I already have an implementation of your suggestions, except for Standard
being 100 rows and the First Half strategy. It uses a Standard of 10 rows and
an All strategy, in a class named ExcelHeaderSchemaInference which implements
the SchemaInferenceEngine<Row> interface. This class also incorporates all the
functionality ExcelHeaderSchemaStrategy had, including the latest changes made
in PR [#9975|https://github.com/apache/nifi/pull/9975]. Would you like me to
submit a PR for what I already have, or should I make further changes so that
Standard is 100 rows and the First Half strategy is added?

> Add parameter to configure number of rows used in schema inference with 
> header in ExcelReader service
> -----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14579
>                 URL: https://issues.apache.org/jira/browse/NIFI-14579
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Assignee: Daniel Stieglitz
>            Priority: Major
>
> Currently the ExcelReader service allows the *Schema Access Strategy*
> parameter to be configured as {*}Use Starting Row{*}. With this setting,
> only the first 10 rows are used to infer the schema of columns in the sheet,
> as opposed to the *Infer Schema* strategy.
> My user requests that all rows in the sheet be used to infer the schema.
> Moreover, the service is used in the QueryRecord processor to limit the
> number of rows read (as described in NIFI-14427). The user wants to infer the
> schema only from the rows they read, not from all rows in the sheet.
> It would be great to add a parameter to ExcelReader that configures the
> number of rows read during schema inference (i.e. the
> {{ExcelHeaderSchemaStrategy#NUM_ROWS_TO_DETERMINE_TYPES}} variable). The
> parameter could probably be shown conditionally based on the value of
> {*}Schema Access Strategy{*}. A value of 0 could have the special meaning
> that all rows in the sheet should be read. The default value could be 10 to
> preserve the existing behavior.
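
The semantics proposed in the description (a default of 10, with 0 meaning
"all rows") might look roughly like the sketch below. It is not taken from the
NiFi code base: the class InferenceRowLimit and the method rowsForInference
are hypothetical, and rejecting negative values is an assumption the issue
does not state.

```java
import java.util.List;

// Hypothetical helper, not part of NiFi; it only illustrates the proposal.
final class InferenceRowLimit {
    // Mirrors the current fixed value of 10 rows described in the issue.
    static final int DEFAULT_ROWS = 10;

    private InferenceRowLimit() {
    }

    // Returns the leading rows that would be examined during schema inference.
    // A configured value of 0 means "use every row"; negatives are rejected
    // (an assumption, since the issue does not define them).
    static <T> List<T> rowsForInference(final List<T> rows, final int configuredRows) {
        if (configuredRows < 0) {
            throw new IllegalArgumentException("Row count must not be negative: " + configuredRows);
        }
        if (configuredRows == 0 || configuredRows >= rows.size()) {
            return rows;
        }
        return rows.subList(0, configuredRows);
    }
}
```

Keeping the default at 10 means existing flows would infer the same schema as
before unless the new parameter is set explicitly.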



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
