[
https://issues.apache.org/jira/browse/NIFI-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954145#comment-17954145
]
David Handermann commented on NIFI-14579:
-----------------------------------------
Reviewing the {{Infer Schema}} and {{Use Starting Row}} strategies, it seems
like the first step would be to align the two approaches. Infer Schema limits
the processing to the first 1 MB, so changing the Use Starting Row strategy to
follow a similar approach would at least align the behavior.
That said, the 1 MB limit is also arbitrary, even though it supports
reading more than the first 10 rows.
It is worth noting that inferring the schema is an expensive operation, as it
requires reading the input once for inference, and again for actual processing.
Supporting the ability to read all rows for schema inference with either
strategy opens up the potential for performance issues, but with small or
reasonably sized Excel documents, reading all rows should be acceptable.
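
To illustrate the two-pass cost and a byte-bounded sample, here is a minimal Java sketch. It reads at most a fixed number of bytes for inference, then rewinds so that processing starts again from the first byte. The names {{BoundedSampleReader}}, {{sampleAndRewind}} and {{SAMPLE_LIMIT}} are hypothetical, and the sketch works on a generic byte stream. It does not show how NiFi applies its limit to Excel content.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical sketch, not NiFi code: sample a bounded prefix, then rewind.
final class BoundedSampleReader {

    // Assumed limit, mirroring the 1 MB figure mentioned in the comment above.
    static final int SAMPLE_LIMIT = 1024 * 1024;

    /**
     * Reads up to {@code limit} bytes for inference and resets the stream,
     * so the caller can process the same input again from the start.
     */
    static byte[] sampleAndRewind(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit); // reset() stays valid while at most 'limit' bytes are read
        byte[] buffer = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buffer, total, limit - total);
            if (n < 0) {
                break; // input is shorter than the limit
            }
            total += n;
        }
        in.reset(); // second pass begins at the first byte again
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] sample = sampleAndRewind(in, 4);
        System.out.println(sample.length); // prints 4
        System.out.println(in.read());     // prints 1, the stream was rewound
    }
}
```

The trade-off is the one raised above: a larger limit sees more of the input, but the sampled bytes are read twice and held in memory.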
> Add parameter to configure number of rows used in schema inference with
> header in ExcelReader service
> -----------------------------------------------------------------------------------------------------
>
> Key: NIFI-14579
> URL: https://issues.apache.org/jira/browse/NIFI-14579
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Piotr Zalas
> Priority: Major
>
> Currently, the ExcelReader service allows the *Schema Access Strategy*
> parameter to be set to {*}Use Starting Row{*}. With this setting, only the
> first 10 rows are used to infer the schema of the columns in the sheet, as
> opposed to the *Infer Schema* strategy.
> My user requests that all rows in the sheet be used to infer the schema.
> Moreover, the service is used in the QueryRecord processor to limit the
> number of rows read (as described in NIFI-14427). The user wants to infer
> the schema only from the rows they read, not from all rows in the sheet.
> It would be great to add a parameter to ExcelReader that configures the
> number of rows read during schema inference (i.e. the
> {{ExcelHeaderSchemaStrategy#NUM_ROWS_TO_DETERMINE_TYPES}} variable). The
> parameter could be shown conditionally, based on the value of {*}Schema
> Access Strategy{*}. A value of 0 could have the special meaning that all
> rows in the sheet should be read. The default value could be 10 to preserve
> the existing behavior.
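
The proposed row-count semantics (default 10, with 0 meaning "all rows") can be sketched in plain Java as follows. This is only an illustration under stated assumptions: {{RowLimitedInference}}, {{infer}}, {{ColumnType}} and {{DEFAULT_ROWS}} are hypothetical names, cells are modelled as strings, and none of this reflects the actual {{ExcelHeaderSchemaStrategy}} implementation or the NiFi property API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of row-limited type inference, not NiFi code.
final class RowLimitedInference {

    // Assumed default that preserves the 10-row behavior described in the issue.
    static final int DEFAULT_ROWS = 10;

    // Ordered from narrowest to widest so that widening is a max over ordinals.
    enum ColumnType { UNKNOWN, LONG, DOUBLE, STRING }

    /**
     * Treats rows.get(0) as the header and infers one type per column from at
     * most maxRows of the following rows. A maxRows of 0 means every row.
     */
    static Map<String, ColumnType> infer(List<List<String>> rows, int maxRows) {
        if (maxRows < 0) {
            throw new IllegalArgumentException("maxRows must be >= 0");
        }
        Map<String, ColumnType> schema = new LinkedHashMap<>();
        if (rows.isEmpty()) {
            return schema;
        }
        List<String> header = rows.get(0);
        for (String name : header) {
            schema.put(name, ColumnType.UNKNOWN);
        }
        int dataRows = rows.size() - 1;
        int limit = (maxRows == 0) ? dataRows : Math.min(maxRows, dataRows);
        for (int r = 1; r <= limit; r++) {
            List<String> row = rows.get(r);
            for (int c = 0; c < header.size() && c < row.size(); c++) {
                String name = header.get(c);
                schema.put(name, widen(schema.get(name), classify(row.get(c))));
            }
        }
        return schema;
    }

    // Classifies a single cell value; blank cells carry no type information.
    static ColumnType classify(String value) {
        if (value == null || value.trim().isEmpty()) {
            return ColumnType.UNKNOWN;
        }
        String v = value.trim();
        try {
            Long.parseLong(v);
            return ColumnType.LONG;
        } catch (NumberFormatException ignored) {
            // not an integer, try a wider type
        }
        try {
            Double.parseDouble(v);
            return ColumnType.DOUBLE;
        } catch (NumberFormatException ignored) {
            return ColumnType.STRING;
        }
    }

    // Keeps the wider of the two types.
    static ColumnType widen(ColumnType a, ColumnType b) {
        return a.ordinal() >= b.ordinal() ? a : b;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "amount"),
                Arrays.asList("1", "2"),
                Arrays.asList("2", "3.5"),
                Arrays.asList("x", "4"));
        System.out.println(infer(rows, 1)); // {id=LONG, amount=LONG}
        System.out.println(infer(rows, 0)); // {id=STRING, amount=DOUBLE}
    }
}
```

The example shows why the row count matters: with one sampled row both columns look like integers, while reading every row widens {{id}} to a string and {{amount}} to a double.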