[
https://issues.apache.org/jira/browse/NIFI-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859726#comment-17859726
]
Brendan Buhr commented on NIFI-12491:
-------------------------------------
[~dstiegli1] I expect that anyone who wants to apply a schema will work the
same way we did before the ExcelReader controller service existed: first split
the workbook (previously with excelToCSV, which is now replaced by the
SplitExcel processor), then feed each split into a record reader with the
ExcelReader defined and a schema per sheet.
# Treating the first row as the header was an option on the CSV Reader, and it
was the default. We are merely looking to get that same functionality on this
reader. As mentioned by [~iiojj2], excelToCSV also had options to skip columns
and rows. Since the Excel split only splits the file, I think it would be best
to bring those two features over to the reader and possibly make them dynamic,
so that they can be set as flow attributes and passed to the reader. A single
reader could then be used on multiple sheets, with the sheet and the
rows/columns to skip supplied as attributes. These would be optional, with the
default being to keep the existing result when they are not defined.
!image-2024-06-24-18-01-49-886.png!
!image-2024-06-24-18-02-36-592.png!
# I have never encountered a case where multiple sheets had the same structure
and the same header applied to all of them, but I can picture a scenario where
data is split into chunks for various reasons and should be treated as one
dataset. We would usually split this and then do a join of some sort before
querying it, but I can see the benefit of joining the data in the reader. I
would make this optional and not the default behavior, so that you still get an
error when no sheet name is specified and can trigger alerts on that.
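
To make point 1 concrete, here is a minimal Python sketch of the behavior I have
in mind. It is not NiFi code: the attribute names and the read_sheet function
are hypothetical, and the rows are assumed to be already extracted from one
sheet as lists of cell values. When no attribute is set, nothing is skipped and
columns get the generic column_N names mentioned in the issue description.

```python
# Hypothetical attribute names, invented for this sketch only.
ATTR_SKIP_ROWS = "excel.rows.to.skip"            # number of leading rows to drop
ATTR_SKIP_COLUMNS = "excel.columns.to.skip"      # comma-separated 0-based indexes
ATTR_FIRST_ROW_HEADER = "excel.first.row.is.header"


def parse_index_list(value):
    """Parse a comma-separated list of 0-based indexes, e.g. "0,2"."""
    return {int(part) for part in value.split(",") if part.strip()}


def read_sheet(rows, attributes):
    """Turn raw sheet rows into records whose values are all strings.

    rows:       list of lists of cell values for a single sheet
    attributes: dict of flow-file-style attributes (all values are strings)
    """
    skip_rows = int(attributes.get(ATTR_SKIP_ROWS, "0"))
    skip_cols = parse_index_list(attributes.get(ATTR_SKIP_COLUMNS, ""))
    first_row_header = attributes.get(ATTR_FIRST_ROW_HEADER, "false").lower() == "true"

    # Drop the leading rows, then drop the unwanted columns from what is left.
    kept = [
        [cell for idx, cell in enumerate(row) if idx not in skip_cols]
        for row in rows[skip_rows:]
    ]
    if not kept:
        return []

    if first_row_header:
        # Field names come from the first remaining row.
        names = [str(cell) for cell in kept[0]]
        data = kept[1:]
    else:
        # Default: generic names, so the result does not depend on a header.
        names = [f"column_{i + 1}" for i in range(len(kept[0]))]
        data = kept

    return [{name: str(cell) for name, cell in zip(names, row)} for row in data]
```

With this shape, one reader definition could serve several sheets, because the
values that differ per sheet travel with the flow file instead of being fixed in
the reader configuration.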
On a side note, one thing I have experienced is data with two header rows where
the first one is a header merged across several cells. The behavior was that
the value got written to the first cell and the rest were blank. Sometimes it
is nice to have that value in all of the formerly merged cells so that you can
combine it with the second-row headers. (This is a nice-to-have and not
relevant to this ticket.)
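
As a rough illustration of that side note, the sketch below copies a merged
header value into the blank cells that follow it and then joins it with the
second header row. It is plain Python with a hypothetical function name and
separator, and it assumes every blank cell in the first row belongs to the
merged cell on its left. A real implementation would more likely consult the
workbook's merged regions instead of guessing from blanks.

```python
def combine_header_rows(top, bottom, separator="_"):
    """Build one field name per column from two header rows.

    top:    first header row, where a merged cell appears as a value
            followed by blank cells (None or empty string)
    bottom: second header row with the per-column names
    """
    # Forward-fill: carry the last non-blank value of the top row to the right.
    filled = []
    last = ""
    for cell in top:
        text = "" if cell is None else str(cell).strip()
        if text:
            last = text
        filled.append(last)

    # Join the group name with the second-row name where both exist.
    names = []
    for group, sub in zip(filled, bottom):
        sub_text = "" if sub is None else str(sub).strip()
        if group and sub_text:
            names.append(f"{group}{separator}{sub_text}")
        else:
            names.append(group or sub_text)
    return names
```

For example, a first row of "Sales", blank, blank and a second row of "Q1",
"Q2", "Q3" would give Sales_Q1, Sales_Q2 and Sales_Q3.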
> ExcelReader - new Schema Access strategy: Use String Fields From Header
> -----------------------------------------------------------------------
>
> Key: NIFI-12491
> URL: https://issues.apache.org/jira/browse/NIFI-12491
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Affects Versions: 1.23.2
> Reporter: Philipp Korniets
> Assignee: Daniel Stieglitz
> Priority: Major
> Attachments: image-2024-06-24-18-01-49-886.png,
> image-2024-06-24-18-02-36-592.png
>
>
> ExcelReader needs an ability similar to CSVReader to "Use String Fields From
> Header" as a Schema Access Strategy.
> Current implementation has:
> 1. Use Schema Name/Schema Text - this option relies on the order of the
> columns. Possible issue: the order of the columns changes, but the types
> don't. This causes further calculations to be erroneous.
> 2. Infer Schema - replaces real column names with column_1, column_2, etc. -
> this again loses the "context" of the column and forces us to rely on how
> columns are ordered.
> Any workaround makes the workflow more complicated.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)