[jira] [Comment Edited] (NIFI-11167) Add Excel Record Reader

Philipp Korniets (Jira) Wed, 29 Nov 2023 09:51:04 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791207#comment-17791207
 ]


Philipp Korniets edited comment on NIFI-11167 at 11/29/23 5:50 PM:
-------------------------------------------------------------------

I might be wrong, but your suggestion has its limitations.
For example, let's assume I have a schema for the above file:

{code:json}
{
    "type": "record",
    "name": "test",
    "fields": [
        {"name":"Descr","type":["string","null"]},
        {"name":"empty","type":["string","null"]},
        {"name":"Events","type":["string","null"]},
        {"name":"CostPerEvent","type":["string","null"]},
        {"name":"Amount","type":["double","null"]}
    ]
}
{code}

My select statement will be "select sum(Amount) from flowfile", and all works 
well.
Next, the file provider adds a column called *MonthlyCost* before the Amount column.
My schema will still work, but the calculation will be based on a completely 
different column/values. In the scenario where CSVReader has *Use String Fields 
From Header*, this will not happen.
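
To make the scenario concrete, here is a minimal Python sketch. It is not NiFi code: the sample rows, the values, and the helper names are made up for illustration, and it assumes that a fixed schema is applied to columns by position, as described above.

```python
# Hypothetical illustration only; this does not reflect NiFi internals.
SCHEMA_FIELDS = ["Descr", "empty", "Events", "CostPerEvent", "Amount"]


def map_by_position(row, schema_fields):
    """Assign schema field names to values purely by column position."""
    return dict(zip(schema_fields, row))


def map_by_header(header, row):
    """Assign field names taken from the file's own header row."""
    return dict(zip(header, row))


def sum_amount(records):
    """Equivalent in spirit to: select sum(Amount) from flowfile."""
    return sum(float(record["Amount"]) for record in records)


# File after the provider inserted MonthlyCost before Amount (made-up values).
header = ["Descr", "empty", "Events", "CostPerEvent", "MonthlyCost", "Amount"]
rows = [
    ["a", "", "10", "1.5", "100.0", "15.0"],
    ["b", "", "20", "2.0", "200.0", "40.0"],
]

# Positional mapping silently reads the MonthlyCost values as Amount.
positional_total = sum_amount(map_by_position(r, SCHEMA_FIELDS) for r in rows)

# Header-based mapping still finds the real Amount column.
header_total = sum_amount(map_by_header(header, r) for r in rows)
```

With these sample rows the positional total comes from the MonthlyCost column, while the header-based total comes from the actual Amount column, and neither path raises an error.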

We work with a lot of external data providers, and they often change the order 
of the fields. So to avoid any miscalculations we rely on field names as 
they come from the file rather than using schemas.

So, to get back to the initial problem: I have this data as an Excel file and 
am trying to use the new ExcelReader:
- Infer Schema replaces the real column names, which I don't want
- using a hardcoded schema is a questionable solution, see my point above
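
As a rough sketch of the behavior being asked for, the Python below builds a schema from the header row with every field typed as a nullable string, similar in spirit to the CSVReader option mentioned above. The function name and the header values are hypothetical, and this is not an existing ExcelReader feature.

```python
import json


def string_schema_from_header(header, record_name="test"):
    """Build an Avro-style record schema in which every header cell
    becomes a nullable string field, in the order found in the file."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [{"name": name, "type": ["string", "null"]} for name in header],
    }


# Header row as it might arrive after the provider added a column (assumed).
header = ["Descr", "empty", "Events", "CostPerEvent", "MonthlyCost", "Amount"]
schema = string_schema_from_header(header)
schema_json = json.dumps(schema, indent=2)
```

Because the field names come from the file itself, a query on Amount keeps pointing at the right column when the provider reorders or adds columns. The trade-off is that every value is a string and has to be cast in the query.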





> Add Excel Record Reader
> -----------------------
>
>                 Key: NIFI-11167
>                 URL: https://issues.apache.org/jira/browse/NIFI-11167
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: David Handermann
>            Assignee: Daniel Stieglitz
>            Priority: Minor
>             Fix For: 2.0.0-M1, 1.23.0
>
>         Attachments: CSVRecordSetWriter_configuration.png, 
> ExcelReaderConfiguration.png, QueryRecord_configuration.png, Test 
> ExcelReader.xlsx, image-2023-11-28-18-22-07-446.png, 
> image-2023-11-29-15-51-08-386.png, resulting.csv, screenshot-1.png
>
>          Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> A new Excel Record Reader should be implemented to support reading XLSX 
> spreadsheet rows as NiFi Records. This Reader will enable integration with 
> various record-oriented components, obviating the need for the narrowly 
> focused ConvertExcelToCSVProcessor. The initial version of the Excel Reader 
> should not support the legacy binary XLS format.
> The ExcelReader should use a library that supports reading from a stream of 
> rows to avoid consuming large amounts of heap memory during processing.
> The ExcelReader should support configurable properties to read selected 
> sheets. With Excel supporting typed field values, some amount of field type 
> mapping will be required. Additional input filtering properties should not be 
> implemented as existing Processors like QueryRecord support a wide variety of 
> filtering and projection use cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
