[
https://issues.apache.org/jira/browse/NIFI-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234092#comment-17234092
]
ASF subversion and git services commented on NIFI-7989:
-------------------------------------------------------
Commit edc060bd92b689c4d610f5ac4aef83073167c8a6 in nifi's branch
refs/heads/main from Matt Burgess
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=edc060b ]
NIFI-7989: Add UpdateHiveTable processors for data drift capability
NIFI-7989: Allow for optional blank line after optional column and partition
headers
NIFI-7989: Incorporated review comments
NIFI-7989: Close Statement when finishing processing
NIFI-7989: Remove database name property, update output table attribute
This closes #4653.
Signed-off-by: Peter Turcsanyi <[email protected]>
> Add Hive "data drift" processor
> -------------------------------
>
> Key: NIFI-7989
> URL: https://issues.apache.org/jira/browse/NIFI-7989
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Reporter: Matt Burgess
> Assignee: Matt Burgess
> Priority: Major
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> It would be nice to have a Hive processor (one for each Hive NAR) that could
> check an incoming record-based flowfile against a destination table, add
> columns and/or partition values as needed, or even create the table if it
> does not exist. Such a processor could be used in a flow where the incoming
> data's schema can change and we want to be able to write it to a Hive table,
> preferably by using PutHDFS, PutParquet, or PutORC to place it directly where
> it can be queried.
> Such a processor should be able to use a HiveConnectionPool to execute any
> DDL (ALTER TABLE ADD COLUMN, e.g.) necessary to make the table match the
> incoming data. Partition values could be provided via a property that
> supports Expression Language; in that case, an ALTER TABLE would be issued
> to add the partition directory.
> Whether the table is created or updated, and whether or not partition values
> are involved, an attribute should be written to the outgoing flowfile with
> the location of the table (and any associated partitions).
> This supports the idea of having a flow that updates a Hive table based on
> the incoming data, and then allows the user to put the flowfile directly into
> the destination location (PutHDFS, e.g.) instead of having to load it using
> HiveQL or being subject to the restrictions of Hive Streaming tables
> (ORC-backed, transactional, etc.).
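As a rough illustration of the kind of DDL the description calls for, the sketch below builds ALTER TABLE statements for missing columns and new partitions. This is a minimal, hypothetical helper: the class and method names are assumptions for illustration, not the actual UpdateHiveTable implementation; in the processor, the generated statement would be executed through a HiveConnectionPool-backed JDBC Statement (closed when processing finishes).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of DDL generation for "data drift" handling.
// Names here are illustrative, not NiFi's actual implementation.
public class HiveDdlSketch {

    // Build an ALTER TABLE ... ADD COLUMNS statement for columns present in
    // the incoming record schema but missing from the destination table.
    static String addColumnsDdl(String table, Map<String, String> missingColumns) {
        String cols = missingColumns.entrySet().stream()
                .map(e -> e.getKey() + " " + e.getValue())
                .collect(Collectors.joining(", "));
        return "ALTER TABLE " + table + " ADD COLUMNS (" + cols + ")";
    }

    // Build an ALTER TABLE ... ADD PARTITION statement registering the
    // partition directory for a partition value (e.g. one supplied via an
    // Expression Language property).
    static String addPartitionDdl(String table, String partitionCol, String value) {
        return "ALTER TABLE " + table
                + " ADD IF NOT EXISTS PARTITION (" + partitionCol + "='" + value + "')";
    }

    public static void main(String[] args) {
        Map<String, String> missing = new LinkedHashMap<>();
        missing.put("new_col", "STRING");
        System.out.println(addColumnsDdl("my_table", missing));
        System.out.println(addPartitionDdl("my_table", "dt", "2020-11-17"));
    }
}
```

In the real processor the statement would be run via JDBC (for example, in a try-with-resources block so the Statement is always closed), after which the table/partition location attribute would be written to the outgoing flowfile.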
--
This message was sent by Atlassian Jira
(v8.3.4#803005)