[
https://issues.apache.org/jira/browse/NIFI-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264456#comment-17264456
]
ASF subversion and git services commented on NIFI-7989:
-------------------------------------------------------
Commit b9076ca26eb444ebda28ef6c5efbb759e4d1af0f in nifi's branch
refs/heads/main from Matt Burgess
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=b9076ca ]
NIFI-7989: Add Update Field Names and Record Writer to UpdateHiveTable
processors
NIFI-7989: Only rewrite records if a field name doesn't match a table column
name exactly
NIFI-7989: Rewrite records for created tables if Update Field Names is set
This closes #4750.
Signed-off-by: Peter Turcsanyi <[email protected]>
> Add Hive "data drift" processor
> -------------------------------
>
> Key: NIFI-7989
> URL: https://issues.apache.org/jira/browse/NIFI-7989
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Reporter: Matt Burgess
> Assignee: Matt Burgess
> Priority: Major
> Fix For: 1.13.0
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> It would be nice to have a Hive processor (one for each Hive NAR) that could
> check an incoming record-based flowfile against a destination table, and
> either add columns and/or partition values, or even create the table if it
> does not exist. Such a processor could be used in a flow where the incoming
> data's schema can change and we want to be able to write it to a Hive table,
> preferably by using PutHDFS, PutParquet, or PutORC to place it directly where
> it can be queried.
> Such a processor should be able to use a HiveConnectionPool to execute any
> DDL (e.g., ALTER TABLE ... ADD COLUMNS) necessary to make the table match
> the incoming data. Partition values could be provided via a property that
> supports Expression Language; in that case, an ALTER TABLE ... ADD PARTITION
> would be issued to add the partition directory.
> Whether the table is created or updated, and whether or not there are
> partition values to consider, an attribute should be written to the outgoing
> flowfile corresponding to the location of the table (and any associated
> partitions). This supports the idea of having a flow that updates a Hive
> table based on the incoming data and then lets the user put the flowfile
> directly into the destination location (e.g., with PutHDFS) instead of
> having to load it via HiveQL or being subject to the restrictions of Hive
> Streaming tables (ORC-backed, transactional, etc.).
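As a rough sketch of the DDL such a processor might issue when it detects
schema or partition drift (table, column, and partition names below are
purely illustrative, not taken from the NiFi implementation):

```sql
-- Hypothetical example: incoming records carry a new "user_agent" field
-- and a partition value not yet present in the destination table.
ALTER TABLE web_events ADD COLUMNS (user_agent STRING);
ALTER TABLE web_events ADD IF NOT EXISTS PARTITION (dt = '2021-01-13');
```

After statements like these succeed, the flowfile could be routed onward
with the resolved table/partition location written as an attribute, so a
follow-on PutHDFS/PutParquet/PutORC can place the data directly.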
--
This message was sent by Atlassian Jira
(v8.3.4#803005)