[
https://issues.apache.org/jira/browse/HIVE-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612695#comment-16612695
]
Ashutosh Chauhan commented on HIVE-20536:
-----------------------------------------
Adding tableDesc to GenericUDF is not a good idea. It's a public interface, and
exposing internal structures there isn't useful. Instead, in genFileSinkDesc()
test for the surrogate key UDF and, if found, set the writeId directly on that
UDF. If a qtest is not possible, then let's write a JUnit test for the UDF and
mock the Context object if needed.
Also, can you create an RB for this?
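
For illustration, a rough sketch of what that check inside genFileSinkDesc()
could look like, assuming the UDF class is GenericUDFSurrogateKey with a
setWriteId(long) setter (both names are assumptions here):

    import java.util.List;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDFSurrogateKey; // assumed class

    // Walk the expressions feeding the file sink; when the surrogate key UDF
    // is found, push the write id into it directly.
    private static void setWriteIdOnSurrogateKeyUdf(List<ExprNodeDesc> exprs,
        long writeId) {
      if (exprs == null) {
        return;
      }
      for (ExprNodeDesc expr : exprs) {
        if (expr instanceof ExprNodeGenericFuncDesc) {
          GenericUDF udf = ((ExprNodeGenericFuncDesc) expr).getGenericUDF();
          if (udf instanceof GenericUDFSurrogateKey) {
            ((GenericUDFSurrogateKey) udf).setWriteId(writeId);
          }
        }
        setWriteIdOnSurrogateKeyUdf(expr.getChildren(), writeId); // may be nested
      }
    }
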
> Add Surrogate Keys function to Hive
> -----------------------------------
>
> Key: HIVE-20536
> URL: https://issues.apache.org/jira/browse/HIVE-20536
> Project: Hive
> Issue Type: Task
> Components: Hive
> Reporter: Miklos Gergely
> Assignee: Miklos Gergely
> Priority: Major
> Attachments: HIVE-20536.01.patch
>
>
> Surrogate keys give us the ability to generate and use unique integers for
> each row in a table. If we have that ability, then in conjunction with the
> default clause we get surrogate key functionality. Consider the following DDL:
> create table t1 (a string, b bigint default unique_long());
> We already have the default clause, wherein you can specify a function to
> provide values. So what we need is a UDF which can generate unique longs for
> each row of a table, across queries.
> The idea is to use write_id. This is a column in the metastore table
> TXN_COMPONENTS whose value is determined at compile time, to be used during
> query execution. Each query execution generates a new write_id, so we can
> seed the UDF with this value during compilation.
> Then we statically allocate a range for each task from which it can draw the
> next long. Say we divvy up the 64-bit write_id such that the first 24 bits
> keep its original use, that is, txns. The next 16 bits are used for task
> attempts, and the last 24 bits to generate a new long for each row. This
> implies we can allow ~17M txns, ~65K task attempts, and ~17M rows per task.
> If any of those limits is hit, we can fail the query.
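> To make the bit layout concrete, here is a minimal sketch of the packing and
> the limit checks (the shift constants follow the 24/16/24 split above; all
> names are illustrative):
>
>   private static final int TASK_BITS = 16;
>   private static final int ROW_BITS = 24;
>
>   // 24-bit write_id | 16-bit task attempt | 24-bit per-row counter
>   static long packSurrogateKey(long writeId, long taskAttemptId, long rowSeq) {
>     if (writeId >= (1L << 24) || taskAttemptId >= (1L << TASK_BITS)
>         || rowSeq >= (1L << ROW_BITS)) {
>       // Any exhausted sub-range means uniqueness can no longer be guaranteed.
>       throw new IllegalStateException("surrogate key space exhausted");
>     }
>     return (writeId << (TASK_BITS + ROW_BITS))
>         | (taskAttemptId << ROW_BITS)
>         | rowSeq;
>   }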
> Implementation-wise: serialize the write_id in the UDF's initialize(). Then
> during execute() we find out which task attempt the current task is, and use
> it along with the write_id to compute the starting long, handing out a new
> value on each invocation of execute().
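> A minimal sketch of such a UDF, assuming the task attempt id can be read via
> configure() from the MapredContext and the write_id arrives through a setter
> (both assumptions; GenericUDF's actual per-row method is evaluate(), which
> is what execute() refers to above):
>
>   import org.apache.hadoop.hive.ql.exec.MapredContext;
>   import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
>   import org.apache.hadoop.hive.ql.metadata.HiveException;
>   import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
>   import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
>   import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
>
>   public class GenericUDFSurrogateKey extends GenericUDF {
>     private long writeId;       // seeded at compile time, serialized with the plan
>     private long taskAttemptId; // discovered at runtime
>     private long rowSeq = 0;    // per-row counter occupying the last 24 bits
>
>     public void setWriteId(long writeId) { // set by the compiler at planning time
>       this.writeId = writeId;
>     }
>
>     @Override
>     public ObjectInspector initialize(ObjectInspector[] args)
>         throws UDFArgumentException {
>       return PrimitiveObjectInspectorFactory.javaLongObjectInspector;
>     }
>
>     @Override
>     public void configure(MapredContext context) {
>       // One way the task attempt could be discovered at runtime; an assumption.
>       taskAttemptId = context.getJobConf().getInt("mapred.task.partition", 0);
>     }
>
>     @Override
>     public Object evaluate(DeferredObject[] args) throws HiveException {
>       if (rowSeq >= (1L << 24)) {
>         throw new HiveException("Exhausted surrogate key space for this task");
>       }
>       return (writeId << 40) | (taskAttemptId << 24) | rowSeq++;
>     }
>
>     @Override
>     public String getDisplayString(String[] children) {
>       return "surrogate_key()";
>     }
>   }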
> Here we are assuming the write_id can be determined at compile time, which
> should be the case, but we need to figure out how to get a handle to it.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)