[jira] [Updated] (KUDU-3146) Consistent table metadata for scans

Grant Henke (Jira) Mon, 08 Jun 2020 19:38:59 -0700


     [ 
https://issues.apache.org/jira/browse/KUDU-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Grant Henke updated KUDU-3146:
------------------------------
    Description: 
Currently there is a window of time between generating/deserializing a scan 
token and opening the scanner in which a schema change can result in an invalid 
schema when scanning the table. This is especially the case when a column is 
renamed or dropped and then another column with the same name is created.

This has been partially worked around client side by leveraging column ids and 
mapping the scan token projection to the new schema based on the column ids. 
However, this doesn't work when the scan token sends its own metadata 
(KUDU-1802).
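
The client side workaround can be sketched roughly as follows. This is a 
minimal illustration in plain Python, not the Kudu client API: `Column`, 
`remap_projection` and the ids used are hypothetical stand-ins, chosen only to 
show why mapping by column id survives a rename where mapping by name does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    col_id: int  # stable id assigned when the column is created (assumed)
    name: str    # mutable: changes when the column is renamed


def remap_projection(token_projection, current_schema):
    """Map each projected column onto the current schema by id, not by name."""
    by_id = {c.col_id: c for c in current_schema}
    remapped = []
    for col in token_projection:
        if col.col_id not in by_id:
            raise KeyError(f"column id {col.col_id} ({col.name!r}) no longer exists")
        remapped.append(by_id[col.col_id])
    return remapped


# Schema at the time the scan token was generated.
old_schema = [Column(0, "key"), Column(1, "value")]
# Later, "value" is renamed to "old_value" and a new "value" column is added.
new_schema = [Column(0, "key"), Column(1, "old_value"), Column(2, "value")]

projection = remap_projection(old_schema, new_schema)
print([c.name for c in projection])  # ['key', 'old_value']
```

A lookup by name would have picked the new column with id 2 instead of the 
column the token was generated against.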

We should provide a mechanism to allow the schema to be consistent and 
guaranteed to work from the point a scan token is generated to the time it is 
run/completed.

A simple approach might be to allow column ids to be passed on the scan 
request. Instead of handling the mapping client side, this makes the column ids 
more of a server side concern again (which was the original intent). This is 
also how table renames are handled, by passing the table id. Of course this 
wouldn't support destructive changes such as dropping a column. Supporting those 
would require a much larger change: keeping the dropped column for a period of 
time and using the snapshot time to scan with the schema as of that snapshot.
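
As a rough sketch of what server side resolution could look like, assuming the 
scan request carries a list of column ids: the function name, the exception 
type and the dict-based schema below are all hypothetical and are not taken 
from the Kudu code base.

```python
class ColumnNotFound(Exception):
    """Raised when a requested column id is not in the current schema."""


def resolve_scan_columns(requested_ids, server_schema):
    """Resolve requested column ids against the server's current schema.

    server_schema maps column id -> current column name (an assumed shape).
    """
    resolved = []
    for col_id in requested_ids:
        if col_id not in server_schema:
            raise ColumnNotFound(f"column id {col_id} was dropped")
        resolved.append(server_schema[col_id])
    return resolved


# Column 1 was renamed after the token was generated; its id is unchanged.
server_schema = {0: "key", 1: "old_value", 2: "value"}
print(resolve_scan_columns([0, 1], server_schema))  # ['key', 'old_value']

# A destructive change is not covered: once column 1 is dropped, the scan fails.
del server_schema[1]
try:
    resolve_scan_columns([0, 1], server_schema)
except ColumnNotFound as e:
    print("scan rejected:", e)
```

The rename is handled transparently, while the drop still surfaces as an error, 
which matches the limitation described above.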

The other open issue with column name changes in today's implementation is that 
predicates reference columns by name only, so the client can't map them the 
same way it does for the projection. Using column ids in the predicates would 
be required as well if that approach is taken.
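
To illustrate the predicate problem, here is a small sketch in plain Python. 
The `Predicate` type and both bind functions are hypothetical; in particular, 
carrying a `column_id` on the predicate is the proposed change, not something 
today's predicates are described as doing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    column_name: str  # what predicates carry today
    column_id: int    # assumed addition under the proposed approach
    op: str
    value: object


def bind_by_name(pred, schema):
    """Bind a predicate to a column id using only the column name."""
    for col_id, name in schema.items():
        if name == pred.column_name:
            return col_id
    raise KeyError(pred.column_name)


def bind_by_id(pred, schema):
    """Bind a predicate using the column id recorded at token creation."""
    if pred.column_id not in schema:
        raise KeyError(pred.column_id)
    return pred.column_id


# The predicate was built against column id 1, then named "value".
pred = Predicate(column_name="value", column_id=1, op=">", value=10)

# Since then, id 1 was renamed to "old_value" and a new "value" (id 2) was added.
schema = {0: "key", 1: "old_value", 2: "value"}

print(bind_by_name(pred, schema))  # 2, the wrong column
print(bind_by_id(pred, schema))    # 1, the intended column
```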

  was:
Currently there is a time between generating/deserializing a scan token and 
opening the scanner that can result in an invalid schema when scanning the 
table. This is especially the case when a column is renamed or dropped and then 
another column with the same name is created. 

This has been somewhat worked around client side by leveraging column ids and 
mapping the scan token projection to the new schema based on the column ids. 
However, this doesn't work when the scan token sends its own metadata 
(KUDU-1802). 

We should provide a mechanism to allow the schema to be consistent and 
guaranteed to work from the point a scan token is generated to the time it is 
run/completed.

A simple approach might be to allow column ids to be passed on the scan 
request. Instead of handling the mapping client side, this makes the column ids 
more of a server side concern again (which was the original intent). This is 
also how table renames are handled, by passing the table id. Of course this 
wouldn't support destructive changes such as dropping a column, but that would 
require a much larger change to keep the dropped column for a period of time 
and use the snapshot time to scan using the schema at the given snapshot time.


> Consistent table metadata for scans
> -----------------------------------
>
>                 Key: KUDU-3146
>                 URL: https://issues.apache.org/jira/browse/KUDU-3146
>             Project: Kudu
>          Issue Type: Improvement
>    Affects Versions: 1.12.0
>            Reporter: Grant Henke
>            Priority: Major
>
> Currently there is a time between generating/deserializing a scan token and 
> opening the scanner that can result in an invalid schema when scanning the 
> table. This is especially the case when a column is renamed or dropped and 
> then another column with the same name is created. 
> This has been somewhat worked around client side by leveraging column ids and 
> mapping the scan token projection to the new schema based on the column ids. 
> However, this doesn't work when the scan token sends its own metadata 
> (KUDU-1802). 
> We should provide a mechanism to allow the schema to be consistent and 
> guaranteed to work from the point a scan token is generated to the time it is 
> run/completed.
> A simple approach might be to allow column ids to be passed on the scan 
> request. Instead of handling the mapping client side, this makes the column 
> ids more of a server side concern again (which was the original intent). This 
> is also how table renames are handled, by passing the table id. Of course 
> this wouldn't support destructive changes such as dropping a column, but that 
> would require a much larger change to keep the dropped column for a period of 
> time and use the snapshot time to scan using the schema at the given snapshot 
> time.
> The other open issue with column name changes in today's implementation is 
> that predicates are by name only and the client can't map them the same way 
> it does for the projection. Using column ids in the predicates would be 
> required as well if that approach is taken. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
