[
https://issues.apache.org/jira/browse/HIVE-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Janos Kovacs updated HIVE-27927:
--------------------------------
Description:
This is the second phase of the solution to prevent data breaches via Iceberg
data file reads from custom locations, by authorizing the data locations of
tables...
Before detailing one possible solution, let’s list which parts are not
candidates for restriction, and why:
* We need to keep allowing users to write Iceberg tables from Spark, i.e. to
have direct file-system access. We can’t really restrict this type of access.
* Users having direct access to their Iceberg table’s file-system location
will be able to put whatever they want into their Iceberg table’s
metadata/manifests. We can’t really restrict these types of changes.
* Users will have access to Iceberg Metadata Tables as part of Iceberg’s API;
even without it, there would be other options to get the information, or to
simply guess data file locations. We can’t really restrict this information.
The proposed solution has two components:
# Restrict the engines - Hive & Impala - from reading data files from
locations that are not authorized for the table
# Restrict the users from extending the authorized data locations of the tables.
h2. 1. Restricting the data-locations the engines are allowed to read from
Somehow the engine needs to decide whether a data file location coming from the
manifest file is valid and data can be read from it, or it is malicious and
should be rejected. Remember that we can’t block users from creating malicious
manifest files, so the engine must be able to allow/reject locations.
To decide which locations are allowed and which are to be rejected - i.e.
fail/error-out the query - there must be some reference information.
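For illustration, the allow/reject decision could be a normalized prefix match of each manifest-supplied data file path against the allowed base locations. A minimal sketch in Python; the function and parameter names are hypothetical, not from the actual Hive/Impala code:

```python
import posixpath

def is_allowed_location(data_file_path, allowed_base_paths):
    """Return True only if the path sits under one of the allowed bases.

    Hypothetical sketch of the engine-side check; in practice this would
    run where manifest entries are resolved into file scan tasks.
    """
    norm = posixpath.normpath(data_file_path)
    for base in allowed_base_paths:
        base = posixpath.normpath(base)
        # Match whole path components, so /data2 is not accepted for /data.
        if norm == base or norm.startswith(base + "/"):
            return True
    return False
```

A query touching a file outside every allowed base would then fail/error-out instead of reading the data.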
Right now the Iceberg specification declares these table properties:
* write.data.path - Base location for data files (e.g. table location + /data)
* write.metadata.path - Base location for metadata files (table location +
/metadata)
These are the bare minimum locations the engines “could” allow reading data
from - but there are two problems with these properties:
* First, they reflect only the current state of the table: changing them only
affects new snapshots, while older data files can still be located in previous
data locations and are still validly referenced by manifest files.
* Second, these properties live in the Iceberg table’s metadata, which can be
manipulated by any user with write access to the file-system, the same way as
the manifest files.
Based on these problems, the requirements for the reference information the
engines can compare data locations against are:
* Allow multiple location definitions
* Store them in a way where definitions/alterations can be authorized but are
still easily accessible by the engines
_Note: Why multiple locations and not a single one? To remain consistent with
Iceberg’s API and provide its full feature-set; iterating through an array of
locations instead of checking a single value should not make a big difference
in implementation._
The proposed solution is to store the list of allowed locations for a given
table in the HMS Catalog, optionally only in HMS (so that no Iceberg spec
change would be needed), as part of the table definition. Anything stored in
HMS on both the CREATE and ALTER paths can be authorized the same way we
currently authorize the “{*}metadata_location{*}” value.
E.g. there could be a table property - like read.allowed.data.paths - stored
in HMS, pushed down by the engines to the execution layer, and used to
validate that no other locations are accessed.
{noformat}CREATE TABLE …
STORED BY ICEBERG TBLPROPERTIES (
'read.allowed.data.paths'='/some/old/loc/icebergtbl1,/new/loc/icebergtbl1'
){noformat}
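Assuming the comma-separated format shown above, an engine would first split the HMS-stored property value into individual base paths. A sketch; the property name is the one proposed here, not an existing Hive/Iceberg property:

```python
def parse_allowed_data_paths(tbl_properties):
    """Split the proposed read.allowed.data.paths table property
    (comma-separated, as in the CREATE TABLE example) into a path list."""
    raw = tbl_properties.get("read.allowed.data.paths", "")
    return [p.strip() for p in raw.split(",") if p.strip()]
```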
_Note: to make onboarding users to Iceberg easier and faster, the property
could inherit the default location of the data path on table creation - but
only if it’s the ‘warehouse’-based default location! - similar to how the
current metadata_location authorization is exempted when its location is based
on the warehouse-specific default table location._
h2. 2. Restricting the data-locations the users are allowed to set
With the above part of the proposal we can make sure that engines read the
allowed list of locations from a property that can be properly authorized -
from HMS. The second part is to secure what the tables can have as allowed data
locations.
Currently there is a similar authorization when an Iceberg table is
created/altered: authorizing which “{*}metadata_location{*}” path the user is
allowed to create in / modify to.
On Hive side this was fixed via HIVE-27322 and enhanced to reduce overhead for
default locations via HIVE-27714.
As there is already a “location” authorization, restricting users from
defining malicious values for the new allowed list could utilize the same
Authorizer, by adding each of the defined paths to the same authorization
request.
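Reusing the existing location Authorizer could look roughly like this. A sketch only; StorageAuthRequest and its fields are illustrative stand-ins, not the real HiveMetastoreAuthorizer or Ranger types:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StorageAuthRequest:
    # Illustrative stand-in for the request the storage Authorizer checks
    # today for "metadata_location"; not the real Hive/Ranger API.
    table: str
    paths: List[str] = field(default_factory=list)

def build_auth_request(table: str,
                       allowed_data_paths: List[str],
                       metadata_location: Optional[str] = None) -> StorageAuthRequest:
    """Fold every defined data path (plus the metadata location) into one
    authorization request, so a single Ranger RWStorage policy check can
    cover all locations for the table."""
    req = StorageAuthRequest(table=table)
    if metadata_location:
        req.paths.append(metadata_location)
    req.paths.extend(allowed_data_paths)
    return req
```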
If a table is to be relocated, or is simply created with a custom data
location, the user needs a custom Ranger Policy authorizing such a
configuration for the given table(s). This is the same today when a user wants
to create/modify the “metadata_location” - it needs a custom policy. As the
Ranger Policy used to authorize the location (storage-type / RWStorage) allows
multiple locations within the same policy, a single one can cover all possible
locations for the specific table(s).
> Iceberg: Authorize location of Iceberg data reads to tables
> ------------------------------------------------------------
>
> Key: HIVE-27927
> URL: https://issues.apache.org/jira/browse/HIVE-27927
> Project: Hive
> Issue Type: Sub-task
> Components: Iceberg integration
> Affects Versions: 4.0.0-alpha-2
> Reporter: Janos Kovacs
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)