[
https://issues.apache.org/jira/browse/HIVE-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Janos Kovacs updated HIVE-27927:
--------------------------------
Description:
This is the second phase of the solution to prevent data breaches via Iceberg
data file reads from custom locations, by authorizing the data locations of
tables...
Before detailing one possible solution, let’s list which parts are not
candidates for restriction, and why:
* We need to keep allowing users to write Iceberg tables from Spark, i.e. to
have direct file-system access. We can’t really restrict this type of access.
* Users having direct access to their Iceberg table’s file-system location
will be able to put whatever they want into their Iceberg table’s
metadata/manifests. We can’t really restrict these types of changes.
* Users will have access to Iceberg Metadata Tables as part of Iceberg’s API;
even without it, there would be other options to get the information, or to
simply guess data file locations. We can’t really restrict this information.
The proposed solution has two components:
# Restrict the engines - Hive & Impala - from reading data files from
locations that are not authorized for the table
# Restrict the users from extending the authorized data locations of the tables.
h2. 1. Restricting the data-locations the engines are allowed to read from
Somehow the engine needs to decide whether a data file location coming from the
manifest file is valid and data can be read from it, or it is malicious and
should be rejected. Remember that we can’t block users from creating malicious
manifest files, so the engine must be able to allow/reject locations.
To decide which locations are allowed and which are to be rejected - i.e.
fail/error-out the query - there must be some reference information.
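For illustration, the allow/reject decision could be a normalized prefix match of each manifest-supplied data file path against the allowed base locations. A minimal sketch in Python; the function and parameter names are hypothetical, not from the actual Hive/Impala code:

```python
import posixpath

def is_allowed_location(data_file_path, allowed_base_paths):
    """Return True only if the path sits under one of the allowed bases.

    Hypothetical sketch of the engine-side check; in practice this would
    run where manifest entries are resolved into file scan tasks.
    """
    norm = posixpath.normpath(data_file_path)
    for base in allowed_base_paths:
        base = posixpath.normpath(base)
        # Match whole path components, so /data2 is not accepted for /data.
        if norm == base or norm.startswith(base + "/"):
            return True
    return False
```

A query touching a file outside every allowed base would then fail/error-out instead of reading the data.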
Right now the Iceberg specification declares these table properties:
* write.data.path - Base location for data files (e.g. table location + /data)
* write.metadata.path - Base location for metadata files (table location +
/metadata)
These are the bare minimum locations the engines “could” allow reading data
from - but there are two problems with these properties:
* First, they reflect only the current state of the table: changing them only
affects new snapshots, while older data files can still be located in previous
data locations and are still validly referenced by manifest files.
* Second, these properties live in the Iceberg table’s metadata, which can be
manipulated by any user with write access to the file-system, the same way as
the manifest files.
Based on these problems, the requirements for the reference information the
engines can compare data locations against are:
* Allow multiple location definitions
* Store them in a way where definitions/alterations can be authorized but are
still easily accessible by the engines
_Note: Why multiple locations and not a single one? To remain consistent with
Iceberg’s API and provide its full feature-set; iterating through an array of
locations instead of checking a single value should not make a big difference
in implementation._
The proposed solution is to store the list of allowed locations for a given
table in the HMS Catalog, optionally only in HMS (so that no Iceberg spec
change would be needed), as part of the table definition. Anything stored in
HMS on both the CREATE and ALTER paths can be authorized the same way we
currently authorize the “{*}metadata_location{*}” value.
E.g. there could be a table property - like read.allowed.data.paths - stored
in HMS, pushed down by the engines to the execution layer, and used to
validate that no other locations are accessed.
{noformat}CREATE TABLE …
STORED BY ICEBERG TBLPROPERTIES (
'read.allowed.data.paths'='/some/old/loc/icebergtbl1,/new/loc/icebergtbl1'
){noformat}
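Assuming the comma-separated format shown above, an engine would first split the HMS-stored property value into individual base paths. A sketch; the property name is the one proposed here, not an existing Hive/Iceberg property:

```python
def parse_allowed_data_paths(tbl_properties):
    """Split the proposed read.allowed.data.paths table property
    (comma-separated, as in the CREATE TABLE example) into a path list."""
    raw = tbl_properties.get("read.allowed.data.paths", "")
    return [p.strip() for p in raw.split(",") if p.strip()]
```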
_Note: to make onboarding users to Iceberg easier and faster, the property
could inherit the default location of the data path on table creation - but
only if it’s the ‘warehouse’-based default location! - similar to how the
current metadata_location authorization is exempted when its location is based
on the warehouse-specific default table location._
h2. 2. Restricting the data-locations the users are allowed to set
With the above part of the proposal we can make sure that engines read the
allowed list of locations from a property that can be properly authorized -
from HMS. The second part is to secure what the tables can have as allowed data
locations.
Currently there is a similar authorization when an Iceberg table is
created/altered: authorizing which “{*}metadata_location{*}” path the user is
allowed to create in / modify to.
On Hive side this was fixed via HIVE-27322 and enhanced to reduce overhead for
default locations via HIVE-27714.
As there is already a “location” authorization, restricting users from
defining malicious values for the new allowed list could utilize the same
Authorizer, by adding each of the defined paths to the same authorization
request.
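Reusing the existing location Authorizer could look roughly like this. A sketch only; StorageAuthRequest and its fields are illustrative stand-ins, not the real HiveMetastoreAuthorizer or Ranger types:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StorageAuthRequest:
    # Illustrative stand-in for the request the storage Authorizer checks
    # today for "metadata_location"; not the real Hive/Ranger API.
    table: str
    paths: List[str] = field(default_factory=list)

def build_auth_request(table: str,
                       allowed_data_paths: List[str],
                       metadata_location: Optional[str] = None) -> StorageAuthRequest:
    """Fold every defined data path (plus the metadata location) into one
    authorization request, so a single Ranger RWStorage policy check can
    cover all locations for the table."""
    req = StorageAuthRequest(table=table)
    if metadata_location:
        req.paths.append(metadata_location)
    req.paths.extend(allowed_data_paths)
    return req
```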
If a table is to be relocated, or is simply created with a custom data
location, the user needs a custom Ranger Policy authorizing such a
configuration for the given table(s). This is the same today when a user wants
to create/modify the “metadata_location” - it needs a custom policy. As the
Ranger Policy used to authorize the location (storage-type / RWStorage) allows
multiple locations within the same policy, a single one can cover all possible
locations for the specific table(s).
> Iceberg: Authorize location of Iceberg data reads to tables
> ------------------------------------------------------------
>
> Key: HIVE-27927
> URL: https://issues.apache.org/jira/browse/HIVE-27927
> Project: Hive
> Issue Type: Sub-task
> Components: Iceberg integration
> Affects Versions: 4.0.0-alpha-2
> Reporter: Janos Kovacs
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)