[
https://issues.apache.org/jira/browse/HADOOP-14837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802877#comment-17802877
]
ASF GitHub Bot commented on HADOOP-14837:
-----------------------------------------
bpahuja opened a new pull request, #6407:
URL: https://github.com/apache/hadoop/pull/6407
### Description of PR
Currently S3A does not treat Glacier and Glacier Deep Archive objects specially:
it neither examines the storage class nor checks whether an object is still
being restored from Glacier. Attempting to read an unrestored Glacier object
through S3A results in an AmazonS3Exception stating that the operation is not
valid for the object's storage class.
With this change, users will be able to read restored Glacier objects from the
S3 location of a table using S3A, while Glacier objects still being restored
asynchronously are skipped. The existing default behavior is unchanged; a new
configuration option must be set to enable the flow described above.
The behavior of `S3AFileSystem` with respect to Glacier storage classes is
controlled by the new option `fs.s3a.glacier.read-restored-objects`, which
accepts three values:
- `READ_ALL`: the current default behavior of ignoring the storage classes
returned by S3, so the existing experience is unchanged for users.
- `SKIP_ALL_GLACIER`: skip any S3 objects tagged with a Glacier storage class
and retrieve all others.
- `READ_RESTORED_GLACIER_OBJECTS`: check the restore status of each Glacier
object; restored objects are read like normal S3 objects, while unrestored
objects are skipped, since their data has not been retrieved from Glacier.
(The restore status is read from the newly introduced `RestoreStatus` field of
the `S3Object`.) This was not previously possible, as `ListObjects` returned no
information about an object's restore status, only its storage class.
A new `FileStatusAcceptor` implementation uses the `RestoreStatus` attribute of
the `S3Object` to filter out or include Glacier objects from the listing, as
defined by the config. `FileStatusAcceptor` is an interface with three
overloaded predicates, which filter the files based on the conditions defined
in those predicates. The `RestoreStatus` attribute is taken from the
`ListObjects` response; it indicates whether an object is unrestored,
restoring, or restored, and, where applicable, when the restore expires.
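The filtering decision implied by the three config values can be sketched as
follows. This is an illustrative model only, not the PR's actual code: the
names `ReadPolicy` and `shouldAccept` are hypothetical, and the booleans stand
in for the storage-class and `RestoreStatus` checks described above.

```
public class GlacierPolicySketch {

  // Mirrors the three values of fs.s3a.glacier.read-restored-objects.
  enum ReadPolicy { READ_ALL, SKIP_ALL_GLACIER, READ_RESTORED_GLACIER_OBJECTS }

  /**
   * Decide whether a listed object should be surfaced to the caller.
   *
   * @param policy           configured policy
   * @param isGlacierClass   true for Glacier / Deep Archive storage classes
   * @param restoreCompleted true when the restore status reports a finished restore
   */
  static boolean shouldAccept(ReadPolicy policy,
                              boolean isGlacierClass,
                              boolean restoreCompleted) {
    if (!isGlacierClass) {
      return true;                      // non-Glacier objects are always readable
    }
    switch (policy) {
      case READ_ALL:                    // current default: no filtering at all
        return true;
      case SKIP_ALL_GLACIER:            // drop every Glacier-class object
        return false;
      case READ_RESTORED_GLACIER_OBJECTS:
      default:                          // keep only fully restored copies
        return restoreCompleted;
    }
  }
}
```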
### How was this patch tested?
#### Integration Tests (hadoop-aws)
All integration tests pass. They were run in the `us-east-1` region, in
accordance with
https://hadoop.apache.org/docs/current2/hadoop-aws/tools/hadoop-aws/testing.html.
Two failures were observed; they appear intermittent and unrelated to this
change, since the default behavior of `S3AFileSystem` was not modified:
```
ITestS3ACommitterFactory.testEverything
ITestS3AConfiguration.testRequestTimeout
```
#### Manual Testing
Manual testing of the change was done with Spark v3.5.
A Parquet table was created in spark-sql using the following statements:
```
CREATE DATABASE IF NOT EXISTS glacier_test LOCATION "s3a://<bucket>/data/glacier_test";
USE glacier_test;
CREATE TABLE IF NOT EXISTS parquet_glacier_test (id int, data string) USING parquet LOCATION "s3a://<bucket>/data/glacier_test/parquet_glacier_test";
INSERT INTO parquet_glacier_test VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO parquet_glacier_test VALUES (4, 'a'), (5, 'b'), (6, 'c');
INSERT INTO parquet_glacier_test VALUES (7, 'a'), (8, 'b'), (9, 'c');
```
The data was successfully retrieved with the following query:
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
| 7| a|
| 8| b|
| 9| c|
| 4| a|
| 5| b|
| 6| c|
| 1| a|
| 2| b|
| 3| c|
+---+----+
```
The storage class of the file
`s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet`
was then changed from `Standard` to `Glacier Flexible Retrieval (formerly
Glacier)`.
When trying to read the data from the same table again, the following exception
was observed:
```
software.amazon.awssdk.services.s3.model.InvalidObjectStateException: The operation is not valid for the object's storage class (Service: S3, Status Code: 403, Request ID: X05JDR633AAK4TBQ, Extended Request ID: uOxWdN4giUAuB9a4YWvnyrXPYCi2U35P5BrHhFO3aLSLLe4GtWhXGXCEJ/Ld5EyGr5b6VezTzeI=):InvalidObjectState
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:243)
    at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:149)
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:278)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:425)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
    at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:417)
    at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:536)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
```
The spark-sql session was restarted with the following config:
```
spark-sql --conf spark.hadoop.fs.s3a.glacier.read-restored-objects=SKIP_ALL_GLACIER
```
Querying the table now returned the following:
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
| 7| a|
| 8| b|
| 9| c|
| 4| a|
| 5| b|
| 6| c|
+---+----+
```
The spark-sql session was then restarted with the following config:
```
spark-sql --conf spark.hadoop.fs.s3a.glacier.read-restored-objects=READ_RESTORED_GLACIER_OBJECTS
```
Querying the table now produced the same result as the previous step, as the
unrestored Glacier file was ignored when the table was read.
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
| 7| a|
| 8| b|
| 9| c|
| 4| a|
| 5| b|
| 6| c|
+---+----+
```
The restore for the file
`s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet`
was initiated using the S3 Console.
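For reference, the equivalent restore could also be initiated from the AWS CLI
(the `Days` and `Tier` values here are illustrative, not the ones used in this
test):

```
aws s3api restore-object \
  --bucket <bucket> \
  --key data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet \
  --restore-request '{"Days": 1, "GlacierJobParameters": {"Tier": "Expedited"}}'
```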
Querying the table at this point still produced the same result as the previous
step, as the Glacier file was still being restored and not yet available.
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
| 7| a|
| 8| b|
| 9| c|
| 4| a|
| 5| b|
| 6| c|
+---+----+
```
On retrying after 5-7 minutes (it was an expedited retrieval), the query
returned the full table, as expected:
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
| 7| a|
| 8| b|
| 9| c|
| 4| a|
| 5| b|
| 6| c|
| 1| a|
| 2| b|
| 3| c|
+---+----+
```
### For code changes:
- [Yes] Does the title of this PR start with the corresponding JIRA issue
id (e.g. 'HADOOP-17799. Your PR title ...')?
- [Yes] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [NA] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [NA] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> Handle S3A "glacier" data
> -------------------------
>
> Key: HADOOP-14837
> URL: https://issues.apache.org/jira/browse/HADOOP-14837
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Steve Loughran
> Priority: Minor
>
> SPARK-21797 covers how if you have AWS S3 set to copy some files to glacier,
> they appear in the listing but GETs fail, and so does everything else
> We should think about how best to handle this.
> # report better
> # if listings can identify files which are glaciated then maybe we could have
> an option to filter them out
> # test & see what happens
--
This message was sent by Atlassian Jira
(v8.20.10#820010)