bpahuja opened a new pull request, #6407: URL: https://github.com/apache/hadoop/pull/6407
### Description of PR

Currently S3A does not distinguish Glacier and Glacier Deep Archive objects from other objects: it does not examine the storage class, nor does it check whether an object is still being restored from Glacier. Attempting to read an in-progress Glacier object through S3A therefore fails with an `AmazonS3Exception` stating that the operation is not valid for the object's storage class.

With this change, users will be able to read restored Glacier objects from the S3 location of a table through S3A, while Glacier objects that are still being restored asynchronously are skipped. The existing default behavior is unchanged, and an additional configuration setting is needed to enable the new flow.

The setting that controls how `S3AFileSystem` treats Glacier storage classes is `fs.s3a.glacier.read-restored-objects`. It can take three values:

- `READ_ALL`: the current default behavior; storage classes returned by S3 are not taken into account, so the existing user experience is preserved.
- `SKIP_ALL_GLACIER`: any S3 objects tagged with a Glacier storage class are skipped; all other objects are retrieved.
- `READ_RESTORED_GLACIER_OBJECTS`: the restore status of each Glacier object is checked. Restored objects are read like normal S3 objects; unrestored objects are skipped, since their data has not yet been retrieved from S3 Glacier. (The check uses the newly introduced `RestoreStatus` field on the `S3Object`.)

This was not previously possible, as `ListObjects` did not return any information about an object's restore status, only its storage class.
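To make the three policies concrete, here is a minimal, self-contained sketch of the listing decision. All type and method names are stand-ins for illustration, not the PR's actual classes:

```java
// Illustrative sketch of the three fs.s3a.glacier.read-restored-objects
// policies; every name here is a stand-in, not code from this PR.
public class GlacierReadPolicySketch {

  enum Policy { READ_ALL, SKIP_ALL_GLACIER, READ_RESTORED_GLACIER_OBJECTS }

  /**
   * Should a listed object be surfaced to the caller?
   * @param isGlacier        object has a Glacier / Deep Archive storage class
   * @param restoreComplete  a restore has finished, so the data is readable
   */
  static boolean accept(Policy policy, boolean isGlacier, boolean restoreComplete) {
    switch (policy) {
      case SKIP_ALL_GLACIER:
        return !isGlacier;                     // drop every Glacier object
      case READ_RESTORED_GLACIER_OBJECTS:
        return !isGlacier || restoreComplete;  // keep Glacier objects only once restored
      default:                                 // READ_ALL: ignore storage class
        return true;
    }
  }

  public static void main(String[] args) {
    // A restored Glacier object under each policy:
    for (Policy p : Policy.values()) {
      System.out.println(p + " -> " + accept(p, true, true));
    }
  }
}
```

The key point the sketch shows is that only `READ_RESTORED_GLACIER_OBJECTS` needs the per-object restore status; the other two policies can decide from the storage class alone.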
A new `FileStatusAcceptor` implementation uses the `RestoreStatus` attribute of the `S3Object` to include or filter out Glacier objects from listings, as directed by the configuration. `FileStatusAcceptor` is an interface with three overloaded predicates, which filter files based on the conditions defined in those predicates. The new `RestoreStatus` attribute from the `ListObjects` response indicates whether an object is unrestored, restoring, or restored, and when a completed restore expires.

### How was this patch tested?

#### Integration Tests (hadoop-aws)

All integration tests pass; they were run in accordance with https://hadoop.apache.org/docs/current2/hadoop-aws/tools/hadoop-aws/testing.html, in the region `us-east-1`. Two failures were observed which appear intermittent and unrelated to the change introduced in this PR, as the default behavior of `S3AFileSystem` was not changed:

```
ITestS3ACommitterFactory.testEverything
ITestS3AConfiguration.testRequestTimeout
```

#### Manual Testing

Manual testing of the change was done with Spark v3.5. A Parquet table was created in Spark-SQL as follows:

```
CREATE DATABASE IF NOT EXISTS glacier_test location "s3a://<bucket>/data/glacier_test";
USE glacier_test;
CREATE TABLE IF NOT EXISTS parquet_glacier_test (id int, data string) using parquet location "s3a://<bucket>/data/glacier_test/parquet_glacier_test";
INSERT INTO parquet_glacier_test VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO parquet_glacier_test VALUES (4, 'a'), (5, 'b'), (6, 'c');
INSERT INTO parquet_glacier_test VALUES (7, 'a'), (8, 'b'), (9, 'c');
```

The data could then be retrieved successfully:
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
|  1|   a|
|  2|   b|
|  3|   c|
+---+----+
```

The storage class of the file `s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet` was then changed from `Standard` to `Glacier Flexible Retrieval (formerly Glacier)`. Trying to retrieve the data again from the same table produced the following exception:

```
software.amazon.awssdk.services.s3.model.InvalidObjectStateException: The operation is not valid for the object's storage class (Service: S3, Status Code: 403, Request ID: X05JDR633AAK4TBQ, Extended Request ID: uOxWdN4giUAuB9a4YWvnyrXPYCi2U35P5BrHhFO3aLSLLe4GtWhXGXCEJ/Ld5EyGr5b6VezTzeI=):InvalidObjectState
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:243)
	at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:149)
	at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:278)
	at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:425)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
	at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
	at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
	at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
	at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:417)
	at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:536)
	at java.io.DataInputStream.readFully(DataInputStream.java:195)
```

The spark-sql session was then restarted with the following config:
```
spark-sql --conf spark.hadoop.fs.s3a.glacier.read-restored-objects=SKIP_ALL_GLACIER
```

Accessing the table now gave the following, with the Glacier file skipped:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
+---+----+
```

The spark-sql session was then restarted with the following config:

```
spark-sql --conf spark.hadoop.fs.s3a.glacier.read-restored-objects=READ_RESTORED_GLACIER_OBJECTS
```

Accessing the table now gave the same result as the previous step, since the unrestored Glacier file was skipped when the table was read:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
+---+----+
```

A restore of the file `s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet` was then initiated from the S3 console. Accessing the table still gave the same result, as the Glacier file was still being restored and not yet available:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
+---+----+
```

On retrying after 5-7 minutes (an expedited retrieval was used), the following result was returned, as expected:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
|  1|   a|
|  2|   b|
|  3|   c|
+---+----+
```

### For code changes:

- [Yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [Yes] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [NA] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [NA] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
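As background for the manual restore step above: for a single object, S3 also reports restore progress in the `x-amz-restore` header of a HeadObject response; the list-time `RestoreStatus` used by this PR surfaces the same information during listings. The helper below is purely hypothetical and only illustrates the documented header format, it is not part of this PR:

```java
// Hypothetical helper, not part of this PR. It interprets the x-amz-restore
// header S3 returns on HeadObject for Glacier objects, which looks like:
//   ongoing-request="true"
//   ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"
public class RestoreHeaderSketch {

  /** True once a restore has completed and the temporary copy is readable. */
  static boolean isRestored(String xAmzRestore) {
    return xAmzRestore != null && xAmzRestore.contains("ongoing-request=\"false\"");
  }

  /** True while the restore is still in flight (the manual test's 5-7 minute wait). */
  static boolean isRestoreInProgress(String xAmzRestore) {
    return xAmzRestore != null && xAmzRestore.contains("ongoing-request=\"true\"");
  }

  public static void main(String[] args) {
    String inFlight = "ongoing-request=\"true\"";
    String done = "ongoing-request=\"false\", expiry-date=\"Fri, 21 Dec 2012 00:00:00 GMT\"";
    System.out.println(isRestoreInProgress(inFlight)); // true
    System.out.println(isRestored(done));              // true
  }
}
```

A missing header (null) means the object either is not in Glacier or has never had a restore requested, which is why both checks treat null as false.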
