bpahuja opened a new pull request, #6407: URL: https://github.com/apache/hadoop/pull/6407
### Description of PR

Currently S3A does not distinguish Glacier and Glacier Deep Archive objects from other objects: it does not examine the storage class, nor does it check whether an object is still being restored from Glacier. Attempting to read an in-progress Glacier object through S3A therefore fails with an `AmazonS3Exception` stating that the operation is not valid for the object's storage class.

With this change, users will be able to read restored Glacier objects from the S3 location of a table through S3A, while Glacier objects that are still being restored asynchronously are skipped. The existing default behavior is unchanged, and an additional configuration setting is needed to enable the new flow.

The setting that controls how `S3AFileSystem` treats Glacier storage classes is `fs.s3a.glacier.read-restored-objects`. It can take three values:

- `READ_ALL`: the current default behavior; storage classes returned by S3 are not taken into account, so the existing user experience is preserved.
- `SKIP_ALL_GLACIER`: any S3 objects tagged with a Glacier storage class are skipped; all other objects are retrieved.
- `READ_RESTORED_GLACIER_OBJECTS`: the restore status of each Glacier object is checked. Restored objects are read like normal S3 objects; unrestored objects are skipped, since their data has not yet been retrieved from S3 Glacier. (The check uses the newly introduced `RestoreStatus` field on the `S3Object`.)

This was not previously possible, as `ListObjects` did not return any information about an object's restore status, only its storage class.
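To make the three policies concrete, here is a minimal, self-contained sketch of the listing decision. All type and method names are stand-ins for illustration, not the PR's actual classes:

```java
// Illustrative sketch of the three fs.s3a.glacier.read-restored-objects
// policies; every name here is a stand-in, not code from this PR.
public class GlacierReadPolicySketch {

  enum Policy { READ_ALL, SKIP_ALL_GLACIER, READ_RESTORED_GLACIER_OBJECTS }

  /**
   * Should a listed object be surfaced to the caller?
   * @param isGlacier        object has a Glacier / Deep Archive storage class
   * @param restoreComplete  a restore has finished, so the data is readable
   */
  static boolean accept(Policy policy, boolean isGlacier, boolean restoreComplete) {
    switch (policy) {
      case SKIP_ALL_GLACIER:
        return !isGlacier;                     // drop every Glacier object
      case READ_RESTORED_GLACIER_OBJECTS:
        return !isGlacier || restoreComplete;  // keep Glacier objects only once restored
      default:                                 // READ_ALL: ignore storage class
        return true;
    }
  }

  public static void main(String[] args) {
    // A restored Glacier object under each policy:
    for (Policy p : Policy.values()) {
      System.out.println(p + " -> " + accept(p, true, true));
    }
  }
}
```

The key point the sketch shows is that only `READ_RESTORED_GLACIER_OBJECTS` needs the per-object restore status; the other two policies can decide from the storage class alone.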
A new `FileStatusAcceptor` implementation uses the `RestoreStatus` attribute of the `S3Object` to include or filter out Glacier objects from listings, as directed by the configuration. `FileStatusAcceptor` is an interface with three overloaded predicates, which filter files based on the conditions defined in those predicates. The new `RestoreStatus` attribute from the `ListObjects` response indicates whether an object is unrestored, restoring, or restored, and when a completed restore expires.

### How was this patch tested?

#### Integration Tests (hadoop-aws)

All integration tests pass; they were run in accordance with https://hadoop.apache.org/docs/current2/hadoop-aws/tools/hadoop-aws/testing.html, in the region `us-east-1`. Two failures were observed which appear intermittent and unrelated to the change introduced in this PR, as the default behavior of `S3AFileSystem` was not changed:

```
ITestS3ACommitterFactory.testEverything
ITestS3AConfiguration.testRequestTimeout
```

#### Manual Testing

Manual testing of the change was done with Spark v3.5. A Parquet table was created in Spark-SQL as follows:

```
CREATE DATABASE IF NOT EXISTS glacier_test location "s3a://<bucket>/data/glacier_test";
USE glacier_test;
CREATE TABLE IF NOT EXISTS parquet_glacier_test (id int, data string) using parquet location "s3a://<bucket>/data/glacier_test/parquet_glacier_test";
INSERT INTO parquet_glacier_test VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO parquet_glacier_test VALUES (4, 'a'), (5, 'b'), (6, 'c');
INSERT INTO parquet_glacier_test VALUES (7, 'a'), (8, 'b'), (9, 'c');
```

The data could then be retrieved successfully:
```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
|  1|   a|
|  2|   b|
|  3|   c|
+---+----+
```

The storage class of the file `s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet` was then changed from `Standard` to `Glacier Flexible Retrieval (formerly Glacier)`. Trying to retrieve the data again from the same table produced the following exception:

```
software.amazon.awssdk.services.s3.model.InvalidObjectStateException: The operation is not valid for the object's storage class (Service: S3, Status Code: 403, Request ID: X05JDR633AAK4TBQ, Extended Request ID: uOxWdN4giUAuB9a4YWvnyrXPYCi2U35P5BrHhFO3aLSLLe4GtWhXGXCEJ/Ld5EyGr5b6VezTzeI=):InvalidObjectState
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:243)
	at org.apache.hadoop.fs.s3a.Invoker.onceTrackingDuration(Invoker.java:149)
	at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:278)
	at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:425)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
	at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
	at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
	at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
	at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:417)
	at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:536)
	at java.io.DataInputStream.readFully(DataInputStream.java:195)
```

The spark-sql session was then restarted with the following config:
```
spark-sql --conf spark.hadoop.fs.s3a.glacier.read-restored-objects=SKIP_ALL_GLACIER
```

Accessing the table now gave the following, with the Glacier file skipped:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
+---+----+
```

The spark-sql session was then restarted with the following config:

```
spark-sql --conf spark.hadoop.fs.s3a.glacier.read-restored-objects=READ_RESTORED_GLACIER_OBJECTS
```

Accessing the table now gave the same result as the previous step, since the unrestored Glacier file was skipped when the table was read:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
+---+----+
```

A restore of the file `s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet` was then initiated from the S3 console. Accessing the table still gave the same result, as the Glacier file was still being restored and not yet available:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
+---+----+
```

On retrying after 5-7 minutes (an expedited retrieval was used), the following result was returned, as expected:

```
SELECT * FROM parquet_glacier_test;
+---+----+
| id|data|
+---+----+
|  7|   a|
|  8|   b|
|  9|   c|
|  4|   a|
|  5|   b|
|  6|   c|
|  1|   a|
|  2|   b|
|  3|   c|
+---+----+
```

### For code changes:

- [Yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [Yes] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [NA] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [NA] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
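As background for the manual restore step above: for a single object, S3 also reports restore progress in the `x-amz-restore` header of a HeadObject response; the list-time `RestoreStatus` used by this PR surfaces the same information during listings. The helper below is purely hypothetical and only illustrates the documented header format, it is not part of this PR:

```java
// Hypothetical helper, not part of this PR. It interprets the x-amz-restore
// header S3 returns on HeadObject for Glacier objects, which looks like:
//   ongoing-request="true"
//   ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"
public class RestoreHeaderSketch {

  /** True once a restore has completed and the temporary copy is readable. */
  static boolean isRestored(String xAmzRestore) {
    return xAmzRestore != null && xAmzRestore.contains("ongoing-request=\"false\"");
  }

  /** True while the restore is still in flight (the manual test's 5-7 minute wait). */
  static boolean isRestoreInProgress(String xAmzRestore) {
    return xAmzRestore != null && xAmzRestore.contains("ongoing-request=\"true\"");
  }

  public static void main(String[] args) {
    String inFlight = "ongoing-request=\"true\"";
    String done = "ongoing-request=\"false\", expiry-date=\"Fri, 21 Dec 2012 00:00:00 GMT\"";
    System.out.println(isRestoreInProgress(inFlight)); // true
    System.out.println(isRestored(done));              // true
  }
}
```

A missing header (null) means the object either is not in Glacier or has never had a restore requested, which is why both checks treat null as false.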
