yihua opened a new pull request, #7404:
URL: https://github.com/apache/hudi/pull/7404

   ### Change Logs
   
   Instantiating Hudi's file system view also instantiates 
`HFileBootstrapIndex`, which makes two `fs.exists` calls to check whether the 
bootstrap index is present.  These calls are unnecessary for the file system 
view built to read the metadata table, because the metadata table never uses a 
bootstrap index.
   
   This PR adds a check on the base path of the table in `HFileBootstrapIndex` 
and avoids the `fs.exists` calls if it is a metadata table.
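   
   As a rough illustration of the check described above (not the actual patch), 
the sketch below skips both existence checks when the base path looks like a 
metadata table path.  The class name, the `exists` predicate standing in for 
`fs.exists`, the index file names, and the assumption that a metadata table 
base path ends with `/.hoodie/metadata` (as in the log paths in this 
description) are all illustrative.

```java
import java.util.function.Predicate;

// Hypothetical sketch; this is not the real HFileBootstrapIndex.
class BootstrapIndexCheckSketch {

    // Assumption: the metadata table lives at "<data table base path>/.hoodie/metadata".
    static final String METADATA_TABLE_SUFFIX = "/.hoodie/metadata";

    // Pure string check on the base path, so it costs no file system call.
    static boolean isMetadataTable(String basePath) {
        String trimmed = basePath.endsWith("/")
            ? basePath.substring(0, basePath.length() - 1)
            : basePath;
        return trimmed.endsWith(METADATA_TABLE_SUFFIX);
    }

    // "exists" stands in for fs.exists; on S3 each call can mean several requests.
    static boolean hasBootstrapIndex(String basePath, Predicate<String> exists) {
        if (isMetadataTable(basePath)) {
            // The metadata table never uses a bootstrap index: skip both checks.
            return false;
        }
        // Illustrative index locations; the file names are placeholders.
        String indexDir = basePath + "/.hoodie/.aux/.bootstrap";
        boolean partitionIndexExists = exists.test(indexDir + "/.partitions/index.hfile");
        boolean fileIdIndexExists = exists.test(indexDir + "/.fileids/index.hfile");
        return partitionIndexExists && fileIdIndexExists;
    }
}
```

   With this shape, building the view for a path such as 
`<redacted>/store_sales/.hoodie/metadata` never invokes `exists`, while a 
regular data table path still triggers both checks.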
   
   Below is an example log from Presto showing the FS calls to S3 when 
instantiating `HFileBootstrapIndex`.
   ```
   2022-11-24T22:06:42.979Z     DEBUG   hive-hive-1     com.amazonaws.request   
Sending Request: HEAD https://<redacted>.s3.us-east-2.amazonaws.com 
<redacted>/store_sales/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile
 Headers: (amz-sdk-invocation-id: 45caf5e0-6647-d12d-f40b-eabe66add479, 
Content-Type: application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 
Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 
java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
   2022-11-24T22:06:42.989Z     DEBUG   hive-hive-1     com.amazonaws.request   
Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Not 
Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request 
ID: G9DQ3ZB656TBSPXK; S3 Extended Request ID: 
XLcukfeUa9gmmVSEWpk3ciemV5lhiGcf8gxkewhlmJVNV6sZGqAl0Pi7o4H7LTzAFQKZDVVditQ=), 
S3 Extended Request ID: 
XLcukfeUa9gmmVSEWpk3ciemV5lhiGcf8gxkewhlmJVNV6sZGqAl0Pi7o4H7LTzAFQKZDVVditQ=
   2022-11-24T22:06:42.990Z     DEBUG   hive-hive-1     com.amazonaws.request   
Sending Request: HEAD https://<redacted>.s3.us-east-2.amazonaws.com 
<redacted>/store_sales/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/
 Headers: (amz-sdk-invocation-id: 31a4b33c-a381-054d-5323-b41181be1a04, 
Content-Type: application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 
Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 
java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
   2022-11-24T22:06:43.000Z     DEBUG   hive-hive-1     com.amazonaws.request   
Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Not 
Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request 
ID: G9DTM2Z9MYSQBV7G; S3 Extended Request ID: 
m8M6/eGdNShGwOccPoJfMFdgZLtUQ0esU20ZIfszLUSRJsv0NX+dYtcPLBa+4ucNyfHrvf9RL7Y=), 
S3 Extended Request ID: 
m8M6/eGdNShGwOccPoJfMFdgZLtUQ0esU20ZIfszLUSRJsv0NX+dYtcPLBa+4ucNyfHrvf9RL7Y=
   2022-11-24T22:06:43.000Z     DEBUG   hive-hive-1     com.amazonaws.request   
Sending Request: GET https://<redacted>.s3.us-east-2.amazonaws.com / 
Parameters: 
({"prefix":["benchmarks/tpc-ds/hudi/1TB/store_sales/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/"],"delimiter":["/"],"max-keys":["1"],"encoding-type":["url"]}Headers:
 (amz-sdk-invocation-id: 45e2ddc4-aa04-1ec8-9181-e66555efb874, Content-Type: 
application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 
Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 
java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
   2022-11-24T22:06:43.013Z     DEBUG   hive-hive-1     com.amazonaws.request   
Received successful response: 200, AWS Request ID: Y4KXTF61RZAM1D6N
   ```
   
   ### Impact
   
   This PR avoids the `fs.exists` calls and reduces the latency of 
instantiating the file system view for the metadata table.  With S3 as the 
storage, the 3 requests shown above are avoided, which saves at least 50ms.
   
   This affects metadata-table-based file listing of partitions in the Presto 
Hive and Hudi connectors.  This performance fix shaves 10+ seconds off listing 
~1800 partitions in a Presto query with the metadata table enabled.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

