[GitHub] [hudi] jtmzheng opened a new issue, #5514: [SUPPORT] Read optimized query on MOR table lists files without any Spark action

GitBox Thu, 05 May 2022 23:31:35 -0700


jtmzheng opened a new issue, #5514:
URL: https://github.com/apache/hudi/issues/5514


   I'm seeing unexpected behavior: a `read_optimized` Spark query on a MOR 
table takes ~30 minutes even though no action is invoked (this is on Hudi 0.9.0 
with the metadata table not enabled):
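   ```
   from datetime import datetime

   # Time only the DataFrame construction; no action (count, collect, show) is called.
   start_time = datetime.now()
   read_options = {"hoodie.datasource.query.type": "read_optimized"}
   df = (
       spark.read.format("hudi")
       .options(**read_options)
       .load("{table_s3_path}")  # placeholder for the table's S3 path
   )
   print(f"Elapsed: {datetime.now() - start_time}")
   ```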
   
   ```
   Elapsed: 0:34:38.293859
   ```
   
   The same load as a snapshot query returns in ~5s, as expected, since no 
action (count, collect, show, etc.) is invoked. COW tables don't seem to be 
affected either.
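
To make the comparison above repeatable, here is a minimal sketch of a timing helper. The names `time_load` and `make_hudi_loader` are hypothetical (they are not part of Hudi or Spark); the only Hudi-specific piece is the `hoodie.datasource.query.type` option already used above. It assumes an existing `spark` session and times DataFrame construction only, with no action invoked.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable

QUERY_TYPES = ("snapshot", "read_optimized")


def time_load(
    load: Callable[[str], object],
    query_types: Iterable[str] = QUERY_TYPES,
) -> Dict[str, timedelta]:
    """Time DataFrame construction (no action) once per query type."""
    elapsed = {}
    for query_type in query_types:
        start = datetime.now()
        load(query_type)  # builds the DataFrame only; no count/collect/show
        elapsed[query_type] = datetime.now() - start
    return elapsed


def make_hudi_loader(spark, table_path: str) -> Callable[[str], object]:
    """Return a loader that reads the Hudi table with the given query type."""

    def load(query_type: str):
        return (
            spark.read.format("hudi")
            .option("hoodie.datasource.query.type", query_type)
            .load(table_path)
        )

    return load
```

With a live session this could be called as `time_load(make_hudi_loader(spark, table_path))`, where `table_path` is the table's S3 path; a large gap between the two entries would match what is described above.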
   
   Curiously, the Spark UI showed jobs being created that reference 
https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java#L73.
   
   I got help from a user on Hudi Slack: 
https://apache-hudi.slack.com/archives/C4D716NPQ/p1651784954682329 who pointed 
to:
   
   ```
   int parallelism = Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());

   List<Pair<String, FileStatus[]>> partitionToFiles = engineContext.map(partitionPaths, partitionPathStr -> {
     Path partitionPath = new Path(partitionPathStr);
     FileSystem fs = partitionPath.getFileSystem(hadoopConf.get());
     return Pair.of(partitionPathStr, FSUtils.getAllDataFilesInPartition(fs, partitionPath));
   }, parallelism);
   ```
   
   as the culprit: the read optimized query is listing all the files in the 
table. There are a lot of files, and it's not doing any partition pruning, so 
it's not surprising that this takes a while. Link: 
https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java#L119
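
To illustrate why the cost of this pattern grows with the number of partitions, here is a rough Python analogue of the Java snippet above. It is only a sketch under assumptions: it lists local directories with a thread pool instead of S3 paths through a Spark job, `list_all_partitions` is a hypothetical name, and the value of `DEFAULT_LISTING_PARALLELISM` is a placeholder rather than Hudi's actual constant.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Placeholder value for illustration only; not necessarily Hudi's constant.
DEFAULT_LISTING_PARALLELISM = 8


def list_all_partitions(partition_paths: List[str]) -> List[Tuple[str, List[str]]]:
    """List the files of every partition, with no pruning by any query filter."""
    if not partition_paths:
        return []

    # Same cap as in the Java snippet: at most one worker per partition.
    parallelism = min(DEFAULT_LISTING_PARALLELISM, len(partition_paths))

    def list_partition(path: str) -> Tuple[str, List[str]]:
        files = sorted(
            name
            for name in os.listdir(path)
            if os.path.isfile(os.path.join(path, name))
        )
        return path, files

    # Every partition is visited, so total work scales with the table size.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(list_partition, partition_paths))
```

Because every path in `partition_paths` is listed eagerly, a table with many partitions and files pays the full listing cost up front, which is consistent with the behavior reported here.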
 
   
   Can anyone provide insight on what's going on? What can I do to work around 
this?
   
   
   Steps to reproduce the behavior:
   
   1. Create a MOR table with some test data
   2. Query the table through Spark using a read optimized query **without** 
any action
   3. Verify in the Spark UI that Spark jobs are created to list the files
   
   **Expected behavior**
   
   The read optimized query does not list the files until an action is invoked 
(e.g., if you query a specific partition, it should only list the files in that 
partition).
   
   **Environment Description**
   
   * Hudi version : 0.9.0 (EMR)
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : Amazon 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   N/A
   
   **Stacktrace**
   
   N/A
   

