yh2388 commented on issue #2609: URL: https://github.com/apache/hudi/issues/2609#issuecomment-1139392503
> @lw309637554 thanks for your response.
>
> **_1. about first attempt parquet is 23 secs, but hudi is 40 secs. i see metadata init cost some time in the log._**
>
> Yes, the two major costs are in metadata loading. Is that expected, or can anything be optimized?
>
> About 20 seconds are spent in this section:
>
> ```
> 2021-02-27T04:27:24.714Z INFO hive-hive-18 org.apache.hudi.hadoop.utils.HoodieInputFormatUtils Total paths to process after hoodie filter 691
> 2021-02-27T04:27:45.364Z INFO hive-hive-17 org.apache.hudi.hadoop.utils.HoodieInputFormatUtils Reading hoodie metadata from path s3a://my-test-bucket/tmp/ramesh/hudi_0_7_cl2/sample_data
> ```
>
> Another 15 seconds go here:
>
> ```
> 2021-02-27T04:27:46.360Z INFO hive-hive-17 org.apache.hudi.hadoop.utils.HoodieInputFormatUtils Total paths to process after hoodie filter 623
> 2021-02-27T04:28:02.931Z DEBUG query-execution-16 io.prestosql.execution.StageStateMachine Stage 20210227_042722_00016_9dket.2 is SCHEDULED
> ```
>
> **2. about second attempt parquet is very fast, maybe presto supports a local cache for the parquet format.**
>
> It does seem like local caching. I will look into how Presto's local cache works.
>
> **3. also parquet and hudi table results are not equal?**
>
> Both are the same dataset; sorry, the result order is not maintained. The Hudi dataset has 151 fewer rows because duplicate rows were eliminated during ingestion.

We have the same problem. Has this problem been solved? I hope to get your help.
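To illustrate why the Hudi table can legitimately hold fewer rows than the raw Parquet input (point 3 above): Hudi upserts are keyed by a record key, so input rows sharing a key collapse into a single record, typically keeping the one with the highest precombine value. Below is a minimal pure-Python sketch of that semantics; the field names `key`, `ts`, and `val` are illustrative only and not actual Hudi configuration names.

```python
def dedup_by_key(rows):
    """Keep one row per key, preferring the highest "ts" (precombine) value.

    This mimics, in plain Python, the deduplication that happens during a
    keyed ingestion: duplicate keys collapse to a single surviving record.
    """
    latest = {}
    for row in rows:
        key = row["key"]
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    return list(latest.values())

# Three input rows, but keys "a" repeat, so the "table" ends up with two.
raw = [
    {"key": "a", "ts": 1, "val": 10},
    {"key": "a", "ts": 2, "val": 11},  # duplicate key; newer ts wins
    {"key": "b", "ts": 1, "val": 20},
]
deduped = dedup_by_key(raw)
```

The same effect explains the 151-row difference reported above: a plain Parquet table retains every input row, while a keyed ingestion retains one row per key.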
