[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

GitBox Fri, 31 Jul 2020 21:23:43 -0700


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-667466538



   Tested on a 100GB MOR table. A few partitions have log files made up 
entirely of duplicate upserts; the others have parquet files only.
   For parquet-only partitions, the `SNAPSHOT` query is as efficient as 
the `READ_OPTIMIZED` query. File splits that include log files are expensive, 
as expected.
   For one 50MB parquet file, the log file was ~1GB. Each file split is 
loaded as one task.
   Count performance for 50MB parquet + 1GB log:
   merge: 40s
   unmerge: 40s
   Show performance: data source V1 doesn't support `limit()`, so the 
query scans the whole file.
   without column pruning: df_mor.show(10) took 40s
   with column pruning: df_mor.select("_hoodie_commit_time").show(10) took 27s
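
   For anyone who wants to repeat these measurements, a rough sketch of how
   they could be timed from PySpark follows. It is not the exact script behind
   the numbers above. `timed`, `load_mor`, `run_benchmark`, and `base_path` are
   hypothetical names introduced here for illustration; the sketch assumes the
   query type is selected through the `hoodie.datasource.query.type` read
   option and that `spark` is an existing SparkSession with the Hudi bundle on
   the classpath.

```python
import time

# Hudi read option used to pick the query type (assumed to apply to this build).
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
SNAPSHOT = "snapshot"
READ_OPTIMIZED = "read_optimized"


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def load_mor(spark, base_path, query_type):
    """Load the table with the given query type.

    base_path is a hypothetical path (or glob) pointing at the MOR table.
    """
    return (spark.read.format("org.apache.hudi")
            .option(QUERY_TYPE_KEY, query_type)
            .load(base_path))


def run_benchmark(spark, base_path):
    """Time count() and show(10), with and without column pruning."""
    df_mor = load_mor(spark, base_path, SNAPSHOT)
    timings = {}
    _, timings["count"] = timed(df_mor.count)
    _, timings["show"] = timed(lambda: df_mor.show(10))
    _, timings["show_pruned"] = timed(
        lambda: df_mor.select("_hoodie_commit_time").show(10))
    return timings
```

   Passing `READ_OPTIMIZED` to `load_mor` instead of `SNAPSHOT` gives the
   baseline for the parquet-only comparison.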
   @vinothchandar @umehrot2 @bvaradar 


