garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-667466538
Tested on a 100GB MOR table. A few partitions have log files made up
entirely of duplicate upserts; the other partitions have parquet files only.
For partitions with parquet files only, the `SNAPSHOT` query is as efficient
as the `READ_OPTIMIZED` query. File splits with log files are expensive, but
that is expected.
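A toy sketch in plain Python of why a split with log files costs more than a parquet-only one: the snapshot read has to replay every log record over the base file by record key, even when the log holds only duplicate upserts. The function names and the dict-based row layout are hypothetical, for illustration only; they are not Hudi APIs.

```python
def read_optimized_read(base_rows):
    """Return the base file rows only; log records are ignored."""
    return list(base_rows)


def snapshot_read(base_rows, log_rows, key="key"):
    """Replay log records over base rows by record key; the latest record wins."""
    merged = {row[key]: row for row in base_rows}
    for row in log_rows:  # every log record is visited, even pure duplicates
        merged[row[key]] = row
    return list(merged.values())


base = [{"key": i, "value": "v0"} for i in range(5)]
# 100% duplicate upserts: every log record rewrites an existing key
log = [{"key": i, "value": "v1"} for i in range(5)]

print(len(read_optimized_read(base)))  # 5
print(len(snapshot_read(base, log)))   # 5, same row count but more work
print(snapshot_read(base, []) == read_optimized_read(base))  # True
```

With an empty log the two reads return the same rows, which matches the parquet-only partitions above; the extra cost appears only when log records have to be replayed.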
For one 50MB parquet file, the log file was ~1GB. Each file split was
loaded as one task.
Count performance for 50MB parquet + 1GB log:
merge: 40s
unmerge: 40s
Show performance. Because data source V1 doesn't support `limit()`, it
just scans the whole file.
without column pruning: `df_mor.show(10)` took 40s
with column pruning: `df_mor.select("_hoodie_commit_time").show(10)` took 27s
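A toy model in plain Python of the `show(10)` behaviour described above: without a limit reaching the scan, the whole split is read before the first 10 rows are kept. This is not how Spark is implemented; `scan`, `show_without_pushdown` and `show_with_pushdown` are hypothetical names used only to count rows read.

```python
from itertools import islice


def scan(rows, counter):
    """Yield rows one at a time, counting how many were actually read."""
    for row in rows:
        counter["read"] += 1
        yield row


def show_without_pushdown(rows, n, counter):
    """Read the whole split first, then keep the first n rows."""
    everything = list(scan(rows, counter))
    return everything[:n]


def show_with_pushdown(rows, n, counter):
    """Stop reading as soon as n rows are available."""
    return list(islice(scan(rows, counter), n))


rows = range(1_000_000)

full = {"read": 0}
show_without_pushdown(rows, 10, full)

early = {"read": 0}
show_with_pushdown(rows, 10, early)

print(full["read"], early["read"])  # 1000000 10
```

Both calls return the same 10 rows; only the number of rows read differs.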
@vinothchandar @umehrot2 @bvaradar
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]