[
https://issues.apache.org/jira/browse/FLINK-27078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lijie Wang updated FLINK-27078:
-------------------------------
Attachment: PerformanceTest.java
> There is a performance gap between the new csv source (file system source +
> CSV format) and the legacy CsvTableSource.
> -----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-27078
> URL: https://issues.apache.org/jira/browse/FLINK-27078
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.15.0
> Reporter: Lijie Wang
> Priority: Major
> Attachments: PerformanceTest.java
>
>
> In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source.
> We found that after switching to the new source, the TPCDS e2e tests run
> slower than before: they took 20 minutes before and now take 30 minutes.
> The main cause is that the new csv source is slower than the legacy
> {{CsvTableSource}}. We ran an experiment to verify this: run the code in
> [^PerformanceTest.java] to read a csv file of about 3.8G (store_sales.dat
> of TPCDS-10G, which can be generated by
> {{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen]
> -SCALE 10 -FORCE Y -DIR ...}}). The job running times differ significantly:
> on my computer, the job runs for 50s with the new csv source and 20s with
> the legacy {{CsvTableSource}}.
>
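To put the reported numbers side by side, here is a small sketch that turns them into slowdown factors and rough read throughput. It is not the attached PerformanceTest.java. The class and method names are hypothetical, and reading "3.8G" as 3.8 GiB is an assumption, so the throughput figures are approximate.

```java
// Hypothetical helper, not part of Flink or of the attached PerformanceTest.java.
public class CsvSourceGapSketch {

    // Approximate read throughput in MiB/s for a file of sizeGiB read in `seconds`.
    static double throughputMiBPerSec(double sizeGiB, double seconds) {
        return sizeGiB * 1024.0 / seconds;
    }

    // How many times longer the new source takes compared to the legacy one.
    static double slowdown(double newDuration, double legacyDuration) {
        return newDuration / legacyDuration;
    }

    public static void main(String[] args) {
        double sizeGiB = 3.8; // assumption: "about 3.8G" read as GiB

        // Single-file experiment: 50s (new csv source) vs 20s (legacy CsvTableSource).
        System.out.println("new source MiB/s:    " + throughputMiBPerSec(sizeGiB, 50));
        System.out.println("legacy source MiB/s: " + throughputMiBPerSec(sizeGiB, 20));
        System.out.println("single-file slowdown: " + slowdown(50, 20));

        // TPCDS e2e tests: 30 minutes (new) vs 20 minutes (before).
        System.out.println("e2e slowdown: " + slowdown(30, 20));
    }
}
```

The single-file gap is larger than the end-to-end gap, which fits the description: source reading is only one part of the e2e run time.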
--
This message was sent by Atlassian Jira
(v8.20.1#820001)