Lijie Wang created FLINK-27078:
----------------------------------
Summary: There is a performance gap between the new csv
source (file system source + CSV format) and legacy CsvTableSource.
Key: FLINK-27078
URL: https://issues.apache.org/jira/browse/FLINK-27078
Project: Flink
Issue Type: Improvement
Affects Versions: 1.15.0
Reporter: Lijie Wang
In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source.
We found that after switching to the new source, the TPCDS e2e tests run slower
than before: they used to take 20 minutes and now take 30 minutes.
We found that this is mainly because the new csv source is slower than the
legacy {{CsvTableSource}}. We ran an experiment to verify it: run the code in
[^PerformanceTest.java] to read a csv file of about 3.8G (store_sales.dat of
TPCDS-10G, which can be generated by
{{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen]
-SCALE 10 -FORCE Y -DIR ...}}), and you will find that the job running times
differ significantly: on my computer, the job runs for 50s with the new csv
source and 20s with the legacy {{CsvTableSource}}.
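To put the numbers above in perspective, here is a minimal sketch that turns the
reported run times into a slowdown factor and an approximate read throughput.
It is not part of [^PerformanceTest.java]; the class name {{CsvSourceGap}} and
its methods are hypothetical, and it assumes "3.8G" means roughly 3.8 GiB.

```java
// Hypothetical helper, not taken from PerformanceTest.java.
public class CsvSourceGap {

    // How many times longer the new source runs compared to the legacy one.
    static double slowdown(double newSeconds, double legacySeconds) {
        return newSeconds / legacySeconds;
    }

    // Average read throughput in MiB/s for a file of the given size in GiB.
    static double throughputMiBPerSec(double fileGiB, double seconds) {
        return fileGiB * 1024.0 / seconds;
    }

    public static void main(String[] args) {
        double fileGiB = 3.8;         // approximate size of store_sales.dat (assumed GiB)
        double newSeconds = 50.0;     // reported run time with the new csv source
        double legacySeconds = 20.0;  // reported run time with the legacy CsvTableSource

        System.out.printf("slowdown: %.2fx%n", slowdown(newSeconds, legacySeconds));
        System.out.printf("new source: %.1f MiB/s%n",
                throughputMiBPerSec(fileGiB, newSeconds));
        System.out.printf("legacy source: %.1f MiB/s%n",
                throughputMiBPerSec(fileGiB, legacySeconds));
    }
}
```

With the figures reported here, the single-source experiment shows a larger
relative gap (50s vs 20s) than the full TPCDS e2e run (30 vs 20 minutes), which
is consistent with source reading being only one part of the e2e test time.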
--
This message was sent by Atlassian Jira
(v8.20.1#820001)