[jira] [Created] (FLINK-27078) There is a performance gap between the new csv source(file system source + CSV format) and legacy CsvTableSource.

Lijie Wang (Jira) Tue, 05 Apr 2022 23:54:32 -0700

Lijie Wang created FLINK-27078:
----------------------------------

             Summary: There is a performance gap between the new csv 
source(file system source + CSV format) and legacy CsvTableSource.
                 Key: FLINK-27078
                 URL: https://issues.apache.org/jira/browse/FLINK-27078
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.15.0
            Reporter: Lijie Wang



In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source. 
We found that after switching to the new source, the TPCDS e2e tests run slower 
than before: they used to take 20 minutes and now take 30 minutes.

The main cause is that the new csv source is slower than the legacy 
{{CsvTableSource}}. We ran an experiment to verify this: run the code in 
[^PerformanceTest.java] to read a csv file of about 3.8G (store_sales.dat of 
TPCDS-10G, which can be generated with 
{{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen]
 -SCALE 10 -FORCE Y -DIR ...}}). The job running times differ significantly: on 
my computer, the job takes 50s with the new csv source and 20s with the legacy 
{{CsvTableSource}}.
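
To put the reported numbers side by side, here is a small sketch that turns 
them into a slowdown factor and a rough read throughput. It is only an 
illustration: the class and method names are hypothetical and not taken from 
PerformanceTest.java, and the throughput assumes "3.8G" means 3.8 GB with 
1 GB = 1000 MB.

```java
// Sketch only: names are hypothetical and not part of PerformanceTest.java.
final class CsvSourceGap {

    /** Ratio of the new source's run time to the legacy source's run time. */
    static double slowdown(double newSeconds, double legacySeconds) {
        return newSeconds / legacySeconds;
    }

    /** Approximate read throughput in MB/s, assuming 1 GB = 1000 MB. */
    static double throughputMBps(double fileGB, double seconds) {
        return fileGB * 1000.0 / seconds;
    }

    public static void main(String[] args) {
        // Single-file experiment reported in this issue: 50s (new) vs 20s (legacy).
        System.out.printf("single-file job slowdown: %.2fx%n", slowdown(50, 20));
        // TPCDS e2e tests: 30 minutes (new) vs 20 minutes (legacy).
        System.out.printf("TPCDS e2e slowdown: %.2fx%n", slowdown(30, 20));
        // Rough throughput for the ~3.8G store_sales.dat file (size is approximate).
        System.out.printf("new source: ~%.0f MB/s, legacy: ~%.0f MB/s%n",
                throughputMBps(3.8, 50), throughputMBps(3.8, 20));
    }
}
```

The single-file gap is larger than the end-to-end gap, which is what one would 
expect if reading csv is only part of the total TPCDS e2e run time.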
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
