[jira] [Updated] (FLINK-27078) There is a performance gap between the new csv source(file system source + CSV format) and legacy CsvTableSource.

Lijie Wang (Jira) Tue, 05 Apr 2022 23:57:34 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-27078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lijie Wang updated FLINK-27078:
-------------------------------
    Attachment: PerformanceTest.java

> There is a performance gap between the new csv source(file system source + 
> CSV format) and legacy CsvTableSource.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27078
>                 URL: https://issues.apache.org/jira/browse/FLINK-27078
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Lijie Wang
>            Priority: Major
>         Attachments: PerformanceTest.java
>
>
> In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source.
> We found that after switching to the new source, the TPCDS e2e tests run
> slower than before: they used to take 20 minutes and now take 30 minutes.
> The main cause is that the new csv source is slower than the legacy
> {{CsvTableSource}}. We ran an experiment to verify this: run the code in
> [^PerformanceTest.java] to read a csv file of about 3.8G (store_sales.dat
> of TPCDS-10G, which can be generated by
> {{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen]
>  -SCALE 10 -FORCE Y -DIR ...}}), and you will see that the job running
> times differ significantly: on my computer, the job runs for 50s with the new
> csv source and 20s with the legacy {{CsvTableSource}}.
>  
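
To put the reported timings in perspective, the following is a minimal sketch that turns them into throughput and slowdown figures. It is not the attached PerformanceTest.java; the class and method names are hypothetical, and it assumes "about 3.8G" means roughly 3800 MB.

```java
import java.util.Locale;

// Hypothetical helper, not part of Flink or of the attached PerformanceTest.java.
public class CsvSourceGap {

    /** Throughput in MB/s when sizeMb megabytes are read in the given number of seconds. */
    static double throughputMbPerSec(double sizeMb, double seconds) {
        return sizeMb / seconds;
    }

    /** How many times longer the new source takes compared with the legacy one. */
    static double slowdown(double newSeconds, double legacySeconds) {
        return newSeconds / legacySeconds;
    }

    public static void main(String[] args) {
        double fileSizeMb = 3800.0;      // assumption: "about 3.8G" taken as 3800 MB
        double newSourceSec = 50.0;      // reported: new csv source
        double legacySourceSec = 20.0;   // reported: legacy CsvTableSource

        System.out.printf(Locale.ROOT, "new csv source:  %.1f MB/s%n",
                throughputMbPerSec(fileSizeMb, newSourceSec));
        System.out.printf(Locale.ROOT, "CsvTableSource:  %.1f MB/s%n",
                throughputMbPerSec(fileSizeMb, legacySourceSec));
        System.out.printf(Locale.ROOT, "single-file slowdown: %.1fx%n",
                slowdown(newSourceSec, legacySourceSec));
        // The TPCDS e2e run went from 20 to 30 minutes.
        System.out.printf(Locale.ROOT, "TPCDS e2e slowdown:   %.1fx%n",
                slowdown(30.0, 20.0));
    }
}
```

Under that size assumption, the new source reads at roughly 76 MB/s versus about 190 MB/s for the legacy source, a 2.5x gap on the single-file read compared with a 1.5x gap on the full TPCDS e2e run.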



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
