[ 
https://issues.apache.org/jira/browse/FLINK-27078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijie Wang updated FLINK-27078:
-------------------------------
    Description: 
In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new CSV source. 
We found that after switching to the new source, the TPCDS e2e tests run slower 
than before: they took 20 minutes before and now take 30 minutes.

We found that this is mainly because the new CSV source is slower than the 
legacy {{CsvTableSource}}. We ran an experiment to verify it: run the code in 
[^PerformanceTest.java] to read a CSV file of about 3.8G ({{store_sales.dat}} 
of TPCDS-10G, which can be generated by 
{{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen] -SCALE 10 -FORCE Y -DIR ...}}). 
The job running time differs significantly: on my computer, the job runs for 
50s with the new CSV source and 20s with the legacy {{CsvTableSource}}.
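
To put the reported numbers side by side, here is a minimal sketch of the arithmetic behind them. It is not part of the attached PerformanceTest.java: the class and method names are hypothetical, and it assumes that "3.8G" means roughly 3800 MB and that the measured times are dominated by reading the file.

```java
public class CsvSourceGap {

    // Slowdown factor: how many times longer the new path takes than the legacy one.
    public static double slowdown(double newSeconds, double legacySeconds) {
        return newSeconds / legacySeconds;
    }

    // Approximate read throughput in MB/s (assumption: 1 G = 1000 MB).
    public static double throughputMBps(double fileSizeG, double seconds) {
        return fileSizeG * 1000.0 / seconds;
    }

    public static void main(String[] args) {
        double fileSizeG = 3.8; // store_sales.dat of TPCDS-10G, as reported above

        // Single-file experiment: 50s with the new CSV source, 20s with the legacy source.
        System.out.println("source slowdown: " + slowdown(50, 20));
        System.out.println("new source MB/s: " + throughputMBps(fileSizeG, 50));
        System.out.println("legacy source MB/s: " + throughputMBps(fileSizeG, 20));

        // Whole TPCDS e2e run: 30 minutes after the migration, 20 minutes before.
        System.out.println("e2e slowdown: " + slowdown(30 * 60, 20 * 60));
    }
}
```

Under these assumptions the single-file read is 2.5x slower while the whole e2e run is 1.5x slower, which is consistent with the source being only one part of the total test time.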
 


> There is a performance gap between the new CSV source (file system source + 
> CSV format) and the legacy CsvTableSource.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27078
>                 URL: https://issues.apache.org/jira/browse/FLINK-27078
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Lijie Wang
>            Priority: Major
>         Attachments: PerformanceTest.java
>
>
> In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new CSV 
> source. We found that after switching to the new source, the TPCDS e2e tests 
> run slower than before: they took 20 minutes before and now take 30 minutes.
> We found that this is mainly because the new CSV source is slower than the 
> legacy {{CsvTableSource}}. We ran an experiment to verify it: run the code in 
> [^PerformanceTest.java] to read a CSV file of about 3.8G 
> ({{store_sales.dat}} of TPCDS-10G, which can be generated by 
> {{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen] -SCALE 10 -FORCE Y -DIR ...}}). 
> The job running time differs significantly: on my computer, the job runs for 
> 50s with the new CSV source and 20s with the legacy {{CsvTableSource}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
