[ 
https://issues.apache.org/jira/browse/FLINK-27078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517898#comment-17517898
 ] 

Martijn Visser edited comment on FLINK-27078 at 4/6/22 7:22 AM:
----------------------------------------------------------------

[~wanglijie95] Most likely yes. But it's good to have a benchmark to actually 
compare data. 


was (Author: martijnvisser):
[~wanglijie95] Most likely yes. 

> There is a performance gap between the new csv source (file system source + 
> CSV format) and the legacy CsvTableSource.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27078
>                 URL: https://issues.apache.org/jira/browse/FLINK-27078
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Lijie Wang
>            Priority: Major
>         Attachments: PerformanceTest.java
>
>
> In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source.
> We found that after the change, the TPCDS e2e tests run slower than before:
> they used to take 20 minutes and now take 30 minutes (see
> [pr19152|https://github.com/apache/flink/pull/19152] for details).
> We found that this is mainly because the new csv source is slower than the
> legacy {{CsvTableSource}}. We ran an experiment to verify it: run the code in
> [^PerformanceTest.java] to read a csv file of about 3.8G
> ({{store_sales.dat}} of TPCDS-10G, which can be generated by
> {{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen]
>  -SCALE 10 -FORCE Y -DIR ...}}), and you will find that the job running
> time differs considerably: on my computer, the job runs for 50s with the new
> csv source and 20s with the legacy {{CsvTableSource}}.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
