[jira] [Updated] (FLINK-19595) Flink SQL support S3 select

Flink Jira Bot (Jira) Mon, 01 Nov 2021 15:40:11 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-19595:
-----------------------------------
    Labels: auto-deprioritized-major stale-minor  (was: 
auto-deprioritized-major)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue has been marked as 
Minor but is unassigned, and neither it nor its Sub-Tasks have been updated 
for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is 
still Minor, please either assign yourself or give an update. Afterwards, 
please remove the label, or in 7 days the issue will be deprioritized.


> Flink SQL support S3 select
> ---------------------------
>
>                 Key: FLINK-19595
>                 URL: https://issues.apache.org/jira/browse/FLINK-19595
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems, Table SQL / Ecosystem
>            Reporter: liuxiaolong
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
>         Attachments: image-2020-11-02-18-08-11-461.png, 
> image-2020-11-02-18-18-14-961.png
>
>
> h4. Summary
> Flink reads data stored in Tencent COS through S3AInputStream.java, which 
> calls the getObject function of AmazonS3Client.java. 
> Tencent COS now supports select pushdown for the CSV and Parquet file 
> formats.
> In these cases, reading data with getObject wastes a lot of bandwidth, 
> because getObject fetches the whole object.
> So I think Flink SQL should support S3 Select to reduce the wasted 
> bandwidth.
>  
> h4. Design
> 1. In HiveMapredSplitReader.java, we use the int[] selectedFields array to 
> construct the S3 SELECT SQL. We also created a new class named 
> S3SelectCsvReader, which uses the AmazonS3Client.selectObjectContent 
> function to read the CSV file line by line.
> !image-2020-11-02-18-08-11-461.png|width=535,height=967!
>  
> !image-2020-11-02-18-18-14-961.png|width=629,height=284!
>  
> 2.  Flink Demo Table:
> 1) Table schema
> Flink SQL> desc cos.test_s3a;
>  root
>  |-- name: STRING   (col1)
>  |-- age: INT       (col2)
>  |-- dt: STRING     (col3, a partition column)
>  
> 2) Conversion relationship (Flink SQL converted to S3 SELECT SQL)
> Flink SQL                                            =>  S3 SELECT SQL
> select name from cos.test_s3a;                       =>  SELECT s._1, null FROM S3Object s
> select age from cos.test_s3a;                        =>  SELECT null, s._2 FROM S3Object s
> select dt, name, age from cos.test_s3a;              =>  SELECT s._1, s._2 FROM S3Object s
> select dt from cos.test_s3a;                         =>  SELECT null, null FROM S3Object s
> select * from cos.test_s3a;                          =>  SELECT s._1, s._2 FROM S3Object s
> select name from cos.test_s3a where dt='2020-07-15'; =>  SELECT s._1, null FROM S3Object s
>  
> 3) Patch Commit
> https://github.com/Coderlxl/flink/commit/b211f4830a7301bf9283a6d37209000b176913ad
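
The conversion listed in the description can be sketched in a few lines of Java. This is only an illustration of the mapping shown in the table, not the code from the linked patch: the class name S3SelectSqlBuilder, the method buildSelectSql, and the dataColumnCount parameter are hypothetical. It assumes that selectedFields holds zero-based indices into the table schema and that partition columns (such as dt) come after the columns physically stored in the CSV file, so any index at or beyond dataColumnCount is not pushed down.

```java
import java.util.StringJoiner;

// Hypothetical helper, not taken from the patch: builds an S3 SELECT
// expression that keeps one output position per physical CSV column.
class S3SelectSqlBuilder {

    // selectedFields: zero-based schema indices requested by the query (assumed).
    // dataColumnCount: number of columns stored in the file, excluding
    // partition columns (assumed to follow the data columns in the schema).
    static String buildSelectSql(int[] selectedFields, int dataColumnCount) {
        if (dataColumnCount <= 0) {
            throw new IllegalArgumentException("dataColumnCount must be positive");
        }
        boolean[] selected = new boolean[dataColumnCount];
        for (int field : selectedFields) {
            // Partition columns are resolved from the path, not from the file.
            if (field >= 0 && field < dataColumnCount) {
                selected[field] = true;
            }
        }
        StringJoiner columns = new StringJoiner(", ");
        for (int i = 0; i < dataColumnCount; i++) {
            // S3 Select addresses header-less CSV columns as _1, _2, ...
            columns.add(selected[i] ? "s._" + (i + 1) : "null");
        }
        return "SELECT " + columns + " FROM S3Object s";
    }

    public static void main(String[] args) {
        // Schema: name (0), age (1), dt (2, partition column).
        System.out.println(buildSelectSql(new int[] {0}, 2));       // SELECT s._1, null FROM S3Object s
        System.out.println(buildSelectSql(new int[] {2}, 2));       // SELECT null, null FROM S3Object s
        System.out.println(buildSelectSql(new int[] {2, 0, 1}, 2)); // SELECT s._1, s._2 FROM S3Object s
    }
}
```

Emitting null for unselected columns keeps the field positions of each returned record stable, so a reader can keep indexing fields by their original column number while the skipped values are not transferred.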



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
