[jira] [Commented] (FLINK-30556) Improve the logic for enumerating splits for Hive source to avoid potential OOM

luoyuxia (Jira) Wed, 04 Jan 2023 18:13:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654702#comment-17654702
 ]


luoyuxia commented on FLINK-30556:
----------------------------------

[~Wencong Liu] Thanks for the investigation. TBH, that's what I had in mind 
too. It will not only help fix the OOM in the Hive source, but also help the 
other sources that depend on the FileSystem source. My only concern is that it 
will touch public API, which requires a FLIP and adds some complexity. But I 
think it's the right way, and I think you can do a quick PoC for it. 

BTW, the other idea, which may be simpler, is to follow what we do in 
ContinuousHiveSplitEnumerator, which enumerates splits in an incremental way 
in streaming mode. 

I'm fine with either of them, but you can compare the two approaches and 
choose one.
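
To make the incremental idea concrete, here is a minimal sketch in plain Java. 
It does not use any Flink or Hive API: LazySplitEnumeratorSketch, 
LazyEnumerator, the String-based partitions and splits, and the splitLister 
function are all hypothetical names for illustration only. It only shows the 
general shape, assuming splits can be listed per partition: list one 
partition when a split is requested, so that at most one partition's splits 
are buffered at a time.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class LazySplitEnumeratorSketch {

    /** Hypothetical enumerator: lists a partition's splits only when they are needed. */
    static final class LazyEnumerator {
        private final Iterator<String> partitions;
        // Assumption: some function can list the splits of a single partition.
        private final Function<String, List<String>> splitLister;
        private final Deque<String> buffer = new ArrayDeque<>();
        private int maxBuffered = 0;

        LazyEnumerator(List<String> partitions, Function<String, List<String>> splitLister) {
            this.partitions = partitions.iterator();
            this.splitLister = splitLister;
        }

        /** Returns the next split, or empty once every partition has been consumed. */
        Optional<String> nextSplit() {
            // Refill from the next partition only when the buffer has run dry.
            while (buffer.isEmpty() && partitions.hasNext()) {
                buffer.addAll(splitLister.apply(partitions.next()));
                maxBuffered = Math.max(maxBuffered, buffer.size());
            }
            return Optional.ofNullable(buffer.poll());
        }

        /** Largest number of splits held in memory at any one time. */
        int maxBuffered() {
            return maxBuffered;
        }
    }

    public static void main(String[] args) {
        List<String> partitions =
                Arrays.asList("dt=2023-01-01", "dt=2023-01-02", "dt=2023-01-03");
        // Stand-in lister: pretend every partition has two files.
        LazyEnumerator enumerator = new LazyEnumerator(
                partitions, p -> Arrays.asList(p + "/part-0", p + "/part-1"));

        int handedOut = 0;
        while (enumerator.nextSplit().isPresent()) {
            handedOut++;
        }
        System.out.println("splits handed out: " + handedOut
                + ", peak buffered: " + enumerator.maxBuffered());
    }
}
```

With eager enumeration the peak would equal the total number of splits; here 
it is bounded by the largest single partition. Whether this fits the real 
enumerator (checkpointing, split assignment, etc.) would still need the PoC.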

> Improve the logic for enumerating splits for Hive source to avoid potential 
> OOM
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-30556
>                 URL: https://issues.apache.org/jira/browse/FLINK-30556
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / Hive
>    Affects Versions: 1.16.0
>            Reporter: luoyuxia
>            Priority: Major
>
> Currently, when reading a Hive source in batch mode, it will first enumerate 
> all splits for the Hive table. But when the table is large, there will be too 
> many splits, which may well cause OOM. Some community users have also 
> reported this problem. 
> We need to optimize the logic for enumerating splits for the Hive table 
> source to avoid potential OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
