[ https://issues.apache.org/jira/browse/PHOENIX-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455885#comment-17455885 ]
Lars Hofhansl commented on PHOENIX-6608:
----------------------------------------

What kind of worker? Is that some custom worker with a JDBC client? An M/R job or Trino job is only planned once, right? So that should not be a problem there...? Hopefully the workers do not need to re-load the stats. That would be another bug.

> DISCUSS: Rethink MapReduce split generation
> -------------------------------------------
>
>                 Key: PHOENIX-6608
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6608
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Lars Hofhansl
>            Priority: Major
>
> I just ran into an issue with Trino, which uses Phoenix's M/R integration to generate splits for its worker nodes.
> See: [https://github.com/trinodb/trino/issues/10143]
> And a fix: [https://github.com/trinodb/trino/pull/10153]
> In short, the issue is that with a large data size and guideposts enabled (the default), Phoenix's RoundRobinResultIterator starts scanning as soon as tasks are submitted to the queue. For large datasets (per client) this fills the heap with pre-fetched HBase result objects.
> The MapReduce (and Spark) integrations presumably have the same issue.
> My proposed solution is that instead of allowing Phoenix to do intra-split parallelism, we create more splits (the fix above groups 20 scans into a split; 20 turned out to be a good number).

-- This message was sent by Atlassian Jira (v8.20.1#820001)
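The split-grouping idea in the proposal above can be sketched as plain batching: take the flat list of per-guidepost scans and chunk it into fixed-size groups, each of which becomes one split for a worker. This is a minimal illustrative sketch, not the actual Phoenix or Trino code; the class and method names are hypothetical, and a generic type stands in for the real HBase Scan objects.

```java
import java.util.ArrayList;
import java.util.List;

public class ScanGrouping {

    // Hypothetical helper: chunk a flat list of scans into batches of
    // scansPerSplit, so each batch becomes one split handed to a worker.
    // The last batch may be smaller than scansPerSplit.
    static <T> List<List<T>> groupIntoSplits(List<T> scans, int scansPerSplit) {
        List<List<T>> splits = new ArrayList<>();
        for (int i = 0; i < scans.size(); i += scansPerSplit) {
            splits.add(new ArrayList<>(
                scans.subList(i, Math.min(i + scansPerSplit, scans.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 45 stand-in "scans" grouped 20 per split -> 3 splits (20, 20, 5).
        List<Integer> scans = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            scans.add(i);
        }
        List<List<Integer>> splits = groupIntoSplits(scans, 20);
        System.out.println(splits.size());
    }
}
```

Because each worker then opens only the scans in its own split, no single client pre-fetches results for the whole table at once, which is what filled the heap in the Trino case.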