[ https://issues.apache.org/jira/browse/PHOENIX-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455885#comment-17455885 ]
Lars Hofhansl commented on PHOENIX-6608:
----------------------------------------

What kind of worker? Is that some custom worker with a JDBC client? An M/R job or Trino job is only planned once, right? So that should not be a problem there...? Hopefully the workers do not need to re-load the stats. That would be another bug.

> DISCUSS: Rethink MapReduce split generation
> -------------------------------------------
>
>                 Key: PHOENIX-6608
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6608
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Lars Hofhansl
>            Priority: Major
>
> I just ran into an issue with Trino, which uses Phoenix's M/R integration to generate splits for its worker nodes.
> See: [https://github.com/trinodb/trino/issues/10143]
> And a fix: [https://github.com/trinodb/trino/pull/10153]
> In short, the issue is that with a large data size and guideposts enabled (the default), Phoenix's RoundRobinResultIterator starts scanning as soon as tasks are submitted to the queue. For large datasets (per client) this fills the heap with pre-fetched HBase result objects.
> The MapReduce (and Spark) integrations presumably have the same issue.
> My proposed solution is that instead of allowing Phoenix to do intra-split parallelism, we create more splits (the fix above groups 20 scans into a split; 20 turned out to be a good number).

-- This message was sent by Atlassian Jira (v8.20.1#820001)
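The split-grouping idea in the proposal above can be sketched as plain batching: take the flat list of per-guidepost scans and chunk it into fixed-size groups, each of which becomes one split for a worker. This is a minimal illustrative sketch, not the actual Phoenix or Trino code; the class and method names are hypothetical, and a generic type stands in for the real HBase Scan objects.

```java
import java.util.ArrayList;
import java.util.List;

public class ScanGrouping {

    // Hypothetical helper: chunk a flat list of scans into batches of
    // scansPerSplit, so each batch becomes one split handed to a worker.
    // The last batch may be smaller than scansPerSplit.
    static <T> List<List<T>> groupIntoSplits(List<T> scans, int scansPerSplit) {
        List<List<T>> splits = new ArrayList<>();
        for (int i = 0; i < scans.size(); i += scansPerSplit) {
            splits.add(new ArrayList<>(
                scans.subList(i, Math.min(i + scansPerSplit, scans.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 45 stand-in "scans" grouped 20 per split -> 3 splits (20, 20, 5).
        List<Integer> scans = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            scans.add(i);
        }
        List<List<Integer>> splits = groupIntoSplits(scans, 20);
        System.out.println(splits.size());
    }
}
```

Because each worker then opens only the scans in its own split, no single client pre-fetches results for the whole table at once, which is what filled the heap in the Trino case.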