Lars Hofhansl created PHOENIX-6608:
--------------------------------------
Summary: DISCUSS: Rethink MapReduce split generation
Key: PHOENIX-6608
URL: https://issues.apache.org/jira/browse/PHOENIX-6608
Project: Phoenix
Issue Type: Improvement
Reporter: Lars Hofhansl
I just ran into an issue with Trino, which uses Phoenix' M/R integration to
generate splits for its worker nodes.
See: [https://github.com/trinodb/trino/issues/10143]
And a fix: [https://github.com/trinodb/trino/pull/10153]
In short, the issue is that with large data sizes and guideposts enabled
(the default), Phoenix's RoundRobinResultIterator starts scanning as soon as
tasks are submitted to the queue. For large datasets (per client), this fills
the heap with prefetched HBase Result objects.
The MapReduce (and Spark) integrations presumably have the same issue.
My proposed solution is to create more splits instead of letting Phoenix do
intra-split parallelism (the fix above groups 20 scans into one split;
20 turned out to be a good number).
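
To make the grouping idea concrete, here is a minimal sketch, not the actual
Trino or Phoenix code. It assumes the scans are already available as an ordered
list and simply cuts that list into consecutive groups of at most 20, one group
per split. The class name SplitGrouping, the method groupScans and the constant
SCANS_PER_SPLIT are hypothetical names used only for illustration; the element
type is left generic so the sketch runs without HBase on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGrouping {
    // Hypothetical default, taken from the "20 scans per split" figure above.
    public static final int SCANS_PER_SPLIT = 20;

    /**
     * Partitions the scans, in order, into consecutive groups of at most
     * groupSize elements. Each group would back one input split.
     */
    public static <T> List<List<T>> groupScans(List<T> scans, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        for (int start = 0; start < scans.size(); start += groupSize) {
            int end = Math.min(start + groupSize, scans.size());
            // Copy the sublist so each split owns its own list.
            splits.add(new ArrayList<>(scans.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Stand-in for real scans: 45 placeholder integers.
        List<Integer> scans = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            scans.add(i);
        }
        List<List<Integer>> splits = groupScans(scans, SCANS_PER_SPLIT);
        System.out.println(splits.size() + " splits");
    }
}
```

With more, smaller splits, the amount of scanning started per split is bounded
by the group size rather than by the total number of scans handed to one client.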