[
https://issues.apache.org/jira/browse/IMPALA-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163144#comment-17163144
]
Tim Armstrong commented on IMPALA-4746:
---------------------------------------
IMPALA-8125 implements this for write fragments only - that allows controlling
# of small files without affecting the rest of the query plan.
> num_nodes should take any value and use that many nodes
> -------------------------------------------------------
>
> Key: IMPALA-4746
> URL: https://issues.apache.org/jira/browse/IMPALA-4746
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.6.0
> Reporter: Peter Ebert
> Priority: Major
>
> Ideally num_nodes should be expanded to allow any # of nodes (up to total
> number of Impalad's in the cluster). This would randomly select nodes to run
> fragments on, prioritizing data locallity (where possible). Locality gets
> more complex if there are multiple scans, perhaps defaulting to the largest
> table might work best
> This would conceivably be helpful in several scenarios
> * Scale performance testing: could run a query with num_nodes = 10 vs 20 to
> get a rough idea of scaling performance.
> * Avoid writing small files: if doing an insert on larger clusters (200+),
> even fairly large data sets can be spread out too thin. Setting # of nodes
> would allow you to control small files which then simplifies read query plans
> and # of fragments.
> * Control cluster utilization: some queries may perform just as well with
> less nodes, particularly where there are broadcast joins where data would be
> sent over the network anyways. There may be some multi-tenancy improvements
> by restricting the number of nodes some queries can use.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]