[ https://issues.apache.org/jira/browse/IMPALA-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470657#comment-16470657 ]
ASF subversion and git services commented on IMPALA-6372:
---------------------------------------------------------
Commit b126b2d1053bde6671701af3931c7424a646cd54 in impala's branch
refs/heads/master from [~joemcdonnell]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=b126b2d ]
IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2
A bug in Hive 1.1.0 can cause a
NullPointerException during parallel Hive
operations (see IMPALA-6532). Since dataload runs
Hive loads in parallel starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is affecting
builds on the 2.x branch.
This change disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.
IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that has the fix, so it continues
to use parallel dataload.
Parallelism can be re-enabled once Hive 1.1.0 gets the
fix from Hive 2.1.1.
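For illustration only, the gating described above might look roughly like the
sketch below. The helper name hive_load_parallelism and its parameters are
hypothetical and are not taken from the actual change; only the
IMPALA_MINICLUSTER_PROFILE variable and its values come from the commit message.

```python
import os


# Hypothetical helper: the real logic lives in Impala's dataload scripts
# and may be structured differently.
def hive_load_parallelism(default_workers, env=None):
    """Return how many Hive load files to execute at once."""
    env = os.environ if env is None else env
    # Profile 2 corresponds to Hive 1.1.0, which can hit the
    # NullPointerException under parallel Hive operations, so fall
    # back to serial execution there.
    if env.get("IMPALA_MINICLUSTER_PROFILE") == "2":
        return 1
    # Other profiles (e.g. 3) keep the parallel behavior.
    return default_workers
```

Passing the environment as a parameter keeps the decision easy to check
without changing the real process environment.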
Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Dataload should execute Hive loads in parallel
> ----------------------------------------------
>
> Key: IMPALA-6372
> URL: https://issues.apache.org/jira/browse/IMPALA-6372
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 2.12.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Fix For: Impala 3.0
>
>
> Currently, bin/load-data.py goes parallel on parts of the Impala DDLs and
> Impala inserts (for Parquet), but it does not go parallel on the Hive portion
> of dataload. testdata/bin/generate-schema-statements.py only generates a
> single Hive file that load-data.py executes serially.
> generate-schema-statements.py should generate multiple SQL files for the Hive
> load portion and load-data.py should execute them in parallel (once the base
> text tables have been loaded).
> Even with parallel execution of TPC-H, functional, and TPC-DS, this will
> still deliver speedups for dataload (and GVO).
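As a rough sketch of the idea in the description above (several generated Hive
SQL files executed concurrently instead of one file executed serially),
something like the following could be used. The names run_hive_files and
run_one are hypothetical; run_one stands in for whatever actually submits a
single file to Hive, which the description does not specify.

```python
from concurrent.futures import ThreadPoolExecutor


def run_hive_files(sql_files, run_one, num_workers):
    """Run each SQL file through run_one, at most num_workers at a time.

    run_one is a hypothetical callable that executes one file and
    raises on failure. Results are returned in input order.
    """
    if num_workers <= 1:
        # Serial fallback, e.g. when parallel Hive loads are disabled.
        return [run_one(f) for f in sql_files]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order and re-raises the first
        # exception from run_one when its result is consumed.
        return list(pool.map(run_one, sql_files))
```

Threads are a reasonable fit under the assumption that each call mostly waits
on an external Hive process or connection rather than doing work in Python.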