[jira] [Commented] (IMPALA-6372) Dataload should execute Hive loads in parallel

ASF subversion and git services (JIRA) Thu, 10 May 2018 09:27:14 -0700

    [ https://issues.apache.org/jira/browse/IMPALA-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470657#comment-16470657 ]


ASF subversion and git services commented on IMPALA-6372:
---------------------------------------------------------

Commit b126b2d1053bde6671701af3931c7424a646cd54 in impala's branch 
refs/heads/master from [~joemcdonnell]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=b126b2d ]

IMPALA-6972: Disable parallel dataload on MINICLUSTER_PROFILE=2

Hive 1.1.0 has a bug that can cause a
NullPointerException during parallel Hive
operations (see IMPALA-6532). Since dataload runs
Hive loads in parallel starting with IMPALA-6372,
dataload can hit this error on Hive 1.1.0 (i.e.
IMPALA_MINICLUSTER_PROFILE=2). This is impacting
builds on the 2.x branch.

This disables parallel dataload for IMPALA_MINICLUSTER_PROFILE=2.

IMPALA_MINICLUSTER_PROFILE=3 uses a newer version
of Hive that includes the fix, so dataload continues
to run in parallel for that profile.

Parallelism can be re-enabled once Hive 1.1.0 picks up
the fix from Hive 2.1.1.

Change-Id: I90a0f2b3756d7192fa7db2958031b8c88eb606e6
Reviewed-on: http://gerrit.cloudera.org:8080/10306
Reviewed-by: Philip Zeyliger <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
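
The gating described in the commit message above could look roughly like the sketch below. It is not the code from the commit: the function name hive_load_parallelism and its parameters are hypothetical, and the only detail taken from the message is that IMPALA_MINICLUSTER_PROFILE=2 (Hive 1.1.0) must fall back to serial Hive loads.

```python
import os


def hive_load_parallelism(default_workers, env=None):
    """Return how many Hive load files to run at once.

    Hypothetical helper: the name and signature are assumptions for
    illustration, not taken from bin/load-data.py.
    """
    env = os.environ if env is None else env
    # Assumption: the profile is read from the environment as a string.
    profile = env.get("IMPALA_MINICLUSTER_PROFILE", "")
    if profile == "2":
        # Hive 1.1.0 can throw a NullPointerException under parallel
        # operations (IMPALA-6532), so run the Hive loads serially.
        return 1
    # Other profiles keep the requested parallelism, never below 1.
    return max(1, default_workers)
```

Passing the environment as a dict makes the decision easy to check without touching the real process environment, e.g. hive_load_parallelism(8, {"IMPALA_MINICLUSTER_PROFILE": "2"}).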


> Dataload should execute Hive loads in parallel
> ----------------------------------------------
>
>                 Key: IMPALA-6372
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6372
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 2.12.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>             Fix For: Impala 3.0
>
>
> Currently, bin/load-data.py goes parallel on parts of the Impala DDLs and 
> Impala inserts (for Parquet), but it does not go parallel on the Hive portion 
> of dataload. testdata/bin/generate-schema-statements.py only generates a 
> single Hive file that load-data.py executes serially.
> generate-schema-statements.py should generate multiple SQL files for the Hive 
> load portion and load-data.py should execute them in parallel (once the base 
> text tables have been loaded).
> Even with parallel execution of TPC-H, functional, and TPC-DS, this will 
> still deliver speedups for dataload (and GVO).
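
As a rough illustration of the proposal quoted above, several generated Hive SQL files could be run concurrently with a thread pool. This is a minimal sketch, not the load-data.py implementation: run_hive_files and the run_one callable are hypothetical names, and the caller is assumed to have already loaded the base text tables before invoking it.

```python
from concurrent.futures import ThreadPoolExecutor


def run_hive_files(sql_files, run_one, num_workers):
    """Run run_one(path) for each SQL file, at most num_workers at a time.

    run_one is a caller-supplied callable (hypothetical) that executes one
    file against Hive and returns True on success. Returns the list of
    paths whose run reported failure, in input order.
    """
    sql_files = list(sql_files)
    failed = []
    with ThreadPoolExecutor(max_workers=max(1, num_workers)) as pool:
        # pool.map yields results in the same order as the inputs.
        for path, ok in zip(sql_files, pool.map(run_one, sql_files)):
            if not ok:
                failed.append(path)
    return failed
```

With num_workers set to 1 the same function runs the files one at a time, so a single code path can cover both the serial and the parallel case.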



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
