[ https://issues.apache.org/jira/browse/HIVE-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gunther Hagleitner updated HIVE-6098:
-------------------------------------
    Release Note:

Here are the instructions for setting up Tez on your hadoop 2 cluster:
https://github.com/apache/incubator-tez/blob/branch-0.2.0/INSTALL.txt

Notes:

- I start hive with "hive -hiveconf hive.execution.engine=tez". This is not strictly necessary, but it starts the AM/containers right away instead of on the first query.
- The hive-exec jar should be copied to hdfs:///user/hive/ (the location can be changed with hive.jar.directory). This avoids re-localization of the hive jar.

Hive settings:

// needed because SMB isn't supported on tez yet
set hive.optimize.bucketmapjoin=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.auto.convert.sortmerge.join=false;
set hive.auto.convert.sortmerge.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask=true;

// depends on your available mem/cluster, but map/reduce mb should be set to the same value for container reuse
set hive.auto.convert.join.noconditionaltask.size=64000000;
set mapred.map.child.java.opts=-server -Xmx3584m -Djava.net.preferIPv4Stack=true;
set mapred.reduce.child.java.opts=-server -Xmx3584m -Djava.net.preferIPv4Stack=true;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;

// generic opts
set hive.optimize.reducededuplication.min.reducer=1;
set hive.optimize.mapjoin.mapreduce=true;

// autogather might require you to raise the max number of counters, if you run into issues
set hive.stats.autogather=true;
set hive.stats.dbclass=counter;

// tez settings can also go into tez-site if desired
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
set tez.runtime.intermediate-output.should-compress=true;
set tez.runtime.intermediate-output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
set tez.runtime.intermediate-input.is-compressed=true;
set tez.runtime.intermediate-input.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;

// tez groups splits in the AM
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.orc.splits.include.file.footer=true;
set hive.root.logger=ERROR,console;
set hive.execution.engine=tez;
set hive.vectorized.execution.enabled=true;
set hive.exec.local.cache=true;
set hive.compute.query.using.stats=true;

For tez (tez-site):

<property>
  <name>tez.am.resource.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>tez.am.java.opts</name>
  <value>-server -Xmx7168m -Djava.net.preferIPv4Stack=true</value>
</property>
<property>
  <name>tez.am.grouping.min-size</name>
  <value>16777216</value>
</property>
<!-- Client submission timeout value when submitting DAGs to a session -->
<property>
  <name>tez.session.client.timeout.secs</name>
  <value>-1</value>
</property>
<!-- prewarm stuff -->
<property>
  <name>tez.session.pre-warm.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.session.pre-warm.num.containers</name>
  <value>10</value>
</property>
<property>
  <name>tez.am.grouping.split-waves</name>
  <value>0.9</value>
</property>
<property>
  <name>tez.am.container.reuse.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.reuse.rack-fallback.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.reuse.non-local-fallback.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.session.delay-allocation-millis</name>
  <value>-1</value>
</property>
<property>
  <name>tez.am.container.reuse.locality.delay-allocation-millis</name>
  <value>250</value>
</property>
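The heap and container sizes above follow a consistent ratio: the JVM -Xmx is 7/8 (87.5%) of the YARN container size (3584m heap in the 4096mb task containers, 7168m in the 8192mb Tez AM), leaving headroom for non-heap memory. A minimal sketch of that rule of thumb (the helper name and the exact fraction are illustrative assumptions, not part of Hive or Tez):

```python
# Hypothetical helper, for illustration only: derive a JVM -Xmx (in MB)
# from a YARN container size, reserving 1/8 of the container for
# non-heap memory. Matches the settings above: 4096 -> 3584, 8192 -> 7168.
def heap_for_container(container_mb: int, heap_fraction: float = 0.875) -> int:
    """Return an -Xmx value in MB for a container of container_mb MB."""
    return int(container_mb * heap_fraction)

# Task containers (mapreduce.{map,reduce}.memory.mb=4096 -> -Xmx3584m):
print(heap_for_container(4096))  # 3584
# Tez AM (tez.am.resource.memory.mb=8192 -> -Xmx7168m):
print(heap_for_container(8192))  # 7168
```

If you change the container sizes for your cluster, keeping the child/AM java opts at roughly this fraction preserves the same headroom.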
> Merge Tez branch into trunk
> ---------------------------
>
>                 Key: HIVE-6098
>                 URL: https://issues.apache.org/jira/browse/HIVE-6098
>             Project: Hive
>          Issue Type: New Feature
>    Affects Versions: 0.12.0
>            Reporter: Gunther Hagleitner
>            Assignee: Gunther Hagleitner
>         Attachments: HIVE-6098.1.patch, HIVE-6098.2.patch, HIVE-6098.3.patch, HIVE-6098.4.patch, HIVE-6098.5.patch, HIVE-6098.6.patch, HIVE-6098.7.patch, HIVE-6098.8.patch, HIVE-6098.9.patch, hive-on-tez-conf.txt
>
> I think the Tez branch is at a point where we can consider merging it back into trunk after review.
> Tez itself has had its first release, most hive features are available on Tez, and the test coverage is decent. There are a few known limitations, all of which can be handled in trunk as far as I can tell (i.e., none of them are large disruptive changes that still require a branch.)
> Limitations:
> - Union all is not yet supported on Tez
> - SMB is not yet supported on Tez
> - Bucketed map-join is executed as broadcast join (bucketing is ignored)
> Since the user is free to toggle hive.optimize.tez, it's obviously possible to just run these on MR.
> I am hoping to follow the approach that was taken with vectorization and shoot for a merge instead of a single commit. This would retain the history of the branch. Also, in vectorization we required at least three +1s before merge; I'm hoping to go with that as well.
> I will add a combined patch to this ticket for review purposes (not for commit). I'll also attach instructions to run on a cluster if anyone wants to try.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)