Hi, I have been reading the Tajo 0.9.0 source code, and I found that Tajo uses a simple FIFO scheduler,
which I can accept at the current stage. However, when Tajo picks a query from the scheduler queue and allocates workers for it, the allocator only considers the available resources on a randomized worker list before selecting a set of workers.

1) So my question is: why don't we consider HDFS data locality? Otherwise the network will become the bottleneck. I understand that Tajo does not currently use YARN as its scheduler and instead ships a temporary, simple FIFO scheduler. I have also looked at https://issues.apache.org/jira/browse/TAJO-540 , and I hope the new Tajo scheduler will be similar to Sparrow.

2) Performance related. I set up a 10-node cluster (1 master, 9 workers), each node with 64 GB memory, 24 CPU cores, and 12 * 4 TB HDDs, against 1.6 GB of test data (160 million records). It works well for most aggregation SQL tests, except count(distinct): count(distinct) is very slow - about ten minutes. Can someone give me a simple explanation of how Tajo executes count(distinct)? I can share my tajo-site here:

<configuration>
  <property>
    <name>tajo.rootdir</name>
    <value>hdfs://realtime-cluster/tajo</value>
  </property>

  <!-- master -->
  <property>
    <name>tajo.master.umbilical-rpc.address</name>
    <value>xx:26001</value>
  </property>
  <property>
    <name>tajo.master.client-rpc.address</name>
    <value>xx:26002</value>
  </property>
  <property>
    <name>tajo.master.info-http.address</name>
    <value>xx:26080</value>
  </property>
  <property>
    <name>tajo.resource-tracker.rpc.address</name>
    <value>xx:26003</value>
  </property>
  <property>
    <name>tajo.catalog.client-rpc.address</name>
    <value>xx:26005</value>
  </property>

  <!-- worker -->
  <property>
    <name>tajo.worker.tmpdir.locations</name>
    <value>file:///data/hadoop/data1/tajo,file:///data/hadoop/data2/tajo,file:///data/hadoop/data3/tajo,file:///data/hadoop/data4/tajo,file:///data/hadoop/data5/tajo,file:///data/hadoop/data6/tajo,file:///data/hadoop/data7/tajo,file:///data/hadoop/data8/tajo,file:///data/hadoop/data9/tajo,file:///data/hadoop/data10/tajo,file:///data/hadoop/data11/tajo,file:///data/hadoop/data12/tajo</value>
  </property>
  <property>
    <name>tajo.worker.tmpdir.cleanup-at-startup</name>
    <value>true</value>
  </property>
  <property>
    <name>tajo.worker.history.expire-interval-minutes</name>
    <value>60</value>
  </property>
  <property>
    <name>tajo.worker.resource.tajo.worker.resource.cpu-cores</name>
    <value>24</value>
  </property>
  <property>
    <name>tajo.worker.resource.memory-mb</name>
    <value>60512</value> <!-- 3584 3 tasks + 1 qm task -->
  </property>
  <property>
    <name>tajo.task.memory-slot-mb.default</name>
    <value>3000</value> <!-- default 512 -->
  </property>
  <property>
    <name>tajo.task.disk-slot.default</name>
    <value>1.0f</value> <!-- default 0.5 -->
  </property>
  <property>
    <name>tajo.shuffle.fetcher.parallel-execution.max-num</name>
    <value>5</value>
  </property>
  <property>
    <name>tajo.executor.external-sort.thread-num</name>
    <value>2</value>
  </property>

  <!-- client -->
  <property>
    <name>tajo.rpc.client.worker-thread-num</name>
    <value>4</value>
  </property>
  <property>
    <name>tajo.cli.print.pause</name>
    <value>false</value>
  </property>

  <!--
  <property>
    <name>tajo.worker.resource.dfs-dir-aware</name>
    <value>true</value>
  </property>
  <property>
    <name>tajo.worker.resource.dedicated</name>
    <value>true</value>
  </property>
  <property>
    <name>tajo.worker.resource.dedicated-memory-ratio</name>
    <value>0.6</value>
  </property>
  -->
</configuration>

tajo-env: export TAJO_WORKER_HEAPSIZE=60000
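To make question 1) concrete, here is a toy sketch of the difference I mean between resource-only random assignment and a locality-aware pick. This is not Tajo's actual scheduler code; all names are made up for illustration:

```python
# Illustrative only (not Tajo's code): prefer a worker that holds an HDFS
# replica of the task's split, and fall back to any worker with capacity.
import random

def assign_task(split_hosts, workers, free_slots):
    """Pick a worker for a task whose HDFS split has replicas on split_hosts.

    Prefer a data-local worker (the task reads from local disk); fall back
    to any worker with free capacity (the task reads over the network).
    """
    local = [w for w in workers if w in split_hosts and free_slots[w] > 0]
    if local:
        return random.choice(local)  # data-local: no network transfer
    candidates = [w for w in workers if free_slots[w] > 0]
    # remote read: this is where the network becomes the bottleneck
    return random.choice(candidates) if candidates else None

workers = ["w1", "w2", "w3"]
slots = {"w1": 2, "w2": 0, "w3": 1}
# The split's replicas are on w2 and w3; w2 has no free slots, so w3 wins.
print(assign_task(["w2", "w3"], workers, slots))  # → w3
```

A purely random allocator would pick w1 two thirds of the time here and pay a remote read, which is why I expect locality awareness to matter at scale.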
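For context on question 2), my rough mental model (which may well be wrong for Tajo's actual planner) is that a plain count only ships one partial count per worker, while count(distinct) must deduplicate values globally, so it has to shuffle every distinct value across the network before counting. A toy sketch of that two-phase shape:

```python
# Illustrative only (not Tajo's planner output): why count(distinct) is
# heavier than a plain count in a shuffle-based execution engine.

def plain_count(partitions):
    # Each worker sends ONE partial count; the final stage sums them.
    partials = [len(p) for p in partitions]
    return sum(partials)

def count_distinct(partitions, n_reducers=2):
    # Phase 1: each worker dedupes locally, then hash-partitions its
    # distinct values so each value lands on exactly one reducer. The
    # shuffle volume grows with the number of distinct values.
    shuffled = [set() for _ in range(n_reducers)]
    for p in partitions:
        for v in set(p):  # local dedup
            shuffled[hash(v) % n_reducers].add(v)
    # Phase 2: each reducer counts its disjoint set; the final stage sums.
    return sum(len(s) for s in shuffled)

parts = [[1, 2, 2, 3], [3, 3, 4], [4, 5]]
print(plain_count(parts))      # → 9 rows
print(count_distinct(parts))   # → 5 distinct values
```

If Tajo implements it differently (e.g. via a sort-based distinct stage), I would appreciate a pointer to the relevant code.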
