Hi, I have been reading the Tajo 0.9.0 source code, and I found that Tajo uses a simple FIFO scheduler,
which I can accept at the current stage. However, when Tajo picks a query from the scheduler queue and allocates workers for it, the allocator only considers the available resources on a randomized worker list before selecting a set of workers.

1) So my question is: why don't we consider HDFS data locality? Otherwise the network will become the bottleneck. I understand that Tajo does not currently use YARN as its scheduler and instead ships a temporary, simple FIFO scheduler. I have also looked at https://issues.apache.org/jira/browse/TAJO-540 , and I hope the new Tajo scheduler will be similar to Sparrow.

2) Performance related. I set up a 10-node cluster (1 master, 9 workers), each node with 64 GB memory, 24 CPU cores, and 12 * 4 TB HDDs, against 1.6 GB of test data (160 million records). It works well for most aggregation SQL tests, except count(distinct): count(distinct) is very slow - about ten minutes. Can someone give me a simple explanation of how Tajo executes count(distinct)? I can share my tajo-site here:

<configuration>
  <property>
    <name>tajo.rootdir</name>
    <value>hdfs://realtime-cluster/tajo</value>
  </property>

  <!-- master -->
  <property>
    <name>tajo.master.umbilical-rpc.address</name>
    <value>xx:26001</value>
  </property>
  <property>
    <name>tajo.master.client-rpc.address</name>
    <value>xx:26002</value>
  </property>
  <property>
    <name>tajo.master.info-http.address</name>
    <value>xx:26080</value>
  </property>
  <property>
    <name>tajo.resource-tracker.rpc.address</name>
    <value>xx:26003</value>
  </property>
  <property>
    <name>tajo.catalog.client-rpc.address</name>
    <value>xx:26005</value>
  </property>

  <!-- worker -->
  <property>
    <name>tajo.worker.tmpdir.locations</name>
    <value>file:///data/hadoop/data1/tajo,file:///data/hadoop/data2/tajo,file:///data/hadoop/data3/tajo,file:///data/hadoop/data4/tajo,file:///data/hadoop/data5/tajo,file:///data/hadoop/data6/tajo,file:///data/hadoop/data7/tajo,file:///data/hadoop/data8/tajo,file:///data/hadoop/data9/tajo,file:///data/hadoop/data10/tajo,file:///data/hadoop/data11/tajo,file:///data/hadoop/data12/tajo</value>
  </property>
  <property>
    <name>tajo.worker.tmpdir.cleanup-at-startup</name>
    <value>true</value>
  </property>
  <property>
    <name>tajo.worker.history.expire-interval-minutes</name>
    <value>60</value>
  </property>
  <property>
    <name>tajo.worker.resource.tajo.worker.resource.cpu-cores</name>
    <value>24</value>
  </property>
  <property>
    <name>tajo.worker.resource.memory-mb</name>
    <value>60512</value> <!-- 3584 3 tasks + 1 qm task -->
  </property>
  <property>
    <name>tajo.task.memory-slot-mb.default</name>
    <value>3000</value> <!-- default 512 -->
  </property>
  <property>
    <name>tajo.task.disk-slot.default</name>
    <value>1.0f</value> <!-- default 0.5 -->
  </property>
  <property>
    <name>tajo.shuffle.fetcher.parallel-execution.max-num</name>
    <value>5</value>
  </property>
  <property>
    <name>tajo.executor.external-sort.thread-num</name>
    <value>2</value>
  </property>

  <!-- client -->
  <property>
    <name>tajo.rpc.client.worker-thread-num</name>
    <value>4</value>
  </property>
  <property>
    <name>tajo.cli.print.pause</name>
    <value>false</value>
  </property>

  <!--
  <property>
    <name>tajo.worker.resource.dfs-dir-aware</name>
    <value>true</value>
  </property>
  <property>
    <name>tajo.worker.resource.dedicated</name>
    <value>true</value>
  </property>
  <property>
    <name>tajo.worker.resource.dedicated-memory-ratio</name>
    <value>0.6</value>
  </property>
  -->
</configuration>

tajo-env: export TAJO_WORKER_HEAPSIZE=60000
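To make question 1) concrete, here is a toy sketch of the difference I mean between resource-only random assignment and a locality-aware pick. This is not Tajo's actual scheduler code; all names are made up for illustration:

```python
# Illustrative only (not Tajo's code): prefer a worker that holds an HDFS
# replica of the task's split, and fall back to any worker with capacity.
import random

def assign_task(split_hosts, workers, free_slots):
    """Pick a worker for a task whose HDFS split has replicas on split_hosts.

    Prefer a data-local worker (the task reads from local disk); fall back
    to any worker with free capacity (the task reads over the network).
    """
    local = [w for w in workers if w in split_hosts and free_slots[w] > 0]
    if local:
        return random.choice(local)  # data-local: no network transfer
    candidates = [w for w in workers if free_slots[w] > 0]
    # remote read: this is where the network becomes the bottleneck
    return random.choice(candidates) if candidates else None

workers = ["w1", "w2", "w3"]
slots = {"w1": 2, "w2": 0, "w3": 1}
# The split's replicas are on w2 and w3; w2 has no free slots, so w3 wins.
print(assign_task(["w2", "w3"], workers, slots))  # → w3
```

A purely random allocator would pick w1 two thirds of the time here and pay a remote read, which is why I expect locality awareness to matter at scale.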
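For context on question 2), my rough mental model (which may well be wrong for Tajo's actual planner) is that a plain count only ships one partial count per worker, while count(distinct) must deduplicate values globally, so it has to shuffle every distinct value across the network before counting. A toy sketch of that two-phase shape:

```python
# Illustrative only (not Tajo's planner output): why count(distinct) is
# heavier than a plain count in a shuffle-based execution engine.

def plain_count(partitions):
    # Each worker sends ONE partial count; the final stage sums them.
    partials = [len(p) for p in partitions]
    return sum(partials)

def count_distinct(partitions, n_reducers=2):
    # Phase 1: each worker dedupes locally, then hash-partitions its
    # distinct values so each value lands on exactly one reducer. The
    # shuffle volume grows with the number of distinct values.
    shuffled = [set() for _ in range(n_reducers)]
    for p in partitions:
        for v in set(p):  # local dedup
            shuffled[hash(v) % n_reducers].add(v)
    # Phase 2: each reducer counts its disjoint set; the final stage sums.
    return sum(len(s) for s in shuffled)

parts = [[1, 2, 2, 3], [3, 3, 4], [4, 5]]
print(plain_count(parts))      # → 9 rows
print(count_distinct(parts))   # → 5 distinct values
```

If Tajo implements it differently (e.g. via a sort-based distinct stage), I would appreciate a pointer to the relevant code.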
