Could anyone kindly give me an answer? Thanks.
On Wed, Mar 4, 2015 at 6:12 PM, Azuryy Yu <[email protected]> wrote:
> Hi,
>
> I read the Tajo 0.9.0 source code and found that Tajo uses a simple FIFO
> scheduler. I can accept this at the current stage, but when Tajo picks a
> query from the scheduler queue and allocates workers for it, the
> allocator only considers the available resources on a random worker
> list, then picks a set of workers from it.
>
> 1) So my question is: why don't we consider HDFS locality? Otherwise
> the network will become the bottleneck.
>
> I understand that Tajo currently does not use YARN as a scheduler and
> ships a temporary, simple FIFO scheduler instead. I have also looked at
> https://issues.apache.org/jira/browse/TAJO-540 ; I hope the new Tajo
> scheduler will be similar to Sparrow.
>
> 2) Performance-related.
> I set up a 10-node cluster (1 master, 9 workers):
> 64 GB memory, 24 CPU cores, 12 * 4 TB HDDs per node, and 1.6 GB of test
> data (160 million records).
>
> It works well for some aggregation SQL tests, except count(distinct).
> count(distinct) is very slow - about ten minutes.
> Who can give me a simple explanation of how Tajo handles
> count(distinct)? I can share my tajo-site here:
>
> <configuration>
>   <property>
>     <name>tajo.rootdir</name>
>     <value>hdfs://realtime-cluster/tajo</value>
>   </property>
>
>   <!-- master -->
>   <property>
>     <name>tajo.master.umbilical-rpc.address</name>
>     <value>xx:26001</value>
>   </property>
>   <property>
>     <name>tajo.master.client-rpc.address</name>
>     <value>xx:26002</value>
>   </property>
>   <property>
>     <name>tajo.master.info-http.address</name>
>     <value>xx:26080</value>
>   </property>
>   <property>
>     <name>tajo.resource-tracker.rpc.address</name>
>     <value>xx:26003</value>
>   </property>
>   <property>
>     <name>tajo.catalog.client-rpc.address</name>
>     <value>xx:26005</value>
>   </property>
>
>   <!-- worker -->
>   <property>
>     <name>tajo.worker.tmpdir.locations</name>
>     <value>file:///data/hadoop/data1/tajo,file:///data/hadoop/data2/tajo,file:///data/hadoop/data3/tajo,file:///data/hadoop/data4/tajo,file:///data/hadoop/data5/tajo,file:///data/hadoop/data6/tajo,file:///data/hadoop/data7/tajo,file:///data/hadoop/data8/tajo,file:///data/hadoop/data9/tajo,file:///data/hadoop/data10/tajo,file:///data/hadoop/data11/tajo,file:///data/hadoop/data12/tajo</value>
>   </property>
>   <property>
>     <name>tajo.worker.tmpdir.cleanup-at-startup</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>tajo.worker.history.expire-interval-minutes</name>
>     <value>60</value>
>   </property>
>   <property>
>     <name>tajo.worker.resource.cpu-cores</name>
>     <value>24</value>
>   </property>
>   <property>
>     <name>tajo.worker.resource.memory-mb</name>
>     <value>60512</value> <!-- 3584 3 tasks + 1 qm task -->
>   </property>
>   <property>
>     <name>tajo.task.memory-slot-mb.default</name>
>     <value>3000</value> <!-- default 512 -->
>   </property>
>   <property>
>     <name>tajo.task.disk-slot.default</name>
>     <value>1.0f</value> <!-- default 0.5 -->
>   </property>
>   <property>
>     <name>tajo.shuffle.fetcher.parallel-execution.max-num</name>
>     <value>5</value>
>   </property>
>   <property>
>     <name>tajo.executor.external-sort.thread-num</name>
>     <value>2</value>
>   </property>
>
>   <!-- client -->
>   <property>
>     <name>tajo.rpc.client.worker-thread-num</name>
>     <value>4</value>
>   </property>
>   <property>
>     <name>tajo.cli.print.pause</name>
>     <value>false</value>
>   </property>
>
>   <!--
>   <property>
>     <name>tajo.worker.resource.dfs-dir-aware</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>tajo.worker.resource.dedicated</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>tajo.worker.resource.dedicated-memory-ratio</name>
>     <value>0.6</value>
>   </property>
>   -->
> </configuration>
>
> tajo-env:
>
> export TAJO_WORKER_HEAPSIZE=60000
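To make question 1) concrete, here is a rough sketch of what I mean by locality-aware allocation. This is not Tajo's actual allocator code; the `allocate` helper, the hostnames, and the slot counts are all made up for illustration:

```python
# Sketch only (not Tajo's allocator): prefer a worker that is co-located
# with a split's HDFS blocks, and fall back to the least-loaded worker
# (which implies a remote read over the network) when no local worker
# has free capacity.

def allocate(splits, workers):
    """splits: list of (split_id, [hosts holding the split's blocks]);
    workers: dict mapping host -> free task slots.
    Returns a dict mapping split_id -> chosen host."""
    assignment = {}
    for split_id, block_hosts in splits:
        # First try a data-local worker with free slots.
        local = [h for h in block_hosts if workers.get(h, 0) > 0]
        if local:
            chosen = max(local, key=lambda h: workers[h])
        else:
            # No local capacity: fall back to the least-loaded worker.
            chosen = max(workers, key=lambda h: workers[h])
        workers[chosen] -= 1
        assignment[split_id] = chosen
    return assignment

splits = [("s1", ["w1", "w3"]), ("s2", ["w2"]), ("s3", ["w9"])]
workers = {"w1": 2, "w2": 1, "w3": 2}
print(allocate(splits, workers))
```

The point is that a random worker list ignores `block_hosts` entirely, so every read can become a network read.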
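For 2), my rough mental model of why a distributed count(distinct) is much slower than a plain count() is the shuffle it requires: every distinct value has to be routed to one node so that duplicates can meet and be eliminated. This is only an illustration, not Tajo's actual execution plan; `NUM_NODES` and `count_distinct` are hypothetical:

```python
# Illustration only (not Tajo's plan): a distributed count(distinct col)
# hash-partitions the values so that duplicates land on the same node,
# deduplicates per partition, then sums the partial counts. Shipping all
# values across the network is what makes it expensive, whereas a plain
# count() only needs one number per node.

NUM_NODES = 3  # pretend cluster size

def count_distinct(rows):
    # Stage 1 (map/shuffle): route each value to a node by hash;
    # each node's set deduplicates the values it receives.
    partitions = [set() for _ in range(NUM_NODES)]
    for value in rows:
        partitions[hash(value) % NUM_NODES].add(value)
    # Stage 2 (reduce): each node counts its unique values; sum them.
    return sum(len(p) for p in partitions)

data = [1, 2, 2, 3, 3, 3, 4]
print(count_distinct(data))
```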
