Hi, I tested Tajo about half a year ago, then stopped following it because of other work.
This week I set up a small dev Tajo cluster (six nodes, VMs) on Hadoop 2.6.0. My questions are:

1) As far as I knew half a year ago, Tajo ran on YARN and used the YARN scheduler to manage job resources. Now it doesn't seem to rely on YARN: I only started the HDFS daemons, no YARN daemons. So does Tajo have its own job scheduler?

2) Do we need to place replicas of the file on every node of the Tajo cluster? For example, with a six-node Tajo cluster, should I set the HDFS block replication to six? I ask because when I run a Tajo query, some nodes are busy while others are idle, because the file's blocks are located only on the busy nodes and not on the others.

3) The test data set is 4 million rows, several GB in size, but this query runs very slowly:
select count(distinct ID) from ****;
Are there any likely problems here?

Thanks
