Another correction: my comparison of Impala-2.1.2 and Tajo-0.10.0 was unfair, because I used Parquet with Impala and RCFILE (Snappy) with Tajo. I should have used the same file format for both.
Since it is already clear that Impala works better on Parquet than Tajo does, I used RCFILE as the test data this time.

*Tajo*:

default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy;
Progress: 0%, response time: 1.598 sec
Progress: 0%, response time: 1.6 sec
Progress: 0%, response time: 2.003 sec
Progress: 0%, response time: 2.806 sec
Progress: 37%, response time: 3.808 sec
Progress: 100%, response time: 4.792 sec
?sum_3, ?sum_4, ?sum_5
-------------------------------
22557920, 19648838, 2005366694576
(1 rows, 4.792 sec, 32 B selected)

*Impala*:

> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy;
+-------------------------------+-------------------------------+-------------------------------+
| sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
+-------------------------------+-------------------------------+-------------------------------+
| 22557920                      | 19648838                      | 2005366694576                 |
+-------------------------------+-------------------------------+-------------------------------+
Fetched 1 row(s) in 11.12s

On Mon, Mar 16, 2015 at 1:49 PM, Azuryy Yu <[email protected]> wrote:

> There is a typo in my Email. I corrected here:
>
> for example:
>
> <property>
> <name>tajo.master.umbilical-rpc.address</name>
> <value>1-1-1-1:26001</value>
> </property>
>
> which does work under tajo-0.9.0, but it complain "1-1-1-1:2601" is not a
> valid network address under tajo-0.10.0.
>
> I have to change to:
> <property>
> <name>tajo.master.umbilical-rpc.address</name>
> <value>1.1.1.1:26001</value>
> </property>
>
>
> On Mon, Mar 16, 2015 at 1:44 PM, Azuryy Yu <[email protected]> wrote:
>
>> Hi,
>> I compiled tajo-0.10 source based on hadoop-2.6.0, then post some
>> feedback here.
>>
>> My cluster:
>> 1 tajo-master, 9 tajo-worker
>> 24 CPU(logic), 64GB mem, 4TB*12 HDD
>>
>> Feedback:
>> 1) tajo task progress estimate is normal on partitioned table, which is
>> incorrect sometimes in tajo-0.9.0
>> 2) Tajo configuration doesn't support hostname in tajo-site.xml.
>> for example:
>>
>> <property>
>> <name>tajo.master.umbilical-rpc.address</name>
>> <value>1-1-1-1:26001</value>
>> </property>
>>
>> which does work under tajo-0.9.0, but it complain "1-1-1-1:2601" is not a
>> valid network address.
>>
>> I have to change to:
>> <property>
>> <name>tajo.master.umbilical-rpc.address</name>
>> <value>1.1.1.1:26001</value>
>> </property>
>>
>> but we don't use IP in our cluster, only hostname. so I did a little in
>> the code:
>> org.apache.tajo.validation.NetworkAddressValidator.java:
>> hostnamePattern = Pattern.compile("\\d*-\\d*-\\d*-\\d");
>> then It works.
>>
>> 3) I did some test on the parquet, RCFILE(snappy compressed),
>> RCFILE(GZIP compressed)
>>
>> they are the same data, only different from file format.
>> the table has six partitions, 20 RCFILES, each parquet file is 1GB.
>>
>> then rcfile with snappy's performance is similiar to rcfile with gzip.
>> but they are all two~three times better than parquet.
>>
>> 4) I compared tajo-0.10 and Impala-2.1.2,
>> Impala can provide very good support for parquet. more better than Tajo.
>>
>> but impala is more *slow* with other format than Tajo.
>> such as(I don't use WHERE because I want query all six partitions
>> together):
>>
>> *Impala*:
>> > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from par;
>>
>> +-------------------------------+-------------------------------+-------------------------------+
>> | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
>> +-------------------------------+-------------------------------+-------------------------------+
>> | 22557920                      | 19648838                      | 2005366694576                 |
>> +-------------------------------+-------------------------------+-------------------------------+
>> Fetched 1 row(s) in 6.02s
>>
>> *Tajo:*
>>
>> *default*> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy;
>> Progress: 0%, response time: 1.598 sec
>> Progress: 0%, response time: 1.6 sec
>> Progress: 0%, response time: 2.003 sec
>> Progress: 0%, response time: 2.806 sec
>> Progress: 37%, response time: 3.808 sec
>> Progress: 100%, response time: 4.792 sec
>> ?sum_3, ?sum_4, ?sum_5
>> -------------------------------
>> 22557920, 19648838, 2005366694576
>> (1 rows, 4.792 sec, 32 B selected)
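
On the hostname issue from point 2 of the quoted feedback: the pattern I patched in only matches names shaped like 1-1-1-1. A more general check might look like the sketch below. This is only an illustration, not Tajo's actual NetworkAddressValidator code; the class name HostPortCheck and the method isValid are hypothetical, and the rules assumed are the usual RFC 1123 hostname label rules plus a numeric port.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone helper; it does not reproduce Tajo internals.
final class HostPortCheck {
    // One hostname label: letters, digits and hyphens, 1 to 63 characters,
    // not starting or ending with a hyphen (RFC 1123 style).
    private static final String LABEL =
        "[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?";

    // Dot-separated labels, then ':' and a port of 1 to 5 digits.
    private static final Pattern HOST_PORT =
        Pattern.compile("^(" + LABEL + "(?:\\." + LABEL + ")*):(\\d{1,5})$");

    static boolean isValid(String address) {
        Matcher m = HOST_PORT.matcher(address);
        if (!m.matches()) {
            return false;
        }
        // A full hostname is limited to 253 characters.
        if (m.group(1).length() > 253) {
            return false;
        }
        // Five digits can still exceed the largest port number.
        int port = Integer.parseInt(m.group(2));
        return port <= 65535;
    }

    public static void main(String[] args) {
        System.out.println(isValid("1-1-1-1:26001")); // hyphenated hostname
        System.out.println(isValid("1.1.1.1:26001")); // dotted address
        System.out.println(isValid("-bad:26001"));    // leading hyphen
        System.out.println(isValid("host:99999"));    // port out of range
    }
}
```

With these assumed rules, both 1-1-1-1:26001 and 1.1.1.1:26001 are accepted, while a label with a leading hyphen or a port above 65535 is rejected. Whether Tajo should validate this way is a separate question for the developers.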
