The RCFile data with Snappy was generated by my MR job, which reads text files and writes RCFiles compressed with Snappy-1.1.2.
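For a rough sense of scale, the pf=pc numbers quoted below can be worked through as in the following sketch. It only uses the file sizes and Tajo response times from the listings; the variable names are mine, and bytes-per-second over the whole file size is only a crude figure, since a columnar reader does not have to touch every byte of a file.

```python
# File sizes in bytes for dt=20150301/pf=pc, copied from the `hadoop fs -ls` output.
parquet_sizes = [1062932057, 1063205684, 1063236005, 543786632]
rcfile_snappy_sizes = [
    144059045, 144178118, 143642438, 143553142, 143849627,
    144648456, 144647502, 144551053, 144017287, 144205111,
    145066506, 144740791, 144198266, 143575440, 143922343,
    143930019, 144253019, 144175506, 143072995, 143818118,
]

# Response times in seconds reported by Tajo for the pf='pc' query.
tajo_rcfile_sec = 2.372
tajo_parquet_sec = 11.14

GIB = 1024 ** 3  # bytes per GiB

parquet_total = sum(parquet_sizes)
rcfile_total = sum(rcfile_snappy_sizes)

# How many times longer the Parquet query took than the RCFile one.
slowdown = tajo_parquet_sec / tajo_rcfile_sec

# Crude scan rate: total on-disk bytes divided by response time.
parquet_rate = parquet_total / GIB / tajo_parquet_sec
rcfile_rate = rcfile_total / GIB / tajo_rcfile_sec

print(f"Parquet: {parquet_total / GIB:.2f} GiB in {tajo_parquet_sec} s "
      f"({parquet_rate:.2f} GiB/s)")
print(f"RCFile+Snappy: {rcfile_total / GIB:.2f} GiB in {tajo_rcfile_sec} s "
      f"({rcfile_rate:.2f} GiB/s)")
print(f"Parquet query took {slowdown:.1f}x as long")
```

Note that the Parquet files for this partition are also larger on disk than the RCFile ones, so the two runs do not scan the same number of bytes.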
On Mon, Mar 16, 2015 at 3:13 PM, Azuryy Yu <[email protected]> wrote:
> PS. My Parquet data was generated by Impala: "Insert into a parquet table
> [SHUFFLE] ... AS select .... from a text table"
>
> On Mon, Mar 16, 2015 at 3:11 PM, Azuryy Yu <[email protected]> wrote:
>
>> Hi Jihoon,
>>
>> Here is an example:
>> My data (each Parquet file is limited to 1GB):
>> hadoop fs -ls /data/basetable/par/dt=20150301/pf=pc
>>
>> -rw-r--r-- 9 hadoop tajo 1062932057 2015-03-12 15:08 /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.0.parq
>> -rw-r--r-- 9 hadoop tajo 1063205684 2015-03-12 15:11 /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.1.parq
>> -rw-r--r-- 9 hadoop tajo 1063236005 2015-03-12 15:14 /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.2.parq
>> -rw-r--r-- 9 hadoop tajo 543786632 2015-03-12 15:16 /data/basetable/par/dt=20150301/pf=pc/cc456c9d427c88a3-3ead7e35ecf0da8_448517166_data.3.parq
>>
>> hadoop fs -ls /data/basetable/snappy/dt=20150301/pf=pc
>>
>> -rw-r--r-- 9 tajo tajo 144059045 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00000
>> -rw-r--r-- 9 tajo tajo 144178118 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00001
>> -rw-r--r-- 9 tajo tajo 143642438 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00002
>> -rw-r--r-- 9 tajo tajo 143553142 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00003
>> -rw-r--r-- 9 tajo tajo 143849627 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00004
>> -rw-r--r-- 9 tajo tajo 144648456 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00005
>> -rw-r--r-- 9 tajo tajo 144647502 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00006
>> -rw-r--r-- 9 tajo tajo 144551053 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00007
>> -rw-r--r-- 9 tajo tajo 144017287 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00008
>> -rw-r--r-- 9 tajo tajo 144205111 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00009
>> -rw-r--r-- 9 tajo tajo 145066506 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00010
>> -rw-r--r-- 9 tajo tajo 144740791 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00011
>> -rw-r--r-- 9 tajo tajo 144198266 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00012
>> -rw-r--r-- 9 tajo tajo 143575440 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00013
>> -rw-r--r-- 9 tajo tajo 143922343 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00014
>> -rw-r--r-- 9 tajo tajo 143930019 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00015
>> -rw-r--r-- 9 tajo tajo 144253019 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00016
>> -rw-r--r-- 9 tajo tajo 144175506 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00017
>> -rw-r--r-- 9 tajo tajo 143072995 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00018
>> -rw-r--r-- 9 tajo tajo 143818118 2015-03-16 11:48 /data/basetable/snappy/dt=20150301/pf=pc/part-r-00019
>>
>> Result:
>>
>> default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy where pf='pc';
>> Progress: 19%, response time: 1.87 sec
>> Progress: 19%, response time: 1.873 sec
>> Progress: 19%, response time: 2.276 sec
>> Progress: 100%, response time: 2.372 sec
>> ?sum_3, ?sum_4, ?sum_5
>> -------------------------------
>> 6928463, 6183665, 6055494385
>> (1 rows, 2.372 sec, 27 B selected)
>>
>> default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from par where pf='pc';
>> Progress: 0%, response time: 0.751 sec
>> Progress: 0%, response time: 0.753 sec
>> Progress: 0%, response time: 1.155 sec
>> Progress: 0%, response time: 1.959 sec
>> Progress: 0%, response time: 2.962 sec
>> Progress: 0%, response time: 3.965 sec
>> Progress: 0%, response time: 4.968 sec
>> Progress: 0%, response time: 5.97 sec
>> Progress: 12%, response time: 6.974 sec
>> Progress: 12%, response time: 7.977 sec
>> Progress: 12%, response time: 8.979 sec
>> Progress: 12%, response time: 9.982 sec
>> Progress: 25%, response time: 10.985 sec
>> Progress: 100%, response time: 11.14 sec
>> ?sum_3, ?sum_4, ?sum_5
>> -------------------------------
>> 6928463, 6183665, 6055494385
>> (1 rows, 11.14 sec, 27 B selected)
>>
>> On Mon, Mar 16, 2015 at 2:58 PM, Jihoon Son <[email protected]> wrote:
>>
>>> Azuryy, thanks for your feedback.
>>> These are very interesting results.
>>> Would you mind telling me how much slower Tajo with Parquet is than Tajo
>>> with RCFile?
>>>
>>> Thanks,
>>> Jihoon
>>>
>>> On Mon, Mar 16, 2015 at 3:39 PM Hyunsik Choi <[email protected]> wrote:
>>>
>>> > Hi Azuryy,
>>> >
>>> > Thanks for sharing the test results. They are very inspiring to us.
>>> > Also, I'll file some JIRA issues for the problems that you found.
>>> >
>>> > Best regards,
>>> > Hyunsik
>>> >
>>> > On Sun, Mar 15, 2015 at 10:58 PM, Azuryy Yu <[email protected]> wrote:
>>> > > Another fix:
>>> > > My comparison of Impala-2.1.2 and Tajo-0.10.0 was unfair, because I
>>> > > used Parquet with Impala and RCFILE (Snappy) with Tajo. I should use
>>> > > the same file format for both.
>>> > >
>>> > > Since I've already reached a clear conclusion that Impala works better
>>> > > on Parquet than Tajo does, I use RCFILE as the test data here.
>>> > >
>>> > > *Tajo*:
>>> > > default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy;
>>> > > Progress: 0%, response time: 1.598 sec
>>> > > Progress: 0%, response time: 1.6 sec
>>> > > Progress: 0%, response time: 2.003 sec
>>> > > Progress: 0%, response time: 2.806 sec
>>> > > Progress: 37%, response time: 3.808 sec
>>> > > Progress: 100%, response time: 4.792 sec
>>> > > ?sum_3, ?sum_4, ?sum_5
>>> > > -------------------------------
>>> > > 22557920, 19648838, 2005366694576
>>> > > (1 rows, 4.792 sec, 32 B selected)
>>> > >
>>> > > *Impala*:
>>> > > > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy;
>>> > > +-------------------------------+-------------------------------+-------------------------------+
>>> > > | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
>>> > > +-------------------------------+-------------------------------+-------------------------------+
>>> > > | 22557920                      | 19648838                      | 2005366694576                 |
>>> > > +-------------------------------+-------------------------------+-------------------------------+
>>> > > Fetched 1 row(s) in 11.12s
>>> > >
>>> > > On Mon, Mar 16, 2015 at 1:49 PM, Azuryy Yu <[email protected]> wrote:
>>> > >
>>> > >> There was a typo in my email. I have corrected it here:
>>> > >>
>>> > >> for example:
>>> > >>
>>> > >> <property>
>>> > >>   <name>tajo.master.umbilical-rpc.address</name>
>>> > >>   <value>1-1-1-1:26001</value>
>>> > >> </property>
>>> > >>
>>> > >> which does work under tajo-0.9.0, but it complains that "1-1-1-1:2601"
>>> > >> is not a valid network address under tajo-0.10.0.
>>> > >>
>>> > >> I have to change it to:
>>> > >> <property>
>>> > >>   <name>tajo.master.umbilical-rpc.address</name>
>>> > >>   <value>1.1.1.1:26001</value>
>>> > >> </property>
>>> > >>
>>> > >> On Mon, Mar 16, 2015 at 1:44 PM, Azuryy Yu <[email protected]> wrote:
>>> > >>
>>> > >>> Hi,
>>> > >>> I compiled the tajo-0.10 source against hadoop-2.6.0, and am posting
>>> > >>> some feedback here.
>>> > >>>
>>> > >>> My cluster:
>>> > >>> 1 tajo-master, 9 tajo-worker
>>> > >>> 24 CPU (logical), 64GB mem, 4TB*12 HDD
>>> > >>>
>>> > >>> Feedback:
>>> > >>> 1) Tajo's task progress estimate is normal on partitioned tables; it
>>> > >>> was sometimes incorrect in tajo-0.9.0.
>>> > >>> 2) Tajo configuration doesn't support hostnames in tajo-site.xml.
>>> > >>> for example:
>>> > >>>
>>> > >>> <property>
>>> > >>>   <name>tajo.master.umbilical-rpc.address</name>
>>> > >>>   <value>1-1-1-1:26001</value>
>>> > >>> </property>
>>> > >>>
>>> > >>> which does work under tajo-0.9.0, but it complains that "1-1-1-1:2601"
>>> > >>> is not a valid network address.
>>> > >>>
>>> > >>> I have to change it to:
>>> > >>> <property>
>>> > >>>   <name>tajo.master.umbilical-rpc.address</name>
>>> > >>>   <value>1.1.1.1:26001</value>
>>> > >>> </property>
>>> > >>>
>>> > >>> But we don't use IPs in our cluster, only hostnames, so I made a small
>>> > >>> change in the code:
>>> > >>> org.apache.tajo.validation.NetworkAddressValidator.java:
>>> > >>> hostnamePattern = Pattern.compile("\\d*-\\d*-\\d*-\\d");
>>> > >>> Then it works.
>>> > >>>
>>> > >>> 3) I did some tests on Parquet, RCFILE (Snappy compressed) and
>>> > >>> RCFILE (GZIP compressed).
>>> > >>>
>>> > >>> They hold the same data; only the file format differs.
>>> > >>> The table has six partitions, 20 RCFILES, and each Parquet file is 1GB.
>>> > >>>
>>> > >>> RCFile with Snappy performs similarly to RCFile with GZIP,
>>> > >>> but both are two to three times better than Parquet.
>>> > >>>
>>> > >>> 4) I compared tajo-0.10 and Impala-2.1.2.
>>> > >>> Impala provides very good support for Parquet, much better than Tajo,
>>> > >>>
>>> > >>> but Impala is *slower* than Tajo with other formats.
>>> > >>> For example (I don't use WHERE because I want to query all six
>>> > >>> partitions together):
>>> > >>>
>>> > >>> *Impala*:
>>> > >>> > select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from par;
>>> > >>> +-------------------------------+-------------------------------+-------------------------------+
>>> > >>> | sum(cast(movie_vv as bigint)) | sum(cast(movie_cv as bigint)) | sum(cast(movie_pt as bigint)) |
>>> > >>> +-------------------------------+-------------------------------+-------------------------------+
>>> > >>> | 22557920                      | 19648838                      | 2005366694576                 |
>>> > >>> +-------------------------------+-------------------------------+-------------------------------+
>>> > >>> Fetched 1 row(s) in 6.02s
>>> > >>>
>>> > >>> *Tajo*:
>>> > >>>
>>> > >>> default> select sum (cast(movie_vv as bigint)), sum(cast(movie_cv as bigint)),sum(cast(movie_pt as bigint)) from snappy;
>>> > >>> Progress: 0%, response time: 1.598 sec
>>> > >>> Progress: 0%, response time: 1.6 sec
>>> > >>> Progress: 0%, response time: 2.003 sec
>>> > >>> Progress: 0%, response time: 2.806 sec
>>> > >>> Progress: 37%, response time: 3.808 sec
>>> > >>> Progress: 100%, response time: 4.792 sec
>>> > >>> ?sum_3, ?sum_4, ?sum_5
>>> > >>> -------------------------------
>>> > >>> 22557920, 19648838, 2005366694576
>>> > >>> (1 rows, 4.792 sec, 32 B selected)

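On the hostname validation point (item 2 in the first message): the pattern `\d*-\d*-\d*-\d` only admits hosts that look like `1-1-1-1`. A less specific check could accept any dot-separated hostname made of letters, digits and hyphens, followed by a port. The sketch below illustrates that idea in Python; it is not Tajo's `NetworkAddressValidator`, and the names `LABEL`, `HOST_PORT` and `is_valid_address` are hypothetical.

```python
import re

# One hostname label: 1-63 letters, digits or hyphens,
# neither starting nor ending with a hyphen.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"

# Hypothetical pattern: dot-separated labels, then ":" and a port number.
HOST_PORT = re.compile(rf"(?P<host>{LABEL}(?:\.{LABEL})*):(?P<port>\d{{1,5}})")


def is_valid_address(value: str) -> bool:
    """Return True if value looks like host:port with a port in 1..65535."""
    match = HOST_PORT.fullmatch(value)
    if match is None:
        return False
    return 0 < int(match.group("port")) <= 65535


for candidate in ("1-1-1-1:26001", "1.1.1.1:26001", "-bad-:26001"):
    print(candidate, is_valid_address(candidate))
```

This accepts both the hostname form and the dotted IP form from the configuration examples, so the same tajo-site.xml value would pass either way under such a check.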