It is plain JSON (one JSON message per line). Each JSON message is ~4 KB, and there are ~5 million messages.
store.parquet.compression = snappy (I don't think this parameter is used, as I am only running SELECT queries.)

On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz <[email protected]> wrote:

> What is the data format within those .gz and .bz2 files? Is it Parquet,
> JSON, or plain text (CSV)?
> Also, what was the config parameter `store.parquet.compression` set to
> when you ran your test?
>
> - Khurram
>
> On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane <[email protected]>
> wrote:
>
> > Awaiting a response.
> >
> > On 30-Jul-2016 3:20 PM, "Shankar Mane" <[email protected]> wrote:
> > >
> > > I am comparing query speed between GZ and BZ2.
> > >
> > > Below are the 2 files and their sizes (these 2 files contain the same data):
> > > kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
> > > kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
> > >
> > > Results:
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > > +------------+----------+
> > > | channelid  | EXPR$1   |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (86.034 seconds)
> > >
> > > 0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> > > +------------+----------+
> > > | channelid  | EXPR$1   |
> > > +------------+----------+
> > > | 3          | 977134   |
> > > | 0          | 836850   |
> > > | 2          | 3202854  |
> > > +------------+----------+
> > > 3 rows selected (459.079 seconds)
> > >
> > > Questions:
> > > 1. As per the above test, Gz is roughly 5x faster than Bz2 (86 s vs. 459 s). Why is that?
> > > 2. How can we speed up Bz2? Is there any configuration to tune?
> > > 3. As bz2 is a splittable format, how does Drill use it?
> > >
> > > regards,
> > > shankar
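To separate raw decompression cost from anything Drill itself does, a rough check like the one below could be run outside Drill. This is only a sketch under stated assumptions: the sample data is synthetic (the `payload` field and the message count are made up, only `channelid` and `serverTime` come from the queries above), and the helper names `make_sample` and `time_decompress` are hypothetical. It uses Python's standard `gzip` and `bz2` modules, not Drill's codecs, so the absolute numbers will not match the timings reported above.

```python
import bz2
import gzip
import json
import time


def make_sample(n_messages=2000):
    """Build newline-delimited JSON loosely resembling the test data (synthetic)."""
    lines = []
    for i in range(n_messages):
        # 'payload' is an invented filler field; real messages are ~4 KB each.
        msg = {"channelid": i % 4, "serverTime": 1469404800000 + i, "payload": "x" * 200}
        lines.append(json.dumps(msg))
    return ("\n".join(lines) + "\n").encode("utf-8")


def time_decompress(decompress, blob, repeats=3):
    """Return (best wall-clock seconds, decompressed bytes) over `repeats` runs."""
    best = float("inf")
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = decompress(blob)
        best = min(best, time.perf_counter() - start)
    return best, out


raw = make_sample()
gz_blob = gzip.compress(raw)
bz2_blob = bz2.compress(raw)

gz_secs, gz_out = time_decompress(gzip.decompress, gz_blob)
bz2_secs, bz2_out = time_decompress(bz2.decompress, bz2_blob)

print(f"gzip : {len(gz_blob)} compressed bytes, {gz_secs:.4f} s to decompress")
print(f"bzip2: {len(bz2_blob)} compressed bytes, {bz2_secs:.4f} s to decompress")
```

Running the same comparison on a slice of the real file would give a better idea of how much of the 86 s vs. 459 s gap comes from the codec alone.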
