1) We are using 4 indentical AWS machines. 8 vCPUs, 32 GB RAM. 1 TB storage
2) Setting up bloom filters only on two other string columns. Not all of
3) The data is any event data ex: Syslog.
4) Queries usually run on timestamp range with additional predicates on
other columns (mostly equality)
4) We use SNAPPY compression with 256 MB blocks.
5) ORC stripe size is 256MB, HDFS block size is 128 MB
6) The time for first INSERT is 206 seconds and the second one is 302
On Fri, Sep 16, 2016 at 4:57 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> What is your hardware setup?
> Are the bloom filters necessary on all columns? Usually they make only
> sense for non-numeric columns. Updating bloom filters take time and should
> be avoided where they do not make sense.
> Can you provide an example of the data and the select queries that you
> execute on them?
> Do you use compression on the tables? If so which?
> What are the exact times and data volumes?
> > On 15 Sep 2016, at 19:56, naveen mahadevuni <nmahadev...@gmail.com>
> > Hi,
> > I'm using ORC format for our table storage. The table has a timestamp
> > column(say TS) and 25 other columns. The other ORC properties we are
> > arestorage index and bloom filters. We are loading 100 million records in
> > to this table on a 4-node cluster.
> > Our source table is a text table with CSV format. In the source table
> > timestamp values come as BIGINT. In the INSERT SELECT, we use function
> > "from_unixtime(sourceTable.TS)" to convert the BIGINT values to
> > in the target ORC table. So the first INSERT SELECT in to non-partitioned
> > table looks like this
> > 1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2... from
> > I wanted to test by partitioning the table by date derived from this
> > timestamp, so I used "to_date(from_unixtime(TS))" in the new INSERT
> > with dynamic partitioning. The second one is
> > 2) INSERT INTO TARGET PARTITION(datecol) SELECT from_unixtime(ts), col1,
> > col2... to_date(from_unixtime(ts)) as datecol from SOURCE.
> > The load time increased by 50% from 1 to 2. I understand the second
> > statement involves creating many more partition directories and files.
> > Is there anyway we can improve the load time? In the second INSERT
> > will the result of the expression "from_unixtime(ts)" be reused in
> > "to_date(from_unixtime(ts))"?
> > Thanks,
> > Naveen