What is your hardware setup? Are the bloom filters necessary on all columns? Usually they make only sense for non-numeric columns. Updating bloom filters take time and should be avoided where they do not make sense. Can you provide an example of the data and the select queries that you execute on them? Do you use compression on the tables? If so which? What are the exact times and data volumes?
> On 15 Sep 2016, at 19:56, naveen mahadevuni <[email protected]> wrote: > > Hi, > > I'm using ORC format for our table storage. The table has a timestamp > column(say TS) and 25 other columns. The other ORC properties we are using > arestorage index and bloom filters. We are loading 100 million records in > to this table on a 4-node cluster. > > Our source table is a text table with CSV format. In the source table > timestamp values come as BIGINT. In the INSERT SELECT, we use function > "from_unixtime(sourceTable.TS)" to convert the BIGINT values to timestamp > in the target ORC table. So the first INSERT SELECT in to non-partitioned > table looks like this > > 1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2... from SOURCE. > > I wanted to test by partitioning the table by date derived from this > timestamp, so I used "to_date(from_unixtime(TS))" in the new INSERT SELECT > with dynamic partitioning. The second one is > > 2) INSERT INTO TARGET PARTITION(datecol) SELECT from_unixtime(ts), col1, > col2... to_date(from_unixtime(ts)) as datecol from SOURCE. > > The load time increased by 50% from 1 to 2. I understand the second > statement involves creating many more partition directories and files. > > Is there anyway we can improve the load time? In the second INSERT SELECT, > will the result of the expression "from_unixtime(ts)" be reused in > "to_date(from_unixtime(ts))"? > > Thanks, > Naveen
