Hi, I'm using the ORC format for our table storage. The table has a timestamp column (say TS) and 25 other columns. The other ORC features we are using are the storage index and bloom filters. We are loading 100 million records into this table on a 4-node cluster.
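For context, the target table described above might look roughly like the following. This is only a sketch: the table name, column names, and the bloom-filter column list are assumptions, since the post does not include the actual DDL.

```sql
-- Hypothetical DDL for the target ORC table described above.
-- 'orc.create.index' and 'orc.bloom.filter.columns' are standard ORC
-- table properties; the specific column list is an assumption.
CREATE TABLE target (
  ts TIMESTAMP,
  col1 STRING,
  col2 STRING
  -- ... 23 more columns
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'ts,col1'
);
```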
Our source table is a text table in CSV format, and its timestamp values arrive as BIGINT. In the INSERT ... SELECT we use the function from_unixtime(sourceTable.TS) to convert the BIGINT values to timestamps in the target ORC table. So the first INSERT ... SELECT, into a non-partitioned table, looks like this:

1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2, ... FROM SOURCE;

I wanted to test partitioning the table by the date derived from this timestamp, so I used to_date(from_unixtime(TS)) in a new INSERT ... SELECT with dynamic partitioning. The second statement is:

2) INSERT INTO TARGET PARTITION (datecol) SELECT from_unixtime(ts), col1, col2, ..., to_date(from_unixtime(ts)) AS datecol FROM SOURCE;

The load time increased by 50% from 1) to 2). I understand that the second statement involves creating many more partition directories and files. Is there any way we can improve the load time? Also, in the second INSERT ... SELECT, will the result of the expression from_unixtime(ts) be reused in to_date(from_unixtime(ts))?

Thanks,
Naveen
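For reference, a dynamic-partition load like statement 2) is usually run with the settings shown below, and one common way to reduce the small-file overhead it mentions is to route all rows for a given date to the same writer with DISTRIBUTE BY. This is a sketch only; the table and column names follow the (assumed) names used in the post.

```sql
-- Settings commonly required for a dynamic-partition INSERT in Hive.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Clustering rows by the partition expression so each reducer writes
-- to few partitions, which tends to reduce the file count per partition.
INSERT INTO TABLE target PARTITION (datecol)
SELECT from_unixtime(ts), col1, col2, -- ... remaining columns
       to_date(from_unixtime(ts)) AS datecol
FROM source
DISTRIBUTE BY to_date(from_unixtime(ts));
```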