Hi, I'm using the ORC format for our table storage. The table has a timestamp column (say TS) and 25 other columns. The other ORC features we are using are the storage index and bloom filters. We are loading 100 million records into this table on a 4-node cluster.
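For context, the target table described above might look roughly like the following. This is only a sketch: the table name, column names, and the bloom-filter column list are assumptions, since the post does not include the actual DDL.

```sql
-- Hypothetical DDL for the target ORC table described above.
-- 'orc.create.index' and 'orc.bloom.filter.columns' are standard ORC
-- table properties; the specific column list is an assumption.
CREATE TABLE target (
  ts TIMESTAMP,
  col1 STRING,
  col2 STRING
  -- ... 23 more columns
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'ts,col1'
);
```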
Our source table is a text table in CSV format, and its timestamp values arrive as BIGINT. In the INSERT ... SELECT we use the function from_unixtime(sourceTable.TS) to convert the BIGINT values to timestamps in the target ORC table. So the first INSERT ... SELECT, into a non-partitioned table, looks like this:

1) INSERT INTO TARGET SELECT from_unixtime(ts), col1, col2, ... FROM SOURCE;

I wanted to test partitioning the table by the date derived from this timestamp, so I used to_date(from_unixtime(TS)) in a new INSERT ... SELECT with dynamic partitioning. The second statement is:

2) INSERT INTO TARGET PARTITION (datecol) SELECT from_unixtime(ts), col1, col2, ..., to_date(from_unixtime(ts)) AS datecol FROM SOURCE;

The load time increased by 50% from 1) to 2). I understand that the second statement involves creating many more partition directories and files. Is there any way we can improve the load time? Also, in the second INSERT ... SELECT, will the result of the expression from_unixtime(ts) be reused in to_date(from_unixtime(ts))?

Thanks,
Naveen
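For reference, a dynamic-partition load like statement 2) is usually run with the settings shown below, and one common way to reduce the small-file overhead it mentions is to route all rows for a given date to the same writer with DISTRIBUTE BY. This is a sketch only; the table and column names follow the (assumed) names used in the post.

```sql
-- Settings commonly required for a dynamic-partition INSERT in Hive.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Clustering rows by the partition expression so each reducer writes
-- to few partitions, which tends to reduce the file count per partition.
INSERT INTO TABLE target PARTITION (datecol)
SELECT from_unixtime(ts), col1, col2, -- ... remaining columns
       to_date(from_unixtime(ts)) AS datecol
FROM source
DISTRIBUTE BY to_date(from_unixtime(ts));
```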