Previously, I found that running any command that touches a partition, such as adding properties, caused a refresh of that partition.

On Tue, Aug 22, 2017 at 10:40 PM, yu feng <[email protected]> wrote:

> Hi, community:
>
> I have made an improvement to Impala in our environment, and I would like
> to contribute it to the Impala community. This is our scenario:
>
> We have a table with three or four partition keys, and almost 1K new
> partitions are added to it. A Spark Streaming job writes new data to
> existing partitions every 15 minutes (into the most recent 7 days), so we
> have to refresh the partitions for the most recent 7 days, about 7K
> partitions.
>
> However, the whole table has about 100K partitions and is growing. We have
> two choices: refresh the whole table or refresh the 7K partitions. We
> obviously should choose to refresh the table, but that takes 5 minutes to
> finish. I checked the code (before 2.8.0) and found that refreshing a table
> eventually calls this function:
>
> HdfsTable.load(true, client, msTbl, true, true, null);
>
> It tries to reload the metadata and check every partition that exists in
> the table, and it loads every file to check whether the file was updated or
> newly created by comparing its last modification time and file length.
>
> Our table has about 1M files, so the refresh table operation is slow.
>
> Hence, we created a new usage:
>
> REFRESH TABLE xxx PARTITION (day = ('xx1', 'xx2', 'xx3'));
>
> This operation refreshes only the partitions whose day matches one of
> xx1/xx2/xx3. In this way, we load only the files and partitions from the
> last 7 days.
>
> In our tests, this makes the operation about 2x faster.
>
> Do you have any suggestions about it? Thanks a lot.
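
For anyone following the thread, here is a rough Python sketch of the idea as I read it: reload only the requested day partitions and flag a file as changed when its modification time or length differs from the cached copy. This is not Impala's code. The names FileMeta, refresh_days and list_files are made up for illustration, and the in-memory dicts stand in for the catalog cache and the HDFS listing.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class FileMeta(NamedTuple):
    mtime: int   # last modification time (hypothetical unit, e.g. epoch millis)
    length: int  # file size in bytes


Listing = Dict[str, FileMeta]  # file path -> metadata


def refresh_days(cache: Dict[str, Listing],
                 list_files: Callable[[str], Listing],
                 days: Iterable[str]) -> Dict[str, List[str]]:
    """Reload only the given day partitions; return changed paths per day."""
    changed: Dict[str, List[str]] = {}
    for day in days:
        old = cache.get(day, {})
        new = list_files(day)
        # A file is new or updated if its (mtime, length) differs from the cache.
        touched = [path for path, meta in new.items() if old.get(path) != meta]
        # Files that disappeared since the last load also count as changes.
        touched += [path for path in old if path not in new]
        cache[day] = dict(new)
        if touched:
            changed[day] = sorted(touched)
    return changed


# Hypothetical data: two cached day partitions, both modified on storage.
cache = {"d1": {"a": FileMeta(1, 10)}, "d2": {"b": FileMeta(1, 10)}}
storage = {"d1": {"a": FileMeta(2, 12), "c": FileMeta(2, 5)},
           "d2": {"b": FileMeta(9, 99)}}

# Only "d1" is refreshed; "d2" is never listed, so its cache entry stays stale.
result = refresh_days(cache, lambda day: storage.get(day, {}), ["d1"])
```

The cost scales with the number of files in the requested partitions, not with the whole table, which is where the saving would come from.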
