OK, I will try to contribute.

2017-08-23 12:06 GMT+08:00 Alexander Behm <[email protected]>:
> Hi Yu,
>
> thanks for your interest in contributing. A 2x speedup sounds great!
>
> We had tried to implement the feature you describe in this JIRA:
> https://issues.apache.org/jira/browse/IMPALA-4105
>
> There's a link to an abandoned code review with an interesting discussion
> on the challenges of the feature.
> Maybe you have additional thoughts and we can continue the discussion on
> that JIRA?
>
> Best,
>
> Alex
>
> On Tue, Aug 22, 2017 at 8:34 PM, Edward Capriolo <[email protected]>
> wrote:
>
> > Previously I found that if you run any command that touches the
> > partition, like adding properties, it caused a refresh of that partition.
> >
> > On Tue, Aug 22, 2017 at 10:40 PM, yu feng <[email protected]> wrote:
> >
> > > Hi, community:
> > >
> > > I have made an improvement to Impala in our environment, and I want to
> > > contribute it to the Impala community. This is our scenario:
> > >
> > > We have a table with three or four partition keys, and the table has
> > > almost 1K partitions to be added. A Spark Streaming job writes new data
> > > to existing partitions every 15 min (within the most recent 7 days), so
> > > we have to refresh the partitions of the recent 7 days, about 7K
> > > partitions.
> > >
> > > However, the whole table has about 100,000 (10W) partitions and is
> > > growing. We have two choices: refresh the whole table or refresh the 7K
> > > partitions. We obviously should choose to refresh the table, but it
> > > takes 5 min to finish. I checked the code (before 2.8.0) and found that
> > > refreshing a table finally calls the function:
> > >
> > > HdfsTable.load(true, client, msTbl, true, true, null);
> > >
> > > which tries to reload the metadata and check every partition existing in
> > > the table, and loads every file to check whether the file was updated or
> > > newly created by checking its last modification time and file length.
> > >
> > > In our table there are about 1,000,000 (100W) files, so the refresh
> > > table operation is slow.
> > >
> > > Hence, we created a new usage: REFRESH TABLE xxx PARTITION (day = ('xx1',
> > > 'xx2', 'xx3')); the operation refreshes only the partitions whose day
> > > matches one of (xx1/xx2/xx3). In this way, we load only the files and
> > > partitions of the last 7 days.
> > >
> > > In our tests, this makes the operation 2x faster.
> > >
> > > Do you have any suggestions about it? Thanks a lot.
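
For reference, the partition-filtered refresh described in the quoted message could be sketched roughly as below. This is only an illustration of the idea (select the partitions whose day value matches, then compare each file's last modification time and length against the cached values), not the actual Impala code. The class PartitionRefreshSketch, the types Partition and FileMeta, and the methods needsReload, selectPartitions and refresh are all hypothetical names made up for this sketch, and the file listing is passed in as a plain map instead of being read from HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Impala's HdfsTable implementation.
public class PartitionRefreshSketch {
    /** The two file attributes the refresh compares. */
    static final class FileMeta {
        final long modificationTime;
        final long length;
        FileMeta(long modificationTime, long length) {
            this.modificationTime = modificationTime;
            this.length = length;
        }
    }

    /** A partition with its "day" key value and cached file metadata keyed by file name. */
    static final class Partition {
        final String path;
        final String day;
        final Map<String, FileMeta> cachedFiles = new HashMap<>();
        Partition(String path, String day) {
            this.path = path;
            this.day = day;
        }
    }

    /** A file is reloaded if it is new or its modification time or length changed. */
    static boolean needsReload(FileMeta cached, FileMeta current) {
        return cached == null
                || cached.modificationTime != current.modificationTime
                || cached.length != current.length;
    }

    /** Keep only the partitions whose day value is in the requested set. */
    static List<Partition> selectPartitions(List<Partition> all, Set<String> days) {
        List<Partition> selected = new ArrayList<>();
        for (Partition p : all) {
            if (days.contains(p.day)) {
                selected.add(p);
            }
        }
        return selected;
    }

    /**
     * Refreshes only the matching partitions against a file listing keyed by
     * partition path. Returns the number of files that were (re)loaded.
     */
    static int refresh(List<Partition> all, Set<String> days,
                       Map<String, Map<String, FileMeta>> listingByPartition) {
        int reloaded = 0;
        for (Partition p : selectPartitions(all, days)) {
            Map<String, FileMeta> listing = listingByPartition.get(p.path);
            if (listing == null) {
                listing = Collections.emptyMap();
            }
            for (Map.Entry<String, FileMeta> e : listing.entrySet()) {
                if (needsReload(p.cachedFiles.get(e.getKey()), e.getValue())) {
                    p.cachedFiles.put(e.getKey(), e.getValue());
                    reloaded++;
                }
            }
            // Forget cached files that no longer exist in the partition.
            p.cachedFiles.keySet().retainAll(listing.keySet());
        }
        return reloaded;
    }

    public static void main(String[] args) {
        List<Partition> parts = new ArrayList<>();
        parts.add(new Partition("/t/day=xx1/h=0", "xx1"));
        parts.add(new Partition("/t/day=old/h=0", "old"));

        Map<String, Map<String, FileMeta>> listing = new HashMap<>();
        listing.put("/t/day=xx1/h=0", Collections.singletonMap("a.parq", new FileMeta(100L, 10L)));
        listing.put("/t/day=old/h=0", Collections.singletonMap("b.parq", new FileMeta(50L, 20L)));

        Set<String> days = new HashSet<>(Arrays.asList("xx1", "xx2", "xx3"));
        System.out.println(refresh(parts, days, listing)); // first pass loads the one file under day=xx1
        System.out.println(refresh(parts, days, listing)); // nothing changed, so nothing is reloaded
    }
}
```

In this sketch the partition under day=old is never listed or compared, which is where the saving would come from when only a small fraction of the table's partitions receive new data.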
