OK, I will try to contribute.

2017-08-23 12:06 GMT+08:00 Alexander Behm <[email protected]>:

> Hi Yu,
>
> thanks for your interest in contributing. A 2x speedup sounds great!
>
> We had tried to implement the feature you describe in this JIRA:
> https://issues.apache.org/jira/browse/IMPALA-4105
>
> There's a link to an abandoned code review with an interesting discussion
> on the challenges of the feature.
> Maybe you have additional thoughts and we can continue the discussion on
> that JIRA?
>
> Best,
>
> Alex
>
> On Tue, Aug 22, 2017 at 8:34 PM, Edward Capriolo <[email protected]>
> wrote:
>
> > Previously I found that if you run any command that touches the
> > partition, like adding properties, it causes a refresh of that partition.
> >
> > On Tue, Aug 22, 2017 at 10:40 PM, yu feng <[email protected]> wrote:
> >
> > > Hi, community:
> > >
> > > I have made an improvement to Impala in our environment, and I want to
> > > contribute it to the Impala community. This is our scenario:
> > >
> > > We have a table with three or four partition keys, and almost 1K new
> > > partitions are added to it. A Spark Streaming job writes new data to
> > > existing partitions every 15 minutes (those of the most recent 7 days),
> > > so we have to refresh the partitions of the most recent 7 days, about 7K
> > > partitions.
> > >
> > > However, the whole table has 100,000 partitions and is growing. We have
> > > two choices: refresh the whole table or refresh the 7K partitions.
> > > Refreshing the whole table is the obvious choice, but it takes 5 minutes
> > > to finish. I checked the code (before 2.8.0) and found that refreshing a
> > > table eventually calls this function:
> > >
> > > HdfsTable.load(true,  client, msTbl, true, true, null);
> > >
> > > which reloads the metadata and checks every partition in the table, and
> > > loads every file to check whether it was updated or newly created by
> > > comparing its last modification time and file length.
> > >
> > > Our table has about 1,000,000 files, so the refresh table operation is
> > > slow.
> > >
> > > Hence, we added a new syntax:
> > > REFRESH TABLE xxx PARTITION (day = ('xx1', 'xx2', 'xx3'));
> > > This operation refreshes only the partitions whose day matches xx1, xx2,
> > > or xx3, so we load only the files and partitions of the last 7 days.
> > >
> > > In our tests, this made the operation 2x faster.
> > >
> > > Do you have any suggestions about it? Thanks a lot.
> > >
> >
>
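For illustration, the idea in the quoted proposal (refresh only the partitions whose day is in a given set, and reload a file only when it is new or its modification time or length changed) could be sketched roughly as below. This is a minimal sketch under assumptions, not Impala code: `PartitionRefreshSketch`, `FileMeta`, `Partition`, and `refresh` are hypothetical names, and a plain map stands in for the file system listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; none of these types are Impala classes. */
final class PartitionRefreshSketch {

    /** Hypothetical file metadata: the two attributes the thread says are compared. */
    static final class FileMeta {
        final long modificationTime;
        final long length;

        FileMeta(long modificationTime, long length) {
            this.modificationTime = modificationTime;
            this.length = length;
        }
    }

    /** Hypothetical partition: its "day" key value plus cached file metadata by path. */
    static final class Partition {
        final String day;
        final Map<String, FileMeta> cachedFiles = new HashMap<>();

        Partition(String day) {
            this.day = day;
        }
    }

    /** A file needs reloading if it is not cached yet or its mtime or length differs. */
    static boolean isNewOrChanged(FileMeta cached, FileMeta current) {
        return cached == null
            || cached.modificationTime != current.modificationTime
            || cached.length != current.length;
    }

    /**
     * Refreshes only the partitions whose day is in daysToRefresh.
     * The listing argument maps day -> (path -> current metadata) and is an
     * assumed stand-in for listing the partition directory.
     * Returns the paths whose metadata was reloaded.
     */
    static List<String> refresh(List<Partition> partitions,
                                Set<String> daysToRefresh,
                                Map<String, Map<String, FileMeta>> listing) {
        List<String> reloaded = new ArrayList<>();
        for (Partition p : partitions) {
            if (!daysToRefresh.contains(p.day)) {
                continue; // partitions outside the filter are not touched at all
            }
            Map<String, FileMeta> current = listing.get(p.day);
            if (current == null) {
                current = Collections.emptyMap();
            }
            for (Map.Entry<String, FileMeta> e : current.entrySet()) {
                if (isNewOrChanged(p.cachedFiles.get(e.getKey()), e.getValue())) {
                    p.cachedFiles.put(e.getKey(), e.getValue());
                    reloaded.add(e.getKey());
                }
            }
            // Drop cached entries for files that no longer exist in the listing.
            p.cachedFiles.keySet().retainAll(current.keySet());
        }
        return reloaded;
    }
}
```

With this shape, the cost of a refresh depends on the number of files in the selected partitions rather than on the size of the whole table, which is the effect the proposal is after.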
