Hi Yu,

Thanks for your interest in contributing. A 2x speedup sounds great!

We tried to implement the feature you describe in this JIRA:
https://issues.apache.org/jira/browse/IMPALA-4105

There's a link to an abandoned code review with an interesting discussion
on the challenges of the feature.
Maybe you have additional thoughts and we can continue the discussion on
that JIRA?

Best,

Alex

On Tue, Aug 22, 2017 at 8:34 PM, Edward Capriolo <[email protected]>
wrote:

> Previously I found that running any command that touches a partition,
> such as adding properties, causes a refresh of that partition.
>
> On Tue, Aug 22, 2017 at 10:40 PM, yu feng <[email protected]> wrote:
>
> > Hi, community:
> >
> > I have made an improvement to Impala in our environment, and I want to
> > contribute it to the Impala community. This is our scenario:
> >
> > We have a table with three or four partition keys, and about 1K new
> > partitions are added to it per day. A Spark Streaming job writes new data
> > to existing partitions every 15 minutes (within the most recent 7 days), so
> > we have to refresh the partitions of the most recent 7 days, about 7K
> > partitions.
> >
> > However, the whole table has 100,000 partitions and is growing. We have two
> > choices: refresh the whole table or refresh the 7K partitions. The obvious
> > choice is to refresh the whole table, but that takes 5 minutes to finish. I
> > checked the code (before 2.8.0) and found that refreshing a table
> > eventually calls this function:
> >
> > HdfsTable.load(true,  client, msTbl, true, true, null);
> >
> > which reloads the metadata, checks every partition that exists in the
> > table, and loads every file to determine whether it was updated or newly
> > created by comparing its last modification time and file length.
> >
> > Our table has about 1 million files, so the refresh table operation is
> > slow.
> >
> > Hence, we created a new statement: REFRESH TABLE xxx PARTITION (day = ('xx1',
> > 'xx2', 'xx3')); it refreshes only the partitions whose day matches xx1,
> > xx2, or xx3. This way, we load only the files and partitions of the last
> > 7 days.
> >
> > In our tests, this speeds up the operation by 2x.
> >
> > Do you have any suggestions about it? Thanks a lot.
> >
>
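
For illustration, the two ideas in the quoted message (limiting a refresh to
partitions whose day is in a given list, and treating a file as changed when
its modification time or length differs) could be sketched roughly as below.
This is a minimal standalone Java sketch, not Impala code: the names
PartitionRefreshSketch, FileMeta, needsReload and selectPartitions are all
hypothetical, and partitions are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionRefreshSketch {
    // Hypothetical stand-in for cached file metadata; not an Impala class.
    static final class FileMeta {
        final long modificationTime;
        final long length;

        FileMeta(long modificationTime, long length) {
            this.modificationTime = modificationTime;
            this.length = length;
        }
    }

    // A file needs reloading if it is not cached yet (newly created) or if
    // its modification time or length differs from the cached values.
    static boolean needsReload(FileMeta cached, FileMeta current) {
        return cached == null
                || cached.modificationTime != current.modificationTime
                || cached.length != current.length;
    }

    // Returns the names of the partitions whose "day" value is in the
    // requested set, so only those would be refreshed.
    static List<String> selectPartitions(Map<String, String> partitionToDay,
                                         Set<String> days) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : partitionToDay.entrySet()) {
            if (days.contains(entry.getValue())) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Assumed example data: partition name -> value of its "day" key.
        Map<String, String> partitions = new LinkedHashMap<>();
        partitions.put("day=xx1/hour=00", "xx1");
        partitions.put("day=xx2/hour=00", "xx2");
        partitions.put("day=xx9/hour=00", "xx9");

        System.out.println(selectPartitions(partitions, Set.of("xx1", "xx2")));
        System.out.println(needsReload(new FileMeta(100L, 10L),
                                       new FileMeta(200L, 10L)));
    }
}
```

With a filter like this, the per-file check only runs for files in the
selected partitions instead of for every file in the table.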
