[jira] [Created] (IMPALA-8167) Refresh´s on NON-partitioned tables ALWAYS reads all the files block locations taking too long on BIG TABLES.

Gabriel Gatto (JIRA) Wed, 06 Feb 2019 07:39:56 -0800

Gabriel Gatto created IMPALA-8167:
-------------------------------------

             Summary: Refresh´s on NON-partitioned tables ALWAYS reads all the 
files block locations taking too long on BIG TABLES.
                 Key: IMPALA-8167
                 URL: https://issues.apache.org/jira/browse/IMPALA-8167
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog
    Affects Versions: Impala 2.12.0
            Reporter: Gabriel Gatto



REFRESH on a non-partitioned table always fetches the block locations of every 
file using the "getFileBlockLocations" method, whether or not any files are 
new.

We think the problem is located in the method "updateUnpartitionedTableFileMd".

This method always resets the partitions and adds a new one with no file 
descriptors. As a result, refreshPartitionFileMetadata(part) always has to read 
every file of the new partition to rebuild the information, so 
getFileBlockLocations is always called for all the files, whether they are new 
or old.

This is confirmed by looking at the code:

 

  private void updateUnpartitionedTableFileMd() throws Exception {
    if (LOG.isTraceEnabled()) {
      LOG.trace("update unpartitioned table: " + getFullName());
    }
    resetPartitions();  // ---> DROPS PARTITION WITH PREVIOUS FILEDESCRIPTOR INFO.
    org.apache.hadoop.hive.metastore.api.Table msTbl = getMetaStoreTable();
    Preconditions.checkNotNull(msTbl);
    addDefaultPartition(msTbl.getSd());
    HdfsPartition part = createPartition(msTbl.getSd(), null);  // ---> CREATES NEW PARTITION.
    addPartition(part);
    if (isMarkedCached_) part.markCached();

    LOG.info("Refreshing-updateUnpartitionedTableFileMd(): " + getFullName() +
             " Location: " + part.getLocation() +
             " FileDescriptors: " + part.getFileDescriptors().size());

    refreshPartitionFileMetadata(part);

    LOG.info("Refreshed-updateUnpartitionedTableFileMd(): " + getFullName() +
             " Location: " + part.getLocation() +
             " FileDescriptors: " + part.getFileDescriptors().size());
  }
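
One possible direction, as a rough sketch only: keep the previous file 
descriptors and look up block locations only for files that are new or whose 
length or modification time changed. The names below (IncrementalRefreshSketch, 
FileStat, FileDesc, BlockLocationFetcher, refresh) are hypothetical and are not 
Impala classes. The sketch also assumes that an unchanged length and 
modification time mean the cached block locations are still good enough, which 
may not hold in every case (for example after blocks have been moved).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in Impala. */
class IncrementalRefreshSketch {

  /** Stand-in for one entry of a directory listing. */
  static final class FileStat {
    final String name;
    final long length;
    final long modificationTime;

    FileStat(String name, long length, long modificationTime) {
      this.name = name;
      this.length = length;
      this.modificationTime = modificationTime;
    }

    /** Assumption: same length and mtime means the file is unchanged. */
    boolean sameAs(FileStat other) {
      return length == other.length && modificationTime == other.modificationTime;
    }
  }

  /** Stand-in for a cached file descriptor with its block locations. */
  static final class FileDesc {
    final FileStat stat;
    final List<String> blockHosts;

    FileDesc(FileStat stat, List<String> blockHosts) {
      this.stat = stat;
      this.blockHosts = blockHosts;
    }
  }

  /** Stand-in for the expensive per-file block location lookup. */
  interface BlockLocationFetcher {
    List<String> fetch(FileStat stat);
  }

  /**
   * Builds the descriptor map for a fresh listing. Descriptors of unchanged
   * files are reused; the fetcher is called only for new or changed files.
   * Files missing from the listing are dropped.
   */
  static Map<String, FileDesc> refresh(Map<String, FileDesc> previous,
      List<FileStat> listing, BlockLocationFetcher fetcher) {
    Map<String, FileDesc> result = new HashMap<>();
    for (FileStat stat : listing) {
      FileDesc cached = previous.get(stat.name);
      if (cached != null && cached.stat.sameAs(stat)) {
        result.put(stat.name, cached);
      } else {
        result.put(stat.name, new FileDesc(stat, fetcher.fetch(stat)));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] lookups = {0};
    BlockLocationFetcher fetcher = stat -> {
      lookups[0]++;
      return Collections.singletonList("host-a");
    };

    List<FileStat> listing = new ArrayList<>();
    listing.add(new FileStat("f1", 10L, 100L));
    listing.add(new FileStat("f2", 20L, 100L));

    // First refresh: nothing cached, so every file needs a lookup.
    Map<String, FileDesc> first = refresh(new HashMap<>(), listing, fetcher);
    System.out.println("first refresh lookups: " + lookups[0]);   // 2

    // Second refresh: one new file, the two old ones are reused.
    lookups[0] = 0;
    listing.add(new FileStat("f3", 30L, 200L));
    Map<String, FileDesc> second = refresh(first, listing, fetcher);
    System.out.println("second refresh lookups: " + lookups[0]);  // 1
    System.out.println("files after second refresh: " + second.size());  // 3
  }
}
```

With the current reset-and-rebuild behaviour, the second refresh in this toy 
example would need three lookups instead of one.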

 

Running examples:

1) First run, with no files added or changed.

[vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;

LOG:

I0206 11:18:16.581826 34494 HdfsTable.java:1333] 
Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 
Location: 
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux 
*FileDescriptors: 0* 

I0206 11:25:35.748185 34494 HdfsTable.java:1340] 
Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 
Location: 
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux 
*FileDescriptors: 148398*

 

2) Second run, 2 minutes after the first, with no files added or changed in 
between. Here we see that no file descriptors exist because of the 
resetPartitions() call, so all the files have to be read again.

[vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;

LOG:

 

I0206 11:27:54.086167 33902 HdfsTable.java:1333] 
Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 
Location: 
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux 
*FileDescriptors: 0*

I0206 11:36:35.344233 33902 HdfsTable.java:1340] 
Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 
Location: 
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux 
*FileDescriptors: 148398*
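
To put numbers on the two runs, the elapsed time of each refresh can be read 
off the log timestamps. A minimal sketch follows; the class and method names 
are made up for illustration, and only the four timestamps are taken from the 
logs above.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for this report; not part of Impala. */
class RefreshDurationSketch {

  // Time-of-day part of the log lines, with microsecond precision.
  private static final DateTimeFormatter LOG_TIME =
      DateTimeFormatter.ofPattern("HH:mm:ss.SSSSSS");

  /** Elapsed time between two log timestamps taken on the same day. */
  static Duration elapsed(String start, String end) {
    return Duration.between(
        LocalTime.parse(start, LOG_TIME), LocalTime.parse(end, LOG_TIME));
  }

  private static String format(Duration d) {
    long seconds = d.getSeconds();
    return (seconds / 60) + " min " + (seconds % 60) + " s";
  }

  public static void main(String[] args) {
    Duration first = elapsed("11:18:16.581826", "11:25:35.748185");
    Duration second = elapsed("11:27:54.086167", "11:36:35.344233");
    System.out.println("first refresh:  " + format(first));   // 7 min 19 s
    System.out.println("second refresh: " + format(second));  // 8 min 41 s
  }
}
```

Both refreshes take several minutes for the same 148398 files, even though 
nothing changed between them.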




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
