Gabriel Gatto created IMPALA-8167:
-------------------------------------
Summary: REFRESH on non-partitioned tables always reads the block
locations of all files, taking too long on big tables.
Key: IMPALA-8167
URL: https://issues.apache.org/jira/browse/IMPALA-8167
Project: IMPALA
Issue Type: Bug
Components: Catalog
Affects Versions: Impala 2.12.0
Reporter: Gabriel Gatto
REFRESH on a non-partitioned table always fetches block locations by calling
"getFileBlockLocations" on every file, whether or not any files are new.
We think the problem is in the method "updateUnpartitionedTableFileMd".
This method always resets the partitions and adds a new one with NO
FILEDESCRIPTORS, so refreshPartitionFileMetadata(part) always has to read
every file of the new partition to rebuild the information. As a result,
block locations are fetched for all files, regardless of whether they are
new or old.
This is confirmed by looking at the code:
  private void updateUnpartitionedTableFileMd() throws Exception {
    if (LOG.isTraceEnabled()) {
      LOG.trace("update unpartitioned table: " + getFullName());
    }
    resetPartitions();  // ---> DROP PARTITION WITH PREVIOUS FILEDESCRIPTOR INFO.
    org.apache.hadoop.hive.metastore.api.Table msTbl = getMetaStoreTable();
    Preconditions.checkNotNull(msTbl);
    addDefaultPartition(msTbl.getSd());
    HdfsPartition part = createPartition(msTbl.getSd(), null);  // ---> CREATES NEW PARTITION.
    addPartition(part);
    if (isMarkedCached_) part.markCached();
    LOG.info("Refreshing-updateUnpartitionedTableFileMd(): " + getFullName() +
        " Location: " + part.getLocation() +
        " FileDescriptors: " + part.getFileDescriptors().size());
    refreshPartitionFileMetadata(part);
    LOG.info("Refreshed-updateUnpartitionedTableFileMd(): " + getFullName() +
        " Location: " + part.getLocation() +
        " FileDescriptors: " + part.getFileDescriptors().size());
  }
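One possible direction for a fix, sketched under assumptions: keep the previous file descriptors and fetch block locations only for files whose length or modification time changed. The sketch below is self-contained and does not use Impala's classes. FileStatus, FileDesc, BlockLocator and refresh() are hypothetical names introduced for illustration only, and BlockLocator stands in for the expensive per-file block location call.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical, simplified stand-ins, not Impala's classes. */
final class IncrementalRefreshSketch {
  /** What a directory listing reports for one file. */
  static final class FileStatus {
    final String name;
    final long length;
    final long modificationTime;
    FileStatus(String name, long length, long modificationTime) {
      this.name = name;
      this.length = length;
      this.modificationTime = modificationTime;
    }
  }

  /** Cached per-file metadata, including the expensive block locations. */
  static final class FileDesc {
    final long length;
    final long modificationTime;
    final List<String> blockHosts;
    FileDesc(long length, long modificationTime, List<String> blockHosts) {
      this.length = length;
      this.modificationTime = modificationTime;
      this.blockHosts = blockHosts;
    }
  }

  /** Stand-in for the expensive per-file block location lookup. */
  interface BlockLocator {
    List<String> locate(FileStatus file);
  }

  /**
   * Builds the new descriptor map from a fresh listing. Descriptors of files
   * whose length and modification time are unchanged are reused as-is; only
   * new or changed files trigger a block location lookup. Files that no
   * longer appear in the listing are dropped.
   */
  static Map<String, FileDesc> refresh(Map<String, FileDesc> cached,
      List<FileStatus> listing, BlockLocator locator) {
    Map<String, FileDesc> refreshed = new HashMap<>();
    for (FileStatus f : listing) {
      FileDesc old = cached.get(f.name);
      boolean unchanged = old != null
          && old.length == f.length
          && old.modificationTime == f.modificationTime;
      if (unchanged) {
        refreshed.put(f.name, old);
      } else {
        refreshed.put(f.name,
            new FileDesc(f.length, f.modificationTime, locator.locate(f)));
      }
    }
    return refreshed;
  }
}
```

With this shape, a second refresh over an unchanged directory would still list the files but would make no block location calls. Whether comparing length and modification time is sufficient for every case (for example HDFS-cached tables) is an open question that this sketch does not address.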
Running examples:
1) First run, with no files added or changed.
[vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;
LOG:
I0206 11:18:16.581826 34494 HdfsTable.java:1333]
Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02
Location:
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux
*FileDescriptors: 0*
I0206 11:25:35.748185 34494 HdfsTable.java:1340]
Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02
Location:
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux
*FileDescriptors: 148398*
2) Second run, about 2 minutes after the first, with no files added or changed
in between. The new partition again starts with no file descriptors because of
resetPartitions(), so all the files have to be read again.
[vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;
LOG:
I0206 11:27:54.086167 33902 HdfsTable.java:1333]
Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02
Location:
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux
*FileDescriptors: 0*
I0206 11:36:35.344233 33902 HdfsTable.java:1340]
Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02
Location:
hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux
*FileDescriptors: 148398*
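For reference, the duration of each REFRESH can be read off the "Refreshing" and "Refreshed" timestamps in the log lines above. The small helper below does that arithmetic. RefreshTiming and elapsedMillis are illustrative names, and the sketch assumes both timestamps fall on the same day.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Illustration only: computes elapsed time between two log timestamps. */
final class RefreshTiming {
  /** Parses HH:mm:ss.SSSSSS timestamps and returns the difference in milliseconds. */
  static long elapsedMillis(String start, String end) {
    return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMillis();
  }

  public static void main(String[] args) {
    // Timestamps copied from the two runs in this report.
    long first = elapsedMillis("11:18:16.581826", "11:25:35.748185");
    long second = elapsedMillis("11:27:54.086167", "11:36:35.344233");
    System.out.println("first refresh:  " + first / 1000 + " s");   // prints 439 s
    System.out.println("second refresh: " + second / 1000 + " s");  // prints 521 s
  }
}
```

That is roughly 7 min 19 s for the first run and 8 min 41 s for the second, even though no files changed between them.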
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)