[
https://issues.apache.org/jira/browse/IMPALA-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489304#comment-16489304
]
bharath v commented on IMPALA-6119:
-----------------------------------
Thanks for digging into this. I vaguely remember that we considered the above
solutions and, as you mentioned, (1) is more computationally heavy and (2)
adds more memory, and the Catalog is already notorious for its memory usage. :)
What are your thoughts on the fix I proposed? We could exploit the references
and update the source fds directly.
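To make the idea concrete, here is a minimal, self-contained sketch (not Impala
code). It assumes that partitions mapped to the same path hold a reference to
the same descriptor list; whether the Catalog actually wires them that way is
an assumption on my part. The Partition class, the String "descriptors" and
all method names are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SharedDescriptorsSketch {
  // Hypothetical stand-in for a partition; descriptors are plain strings here.
  static final class Partition {
    final String name;
    List<String> fileDescriptors;

    Partition(String name, List<String> fileDescriptors) {
      this.name = name;
      this.fileDescriptors = fileDescriptors;
    }
  }

  // Copy-on-write style: each given partition is pointed at the new list.
  // Partitions that are not in 'partitions' keep the stale list.
  static void replaceDescriptors(List<Partition> partitions, List<String> newFds) {
    for (Partition p : partitions) {
      p.fileDescriptors = newFds;
    }
  }

  // In-place style: mutate the list that every partition already references.
  static void updateSharedDescriptors(List<String> sharedFds, List<String> newFds) {
    sharedFds.clear();
    sharedFds.addAll(newFds);
  }

  // Returns {files seen by b=1, files seen by b=2} after one simulated insert.
  static int[] fileCounts(boolean updateInPlace) {
    List<String> shared = new ArrayList<>(Arrays.asList("file_0"));
    Partition b1 = new Partition("b=1", shared);
    Partition b2 = new Partition("b=2", shared);

    // Simulated directory listing after "insert into test partition(b=1) ...".
    List<String> reloaded = Arrays.asList("file_0", "file_1");

    if (updateInPlace) {
      updateSharedDescriptors(shared, reloaded);
    } else {
      // Only b=1 is handed to the reload, as in the reported scenario.
      replaceDescriptors(Collections.singletonList(b1), new ArrayList<>(reloaded));
    }
    return new int[] {b1.fileDescriptors.size(), b2.fileDescriptors.size()};
  }

  public static void main(String[] args) {
    int[] cow = fileCounts(false);
    int[] inPlace = fileCounts(true);
    System.out.println("copy-on-write: b=1 sees " + cow[0] + ", b=2 sees " + cow[1]);
    System.out.println("in-place:      b=1 sees " + inPlace[0] + ", b=2 sees " + inPlace[1]);
  }
}
```

In this toy model the copy-on-write path leaves b=2 with one file while b=1
sees two, which mirrors the "show files" output quoted below; the in-place
path shows two files for both. The trade-off is that mutating a shared list
needs care if other threads read it concurrently.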
> Inconsistent file metadata updates when multiple partitions point to the same
> path
> ----------------------------------------------------------------------------------
>
> Key: IMPALA-6119
> URL: https://issues.apache.org/jira/browse/IMPALA-6119
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
> Reporter: bharath v
> Assignee: Gabor Kaszab
> Priority: Critical
> Labels: correctness, ramp-up
>
> The following steps can give inconsistent results.
> {noformat}
> // Create a partitioned table
> create table test(a int) partitioned by (b int);
> // Create two partitions b=1/b=2 mapped to the same HDFS location.
> insert into test partition(b=1) values (1);
> alter table test add partition (b=2) location
> 'hdfs://localhost:20500/test-warehouse/test/b=1/'
> [localhost:21000] > show partitions test;
> Query: show partitions test
> +-------+-------+--------+------+--------------+-------------------+--------+-------------------+------------------------------------------------+
> | b     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                       |
> +-------+-------+--------+------+--------------+-------------------+--------+-------------------+------------------------------------------------+
> | 1     | -1    | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://localhost:20500/test-warehouse/test/b=1 |
> | 2     | -1    | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://localhost:20500/test-warehouse/test/b=1 |
> | Total | -1    | 2      | 4B   | 0B           |                   |        |                   |                                                |
> +-------+-------+--------+------+--------------+-------------------+--------+-------------------+------------------------------------------------+
> // Insert new data into one of the partitions
> insert into test partition(b=1) values (2);
> // The newly added file is reflected only in the files of the partition it was inserted into.
> show files in test;
> Query: show files in test
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | Path                                                                                               | Size | Partition |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | hdfs://localhost:20500/test-warehouse/test/b=1/2e44cd49e8c3d30d-572fc97800000000_627280230_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=2       |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> invalidate metadata test;
> show files in test;
> // After invalidation, the newly added file now shows up in both
> // partitions.
> Query: show files in test
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | Path                                                                                               | Size | Partition |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | hdfs://localhost:20500/test-warehouse/test/b=1/2e44cd49e8c3d30d-572fc97800000000_627280230_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/2e44cd49e8c3d30d-572fc97800000000_627280230_data.0. | 2B   | b=2       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=2       |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> {noformat}
> So, depending on whether the user invalidates the table, they can see
> different results. The bug is in the following code.
> {noformat}
> private FileMetadataLoadStats resetAndLoadFileMetadata(
>     Path partDir, List<HdfsPartition> partitions) throws IOException {
>   FileMetadataLoadStats loadStats = new FileMetadataLoadStats(partDir);
>   ....
>   ....
>   ....
>   for (HdfsPartition partition: partitions)
>     partition.setFileDescriptors(newFileDescs); <======
> {noformat}
> We only apply the newly added file metadata to the partition being updated
> (copy-on-write style). Instead, we should update the source descriptors so
> that the change is also reflected in the other partitions.