Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15818 )
Change subject: IMPALA-9512: Full ACID Milestone 2: Validate rows against the valid write id list ...................................................................... Patch Set 1: (12 comments) http://gerrit.cloudera.org:8080/#/c/15818/1/be/src/exec/acid-metadata-utils.cc File be/src/exec/acid-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/15818/1/be/src/exec/acid-metadata-utils.cc@113 PS1, Line 113: for (int64_t i = min_write_id; i <= max_write_id; ++i) { > Currently we don't open a transaction for table loading. And we fetch valid We talked about it offline, so I just sum up the conclusion. Hive Streaming Ingestion can write files that might contain row batches that Impala is not allowed to see. https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2 Hive Streaming Ingest V2 doesn't write aborted rows to the data files. When it commits a transaction it writes a new footer to the new end of the file. Hence different write ids go to different stripes => don't need to validate the write ids per row. It'd be enough to validate per stripe, but unfortunately Hive Streaming doesn't write statistics. Anyway, we can validate once per row batch. When Hive Streaming writes new data to the file it also creates a side-file that contains the last committed file length. With the help of it we can find the latest footer. Impala cannot read such files currently, so I opened IMPALA-9723 to track this. http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java: http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@269 PS1, Line 269: * TODO(boroknagyz): rename or remove > TODO: resolve this TODO Done http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java File fe/src/main/java/org/apache/impala/util/AcidUtils.java: http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java@300 PS1, Line 300: if (loadStats != null) > nit: use braces in multi line blocks Made it single-line http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java@349 PS1, Line 349: loadStats.filesSupercededByNewerBase++; > nit: use braces in multi line blocks Made it single-line http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java@374 PS1, Line 374: if (parsedDelta.minWriteId <= baseWriteId) > It is not possible for baseWriteId to be between minWriteId and maxWriteId, Added only added precondition check because normally it shouldn't happen. http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java File fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java: http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@277 PS1, Line 277: transaction 10 > Don't you mean writeId 10? I don't see any txn id in the paths. Done http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@356 PS1, Line 356: "delta_0000006_0000007/00000" > How do we know that this minor compaction was actually finished, and we are We talked about it offline. Hive Streaming Ingest creates such delta directories. http://gerrit.cloudera.org:8080/#/c/15818/1/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/15818/1/testdata/bin/generate-schema-statements.py@663 PS1, Line 663: print "file_format=", file_format : print "create_file_format=", create_file_format > TODO: delete these Done http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/acid_txn.py File tests/common/acid_txn.py: http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/acid_txn.py@41 PS1, Line 41: # >>> at.get_open_txns() > optional: for me it would be more logical to put this in util than to commo Done http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/impala_test_suite.py File tests/common/impala_test_suite.py: http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/impala_test_suite.py@620 PS1, Line 620: print "TEST_SECTION", test_section > TODO: delete this Done http://gerrit.cloudera.org:8080/#/c/15818/1/tests/query_test/test_acid_row_validation.py File tests/query_test/test_acid_row_validation.py: http://gerrit.cloudera.org:8080/#/c/15818/1/tests/query_test/test_acid_row_validation.py@73 PS1, Line 73: Creates a table like 'base_table, synthetically creates a minor compacted > nit: missing ' n/a since removed this method http://gerrit.cloudera.org:8080/#/c/15818/1/tests/query_test/test_acid_row_validation.py@85 PS1, Line 85: check_call(['hdfs', 'dfs', '-mkdir', '-p', compacted_delta_dir]) : check_call(['hdfs', 'dfs', '-cp'] + files + [compacted_delta_dir]) > using https://github.com/apache/impala/blob/master/tests/util/hdfs_util.py I know, I initially used that but it's a bit annoying to use. E.g.: * _get_table_location() returns a URI that hdfs_client cannot use * For some calls it uses webhdfs_client, for some other calls it uses hdfs_filesystem_client. The first only accepts absolute paths, but without the leading "/". The latter can also use relative paths, so we need to add "/" if we want an absolute path. See make_dir() vs copy_from_local() Anyway, I switched to the hdfs client because it might enable the test to run on different filesystems. -- To view, visit http://gerrit.cloudera.org:8080/15818 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d Gerrit-Change-Number: 15818 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 04 May 2020 14:58:53 +0000 Gerrit-HasComments: Yes
