Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15818 )

Change subject: IMPALA-9512: Full ACID Milestone 2: Validate rows against the 
valid write id list
......................................................................


Patch Set 1:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/15818/1/be/src/exec/acid-metadata-utils.cc
File be/src/exec/acid-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/15818/1/be/src/exec/acid-metadata-utils.cc@113
PS1, Line 113:   for (int64_t i = min_write_id; i <= max_write_id; ++i) {
> Currently we don't open a transaction for table loading. And we fetch valid
We talked about it offline, so I just sum up the conclusion.

Hive Streaming Ingestion can write files that might contain row batches that 
Impala is not allowed to see.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2

Hive Streaming Ingest V2 doesn't write aborted rows to the data files.
When it commits a transaction it writes a new footer to the new end of the 
file. Hence different write ids go to different stripes => don't need to 
validate the write ids per row. It'd be enough to validate per stripe, but 
unfortunately Hive Streaming doesn't write statistics. Anyway, we can validate 
once per row batch.

When Hive Streaming writes new data to the file it also creates a side-file 
that contains the last committed file length. With the help of it we can find 
the latest footer. Impala cannot read such files currently, so I opened 
IMPALA-9723 to track this.


http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java@269
PS1, Line 269:      * TODO(boroknagyz): rename or remove
> TODO: resolve this TODO
Done


http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java
File fe/src/main/java/org/apache/impala/util/AcidUtils.java:

http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java@300
PS1, Line 300:         if (loadStats != null)
> nit: use braces in multi line blocks
Made it single-line


http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java@349
PS1, Line 349:             loadStats.filesSupercededByNewerBase++;
> nit: use braces in multi line blocks
Made it single-line


http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/main/java/org/apache/impala/util/AcidUtils.java@374
PS1, Line 374: if (parsedDelta.minWriteId <= baseWriteId)
> It is not possible for baseWriteId to be between minWriteId and maxWriteId,
Added only added precondition check because normally it shouldn't happen.


http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
File fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java:

http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@277
PS1, Line 277: transaction 10
> Don't you mean writeId 10? I don't see any txn id in the paths.
Done


http://gerrit.cloudera.org:8080/#/c/15818/1/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@356
PS1, Line 356:           "delta_0000006_0000007/00000"
> How do we know that this minor compaction was actually finished, and we are
We talked about it offline. Hive Streaming Ingest creates such delta 
directories.


http://gerrit.cloudera.org:8080/#/c/15818/1/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/15818/1/testdata/bin/generate-schema-statements.py@663
PS1, Line 663:       print "file_format=", file_format
             :       print "create_file_format=", create_file_format
> TODO: delete these
Done


http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/acid_txn.py
File tests/common/acid_txn.py:

http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/acid_txn.py@41
PS1, Line 41: # >>> at.get_open_txns()
> optional: for me it would be more logical to put this in util than to commo
Done


http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/impala_test_suite.py
File tests/common/impala_test_suite.py:

http://gerrit.cloudera.org:8080/#/c/15818/1/tests/common/impala_test_suite.py@620
PS1, Line 620:       print "TEST_SECTION", test_section
> TODO: delete this
Done


http://gerrit.cloudera.org:8080/#/c/15818/1/tests/query_test/test_acid_row_validation.py
File tests/query_test/test_acid_row_validation.py:

http://gerrit.cloudera.org:8080/#/c/15818/1/tests/query_test/test_acid_row_validation.py@73
PS1, Line 73:     Creates a table like 'base_table, synthetically creates a 
minor compacted
> nit: missing '
n/a since removed this method


http://gerrit.cloudera.org:8080/#/c/15818/1/tests/query_test/test_acid_row_validation.py@85
PS1, Line 85:     check_call(['hdfs', 'dfs', '-mkdir', '-p', 
compacted_delta_dir])
            :     check_call(['hdfs', 'dfs', '-cp'] + files + 
[compacted_delta_dir])
> using https://github.com/apache/impala/blob/master/tests/util/hdfs_util.py
I know, I initially used that but it's a bit annoying to use. E.g.:

 * _get_table_location() returns a URI that hdfs_client cannot use
 * For some calls it uses webhdfs_client, for some other calls it uses 
hdfs_filesystem_client. The first only accepts absolute paths, but without the 
leading "/". The latter can also use relative paths, so we need to add "/" if 
we want an absolute path. See make_dir() vs copy_from_local()

Anyway, I switched to the hdfs client because it might enable the test to run 
on different filesystems.



--
To view, visit http://gerrit.cloudera.org:8080/15818
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5ed74585a2d73ebbcee763b0545be4412926299d
Gerrit-Change-Number: 15818
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 04 May 2020 14:58:53 +0000
Gerrit-HasComments: Yes

Reply via email to