Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15395 )
Change subject: IMPALA-9042: Milestone 1: properly scan files that has full ACID schema ...................................................................... Patch Set 4: (21 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/15395/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15395/3//COMMIT_MSG@58 PS3, Line 58: * specific queries about hidden columns (full-acid-rowid.test) > Can you also add describe and describe formatted tests? I assume these show Added tests to 'describe-path.test'. http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/hdfs-orc-scanner.cc@181 PS3, Line 181: bool is_file_full_acid = reader_->hasMetadataValue("hive.acid.version") && > I assume you're going to remove the log lines? Done http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/hdfs-orc-scanner.cc@185 PS3, Line 185: return Status(Substitute("Error: Table is in full ACID format, but file " > This is going to be addressed at some point, right? So we can scan original Yes, "original files" are tracked with IMPALA-9515. Added TODO comment here. http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.h File be/src/exec/orc-metadata-utils.h: http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.h@46 PS3, Line 46: private: > Maybe IsFileFullAcid()? Since it's going to get confusing in future about w Removed this function. http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.cc@94 PS3, Line 94: if (table_idx == num_part_cols + FILE_INDEX_OF_FIELD_ROW) { > I'd find it clearer if it was a named constant instead of the magic number Done http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.cc@101 PS3, Line 101: DCHECK > DCHECK_GE Done http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.cc@110 PS3, Line 110: else { > nit: This and the same @115 could be moved outside the if. Done http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.cc@118 PS3, Line 118: // Hence, in the table metadata this is a top-level column, i.e. it is offsetted > Could you explain the values here? Added some comments about it. Yes, file_idx should remain the same because only the first number is offsetted with 'num_part_cols'. http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/exec/orc-metadata-utils.cc@231 PS3, Line 231: type.DebugString(), orc_type.toString(), filename_)); > Can you add a brief comment on the approach? It seems to be basically "if i Looking at the ORC file metadata it turned out that the ACID version is also written there, so using that from now on instead of this duck approach. Do you think it's worth to keep this for verifying the ACID schema? http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/15395/3/be/src/runtime/descriptors.h@333 PS3, Line 333: IsTableFul > IsFullAcidTable? Renamed it to IsTableFullAcid() to make it similar to IsFileFullAcid() (though the latter was entirely removed since then) http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/AlterTableStmt.java File fe/src/main/java/org/apache/impala/analysis/AlterTableStmt.java: http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/AlterTableStmt.java@94 PS3, Line 94: // TODO: IMPALA-8831 will enable all ALTER TABLE statements on transactional tables. > ? Added a comment about it. We need ALTER TABLE ADD PARTITION and ALTER TABLE SORT BY during data loading so in the new PS I only enable those. ALTER TABLE stmts are disabled because they fail when the table has column statistics (IMPALA-8831). The enabled ALTER TABLE statements should be safe in those regard. Hopefully there'll be a separate patch for IMPALA-8831 that does this for us so we can remove this from the final version of this change request. http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@a297 PS3, Line 297: > Removing this mean that we support write operations on full ACID tables? Do ALTER TABLE and DROP TABLE calls this, but they are supported operations. http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@224 PS3, Line 224: r, table.getFullName() > There are now two ensureTableNotFullAcid methods, that only differ in the p Yeah I think we can remove these. http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@237 PS3, Line 237: > nit: maybe change this string to enum? I rather keep it this way because it's quite flexible and I don't want to introduce a new enum that contains all possible operations. http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@238 PS3, Line 238: * @throws AnalysisException If table is a bucketed table. > nit: missing space Done http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Path.java File fe/src/main/java/org/apache/impala/analysis/Path.java: http://gerrit.cloudera.org:8080/#/c/15395/3/fe/src/main/java/org/apache/impala/analysis/Path.java@473 PS3, Line 473: convertToFullAcid > convertToFullAcidFilePath? Done http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/bin/generate-schema-statements.py@319 PS3, Line 319: ' > flake8: E129 visually indented line with same indent as next logical line Is there a way to get rid of these? It's also not true, because our next logical line is not indendted with 4 spaces. http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/workloads/functional-query/queries/QueryTest/acid-negative.test File testdata/workloads/functional-query/queries/QueryTest/acid-negative.test: http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/workloads/functional-query/queries/QueryTest/acid-negative.test@1 PS3, Line 1: ==== > Why commented out? In the new patch I only enable ADD PARTITION and SORT BY altering statements, so I enabled these tests again. See details at AlterTableStmt.java http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/workloads/functional-query/queries/QueryTest/full-acid-rowid.test File testdata/workloads/functional-query/queries/QueryTest/full-acid-rowid.test: http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/workloads/functional-query/queries/QueryTest/full-acid-rowid.test@16 PS3, Line 16: INT, BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, STRING, TIMESTAMP, INT, INT > Can you also add a LABELS section to make sure the labels for the ROW__ID c Done http://gerrit.cloudera.org:8080/#/c/15395/3/testdata/workloads/functional-query/queries/QueryTest/full-acid-rowid.test@17 PS3, Line 17: ==== > Do we expect the values to be deterministic across data loads? TL;DR: Yes, * operation: always 0 (INSERT) * originalTransaction: it's actually the write id of the txn that inserted the row, it starts from 1 * bucket: it's a bit-packed number: ** (bucket codec version) + (bucket ID) + (statement ID) ** bucket codec version is currently 1 ** bucket ID is parsed from the filename. If the table is not bucketed then it will be the ordinal of the file in the transactional directory ** statment ID: ordinal of statement in multi-statement transactions. * rowid: ordinal of row (within transaction and bucket) * currentTransaction: current write id, it only differs from originalTransaction when the operation is 2 (DELETE) http://gerrit.cloudera.org:8080/#/c/15395/4/tests/query_test/test_nested_types.py File tests/query_test/test_nested_types.py: http://gerrit.cloudera.org:8080/#/c/15395/4/tests/query_test/test_nested_types.py@211 PS4, Line 211: too much indent -- To view, visit http://gerrit.cloudera.org:8080/15395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb Gerrit-Change-Number: 15395 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Norbert Luksa <norbert.lu...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Fri, 20 Mar 2020 16:39:07 +0000 Gerrit-HasComments: Yes