Repository: incubator-impala Updated Branches: refs/heads/master cc8a11983 -> 0467b0d54
IMPALA-4725: Query option to control Parquet array resolution.

Summary of changes:
Introduces a new query option PARQUET_ARRAY_RESOLUTION to control the
path-resolution behavior for Parquet files with nested arrays. The values are:

- THREE_LEVEL: Assumes arrays are encoded with the 3-level representation.
  Also resolves arrays encoded with a single level. Does not attempt a
  2-level resolution.
- TWO_LEVEL: Assumes arrays are encoded with the 2-level representation.
  Also resolves arrays encoded with a single level. Does not attempt a
  3-level resolution.
- TWO_LEVEL_THEN_THREE_LEVEL: First tries to resolve assuming the 2-level
  representation, and if unsuccessful, tries the 3-level representation.
  Also resolves arrays encoded with a single level. This is the current
  Impala behavior and is used as the default value for compatibility.

Note that 'failure' to resolve a schema path with a given array-resolution
policy does not necessarily mean a warning or error is returned by the query.
A mismatch might be treated like a missing field, which is necessary to
support schema evolution. There is no way to reliably distinguish the
'bad resolution' and 'legitimately missing field' cases.

The new query option is independent of and can be combined with the existing
PARQUET_FALLBACK_SCHEMA_RESOLUTION.

Background:
Arrays can be represented in several ways in Parquet:
- Three Level Encoding (standard)
- Two Level Encoding (legacy)
- One Level Encoding (legacy)

More details are in the "Lists" section of the spec:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Unfortunately, there is no reliable metadata within Parquet files to indicate
which encoding was used. There is even the possibility of having mixed
encodings within the same file if there are multiple arrays. As a result,
Impala currently tries to auto-detect the file encoding when resolving a
schema path in a Parquet file using the TWO_LEVEL_THEN_THREE_LEVEL policy.
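The policy-to-encoding fallback described above can be sketched in a few lines. This is a hypothetical Python model of the behavior (the commit implements it in C++ via ParquetSchemaResolver::ORDERED_ARRAY_ENCODINGS); the `try_encoding` callable is a stand-in for the real path-resolution helper, not an Impala API:

```python
# Minimal sketch (not Impala's C++ implementation) of how each
# PARQUET_ARRAY_RESOLUTION policy maps to an ordered list of array
# encodings that are attempted in turn during path resolution.
ONE_LEVEL, TWO_LEVEL, THREE_LEVEL = "one_level", "two_level", "three_level"

ORDERED_ARRAY_ENCODINGS = {
    "THREE_LEVEL": [THREE_LEVEL, ONE_LEVEL],
    "TWO_LEVEL": [TWO_LEVEL, ONE_LEVEL],
    "TWO_LEVEL_THEN_THREE_LEVEL": [TWO_LEVEL, THREE_LEVEL, ONE_LEVEL],
}

def resolve_path(policy, try_encoding):
    """Return the node found under the first encoding that succeeds.

    'try_encoding' is a hypothetical callable standing in for Impala's
    ResolvePathHelper(): it returns a resolved node, or None on failure.
    """
    for encoding in ORDERED_ARRAY_ENCODINGS[policy]:
        node = try_encoding(encoding)
        if node is not None:
            return node
    return None  # treated as a missing field or an error upstream
```

ONE_LEVEL appears last in every ordering because, per the change, a one-level match is unambiguous with respect to the other encodings, so there is no harm in always trying it.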
However, regardless of whether a Parquet data file uses the 2-level or 3-level
encoding, the index-based resolution may return incorrect results if the
representation in the Parquet file does not exactly match the attempted
array-resolution policy. Intuitively, when attempting a 2-level resolution on
a 3-level file, the matched schema node may not be deep enough in the schema
tree, but could still be a scalar node with the expected type. Similarly, when
attempting a 3-level resolution on a 2-level file, a level may be incorrectly
skipped. The name-based policy generally does not have this problem because it
avoids traversing incorrect schema paths. However, the index-based resolution
allows a different set of schema-evolution operations, so just using
name-based resolution is not an acceptable workaround in all cases.

Testing:
- Added new Parquet data files that show how incorrect results can be
  returned with a mismatched file encoding and resolution policy. Added both
  2-level and 3-level versions of the data.
- Added a new test in test_nested_types.py that shows the behavior with the
  new PARQUET_ARRAY_RESOLUTION query option.
- Locally ran test_scanners.py and test_nested_types.py on core.
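The 2/3-level ambiguity described above can be made concrete with a small sketch. The schema trees below are hypothetical simplified tuples (not Impala's SchemaNode type) modeling the AmbiguousList test data, and the index walk shows how 's2.f22' lands on 'F11' when a 2-level resolution is attempted against a 3-level file:

```python
# Hypothetical, simplified schema trees: (name, [children]) tuples standing in
# for Parquet schema nodes; leaves are (name,) singletons.
three_level = ("ambigArray", [          # LIST-annotated group
    ("list", [                          # repeated group (3-level encoding)
        ("element", [                   # the array element struct
            ("s2", [("f21",), ("f22",)]),
            ("F11",), ("F12",),
        ]),
    ]),
])

two_level = ("ambigArray", [            # LIST-annotated group
    ("array", [                         # repeated group IS the element (2-level)
        ("s2", [("f21",), ("f22",)]),
        ("F11",), ("F12",),
    ]),
])

def child(node, i):
    """One index-based step into a schema node's children."""
    return node[1][i]

# Correct 3-level walk for s2.f22: repeated group -> element -> s2 -> f22.
element = child(child(three_level, 0), 0)
assert child(child(element, 0), 1) == ("f22",)

# 2-level assumption applied to the 3-level file: the repeated group itself is
# taken to be the element struct, so the walk stops one level too shallow and
# 's2.f22' (child 0, then child 1) resolves to the scalar 'F11' instead --
# a type-compatible but wrong node, silently returning wrong data.
element_guess = child(three_level, 0)   # actually the 'list' wrapper group
wrong = child(child(element_guess, 0), 1)
assert wrong == ("F11",)
```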
Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907 Reviewed-on: http://gerrit.cloudera.org:8080/6250 Reviewed-by: Alex Behm <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/7d8acee8 Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/7d8acee8 Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/7d8acee8 Branch: refs/heads/master Commit: 7d8acee81437c040255c3219bcd3e72f10b3b772 Parents: cc8a119 Author: Alex Behm <[email protected]> Authored: Fri Feb 24 19:04:59 2017 -0800 Committer: Impala Public Jenkins <[email protected]> Committed: Thu Mar 9 05:07:44 2017 +0000 ---------------------------------------------------------------------- be/src/exec/hdfs-parquet-scanner.cc | 3 +- be/src/exec/parquet-metadata-utils.cc | 62 +++++++----- be/src/exec/parquet-metadata-utils.h | 36 ++++--- be/src/service/query-options.cc | 20 ++++ be/src/service/query-options.h | 3 +- common/thrift/ImpalaInternalService.thrift | 12 +++ common/thrift/ImpalaService.thrift | 12 +++ .../AmbiguousList.avsc | 19 ++++ .../AmbiguousList.json | 6 ++ .../AmbiguousList_Legacy.parquet | Bin 0 -> 1089 bytes .../AmbiguousList_Modern.parquet | Bin 0 -> 1130 bytes testdata/parquet_nested_types_encodings/README | 29 ++++++ .../parquet-ambiguous-list-legacy.test | 71 ++++++++++++++ .../parquet-ambiguous-list-modern.test | 96 +++++++++++++++++++ tests/query_test/test_nested_types.py | 83 ++++++++++++++++ 15 files changed, 412 insertions(+), 40 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/exec/hdfs-parquet-scanner.cc ---------------------------------------------------------------------- diff --git a/be/src/exec/hdfs-parquet-scanner.cc b/be/src/exec/hdfs-parquet-scanner.cc index bad6b33..f21ccff 100644 --- 
a/be/src/exec/hdfs-parquet-scanner.cc +++ b/be/src/exec/hdfs-parquet-scanner.cc @@ -246,7 +246,8 @@ Status HdfsParquetScanner::Open(ScannerContext* context) { // Parse the file schema into an internal representation for schema resolution. schema_resolver_.reset(new ParquetSchemaResolver(*scan_node_->hdfs_table(), - state_->query_options().parquet_fallback_schema_resolution)); + state_->query_options().parquet_fallback_schema_resolution, + state_->query_options().parquet_array_resolution)); RETURN_IF_ERROR(schema_resolver_->Init(&file_metadata_, filename())); // We've processed the metadata and there are columns that need to be materialized. http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/exec/parquet-metadata-utils.cc ---------------------------------------------------------------------- diff --git a/be/src/exec/parquet-metadata-utils.cc b/be/src/exec/parquet-metadata-utils.cc index 2527219..931161a 100644 --- a/be/src/exec/parquet-metadata-utils.cc +++ b/be/src/exec/parquet-metadata-utils.cc @@ -40,6 +40,14 @@ using boost::algorithm::token_compress_on; namespace impala { +// Needs to be in sync with the order of enum values declared in TParquetArrayResolution. 
+const std::vector<ParquetSchemaResolver::ArrayEncoding> + ParquetSchemaResolver::ORDERED_ARRAY_ENCODINGS[] = + {{ParquetSchemaResolver::THREE_LEVEL, ParquetSchemaResolver::ONE_LEVEL}, + {ParquetSchemaResolver::TWO_LEVEL, ParquetSchemaResolver::ONE_LEVEL}, + {ParquetSchemaResolver::TWO_LEVEL, ParquetSchemaResolver::THREE_LEVEL, + ParquetSchemaResolver::ONE_LEVEL}}; + Status ParquetMetadataUtils::ValidateFileVersion( const parquet::FileMetaData& file_metadata, const char* filename) { if (file_metadata.version > PARQUET_CURRENT_VERSION) { @@ -384,39 +392,41 @@ Status ParquetSchemaResolver::CreateSchemaTree( Status ParquetSchemaResolver::ResolvePath(const SchemaPath& path, SchemaNode** node, bool* pos_field, bool* missing_field) const { *missing_field = false; - // First try two-level array encoding. - bool missing_field_two_level; - Status status_two_level = - ResolvePathHelper(TWO_LEVEL, path, node, pos_field, &missing_field_two_level); - if (missing_field_two_level) DCHECK(status_two_level.ok()); - if (status_two_level.ok() && !missing_field_two_level) return Status::OK(); - // The two-level resolution failed or reported a missing field, try three-level array - // encoding. - bool missing_field_three_level; - Status status_three_level = - ResolvePathHelper(THREE_LEVEL, path, node, pos_field, &missing_field_three_level); - if (missing_field_three_level) DCHECK(status_three_level.ok()); - if (status_three_level.ok() && !missing_field_three_level) return Status::OK(); - // The three-level resolution failed or reported a missing field, try one-level array - // encoding. 
- bool missing_field_one_level; - Status status_one_level = - ResolvePathHelper(ONE_LEVEL, path, node, pos_field, &missing_field_one_level); - if (missing_field_one_level) DCHECK(status_one_level.ok()); - if (status_one_level.ok() && !missing_field_one_level) return Status::OK(); + const vector<ArrayEncoding>& ordered_array_encodings = + ORDERED_ARRAY_ENCODINGS[array_resolution_]; + + bool any_missing_field = false; + Status statuses[NUM_ARRAY_ENCODINGS]; + for (const auto& array_encoding: ordered_array_encodings) { + bool current_missing_field; + statuses[array_encoding] = ResolvePathHelper( + array_encoding, path, node, pos_field, &current_missing_field); + if (current_missing_field) DCHECK(statuses[array_encoding].ok()); + if (statuses[array_encoding].ok() && !current_missing_field) return Status::OK(); + any_missing_field = any_missing_field || current_missing_field; + } // None of the resolutions yielded a node. Set *missing_field to true if any of the // resolutions reported a missing field. - if (missing_field_one_level || missing_field_two_level || missing_field_three_level) { + if (any_missing_field) { *node = NULL; *missing_field = true; return Status::OK(); } - // All resolutions failed. Log and return the status from the three-level resolution - // (which is technically the standard). - DCHECK(!status_one_level.ok() && !status_two_level.ok() && !status_three_level.ok()); + + // All resolutions failed. Log and return the most relevant status. The three-level + // encoding is the Parquet standard, so always prefer that. Prefer the two-level over + // the one-level because the two-level can be specifically selected via a query option. 
+ Status error_status = Status::OK(); + for (int i = THREE_LEVEL; i >= ONE_LEVEL; --i) { + if (!statuses[i].ok()) { + error_status = statuses[i]; + break; + } + } + DCHECK(!error_status.ok()); *node = NULL; - VLOG_QUERY << status_three_level.msg().msg() << "\n" << GetStackTrace(); - return status_three_level; + VLOG_QUERY << error_status.msg().msg() << "\n" << GetStackTrace(); + return error_status; } Status ParquetSchemaResolver::ResolvePathHelper(ArrayEncoding array_encoding, http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/exec/parquet-metadata-utils.h ---------------------------------------------------------------------- diff --git a/be/src/exec/parquet-metadata-utils.h b/be/src/exec/parquet-metadata-utils.h index fdfc463..5e655b6 100644 --- a/be/src/exec/parquet-metadata-utils.h +++ b/be/src/exec/parquet-metadata-utils.h @@ -121,12 +121,16 @@ struct SchemaNode { /// Utility class to resolve SchemaPaths (e.g., from a table descriptor) against a /// Parquet file schema. Supports resolution by field index or by field name. +/// Supports different policies for resolving nested arrays based on the modern +/// three-level encoding or the legacy encodings (one and two level). class ParquetSchemaResolver { public: ParquetSchemaResolver(const HdfsTableDescriptor& tbl_desc, - TParquetFallbackSchemaResolution::type fallback_schema_resolution) + TParquetFallbackSchemaResolution::type fallback_schema_resolution, + TParquetArrayResolution::type array_resolution) : tbl_desc_(tbl_desc), fallback_schema_resolution_(fallback_schema_resolution), + array_resolution_(array_resolution), filename_(NULL) { } @@ -146,16 +150,32 @@ class ParquetSchemaResolver { /// 'pos_field' is set to false. Returns a non-OK status if 'path' cannot be resolved /// against the file's schema (e.g., unrecognized collection schema). /// - /// Tries to resolve assuming either two- or three-level array encoding in - /// 'schema_'. 
Returns a bad status if resolution fails in both cases. + /// Tries to resolve fields within lists according to the 'ordered_array_encodings_'. + /// Returns a bad status if resolution fails for all attempted array encodings. Status ResolvePath(const SchemaPath& path, SchemaNode** node, bool* pos_field, bool* missing_field) const; private: + /// The 'array_encoding' parameter determines whether to assume one-, two-, or + /// three-level array encoding. The returned status is not logged (i.e. it's an expected + /// error). + enum ArrayEncoding { + ONE_LEVEL, + TWO_LEVEL, + THREE_LEVEL, + NUM_ARRAY_ENCODINGS + }; + /// An arbitrary limit on the number of children per schema node we support. /// Used to sanity-check Parquet schemas. static const int SCHEMA_NODE_CHILDREN_SANITY_LIMIT = 64 * 1024; + /// Maps from the array-resolution policy to the ordered array encodings that should + /// be tried during path resolution. All entries have the ONE_LEVEL encoding at the end + /// because there is no ambiguity between the one-level and the other encodings (there + /// is no harm in trying it). + static const std::vector<ArrayEncoding> ORDERED_ARRAY_ENCODINGS[]; + /// Unflattens the schema metadata from a Parquet file metadata and converts it to our /// SchemaNode representation. Returns the result in 'node' unless an error status is /// returned. Does not set the slot_desc field of any SchemaNode. @@ -167,15 +187,6 @@ class ParquetSchemaResolver { int max_def_level, int max_rep_level, int ira_def_level, int* idx, int* col_idx, SchemaNode* node) const; - /// The 'array_encoding' parameter determines whether to assume one-, two-, or - /// three-level array encoding. The returned status is not logged (i.e. it's an expected - /// error). 
- enum ArrayEncoding { - ONE_LEVEL, - TWO_LEVEL, - THREE_LEVEL - }; - Status ResolvePathHelper(ArrayEncoding array_encoding, const SchemaPath& path, SchemaNode** node, bool* pos_field, bool* missing_field) const; @@ -208,6 +219,7 @@ class ParquetSchemaResolver { const HdfsTableDescriptor& tbl_desc_; const TParquetFallbackSchemaResolution::type fallback_schema_resolution_; + const TParquetArrayResolution::type array_resolution_; const char* filename_; /// Root node of our internal schema representation populated in Init(). http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/service/query-options.cc ---------------------------------------------------------------------- diff --git a/be/src/service/query-options.cc b/be/src/service/query-options.cc index 494d98e..1df0aae 100644 --- a/be/src/service/query-options.cc +++ b/be/src/service/query-options.cc @@ -396,6 +396,26 @@ Status impala::SetQueryOption(const string& key, const string& value, } break; } + case TImpalaQueryOptions::PARQUET_ARRAY_RESOLUTION: { + if (iequals(value, "three_level") || + value == to_string(TParquetArrayResolution::THREE_LEVEL)) { + query_options->__set_parquet_array_resolution( + TParquetArrayResolution::THREE_LEVEL); + } else if (iequals(value, "two_level") || + value == to_string(TParquetArrayResolution::TWO_LEVEL)) { + query_options->__set_parquet_array_resolution( + TParquetArrayResolution::TWO_LEVEL); + } else if (iequals(value, "two_level_then_three_level") || + value == to_string(TParquetArrayResolution::TWO_LEVEL_THEN_THREE_LEVEL)) { + query_options->__set_parquet_array_resolution( + TParquetArrayResolution::TWO_LEVEL_THEN_THREE_LEVEL); + } else { + return Status(Substitute("Invalid PARQUET_ARRAY_RESOLUTION option: '$0'. 
" + "Valid options are 'THREE_LEVEL', 'TWO_LEVEL' and " + "'TWO_LEVEL_THEN_THREE_LEVEL'.", value)); + } + break; + } case TImpalaQueryOptions::MT_DOP: { StringParser::ParseResult result; const int32_t dop = http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/service/query-options.h ---------------------------------------------------------------------- diff --git a/be/src/service/query-options.h b/be/src/service/query-options.h index f5220ea..ddb1b23 100644 --- a/be/src/service/query-options.h +++ b/be/src/service/query-options.h @@ -35,7 +35,7 @@ class TQueryOptions; // the DCHECK. #define QUERY_OPTS_TABLE\ DCHECK_EQ(_TImpalaQueryOptions_VALUES_TO_NAMES.size(),\ - TImpalaQueryOptions::PARQUET_DICTIONARY_FILTERING + 1);\ + TImpalaQueryOptions::PARQUET_ARRAY_RESOLUTION + 1);\ QUERY_OPT_FN(abort_on_default_limit_exceeded, ABORT_ON_DEFAULT_LIMIT_EXCEEDED)\ QUERY_OPT_FN(abort_on_error, ABORT_ON_ERROR)\ QUERY_OPT_FN(allow_unsupported_formats, ALLOW_UNSUPPORTED_FORMATS)\ @@ -89,6 +89,7 @@ class TQueryOptions; QUERY_OPT_FN(enable_expr_rewrites, ENABLE_EXPR_REWRITES)\ QUERY_OPT_FN(decimal_v2, DECIMAL_V2)\ QUERY_OPT_FN(parquet_dictionary_filtering, PARQUET_DICTIONARY_FILTERING)\ + QUERY_OPT_FN(parquet_array_resolution, PARQUET_ARRAY_RESOLUTION)\ ; http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/common/thrift/ImpalaInternalService.thrift ---------------------------------------------------------------------- diff --git a/common/thrift/ImpalaInternalService.thrift b/common/thrift/ImpalaInternalService.thrift index ddf137a..433f3b4 100644 --- a/common/thrift/ImpalaInternalService.thrift +++ b/common/thrift/ImpalaInternalService.thrift @@ -49,6 +49,14 @@ enum TParquetFallbackSchemaResolution { NAME } +// The order of the enum values needs to be kepy in sync with +// ParquetMetadataUtils::ORDERED_ARRAY_ENCODINGS in parquet-metadata-utils.cc. 
+enum TParquetArrayResolution { + THREE_LEVEL, + TWO_LEVEL, + TWO_LEVEL_THEN_THREE_LEVEL +} + // Query options that correspond to ImpalaService.ImpalaQueryOptions, with their // respective defaults. Query options can be set in the following ways: // @@ -224,6 +232,10 @@ struct TQueryOptions { // Indicates whether to use dictionary filtering for Parquet files 53: optional bool parquet_dictionary_filtering = true + + // Policy for resolving nested array fields in Parquet files. + 54: optional TParquetArrayResolution parquet_array_resolution = + TParquetArrayResolution.TWO_LEVEL_THEN_THREE_LEVEL } // Impala currently has two types of sessions: Beeswax and HiveServer2 http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/common/thrift/ImpalaService.thrift ---------------------------------------------------------------------- diff --git a/common/thrift/ImpalaService.thrift b/common/thrift/ImpalaService.thrift index 28ee3df..87832ad 100644 --- a/common/thrift/ImpalaService.thrift +++ b/common/thrift/ImpalaService.thrift @@ -254,6 +254,18 @@ enum TImpalaQueryOptions { // Indicates whether to use dictionary filtering for Parquet files PARQUET_DICTIONARY_FILTERING, + + // Policy for resolving nested array fields in Parquet files. + // An Impala array type can have several different representations in + // a Parquet schema (three, two, or one level). There is fundamental ambiguity + // between the two and three level encodings with index-based field resolution. + // The ambiguity can manually be resolved using this query option, or by using + // PARQUET_FALLBACK_SCHEMA_RESOLUTION=name. + // The value TWO_LEVEL_THEN_THREE_LEVEL was the default mode since Impala 2.3. + // It is preserved as the default for compatibility. + // TODO: Remove the TWO_LEVEL_THEN_THREE_LEVEL mode completely or at least make + // it non-default in a compatibility breaking release. + PARQUET_ARRAY_RESOLUTION } // The summary of a DML statement. 
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList.avsc ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList.avsc b/testdata/parquet_nested_types_encodings/AmbiguousList.avsc new file mode 100644 index 0000000..d53089b --- /dev/null +++ b/testdata/parquet_nested_types_encodings/AmbiguousList.avsc @@ -0,0 +1,19 @@ +{"type": "record", +"namespace": "org.apache.impala", +"name": "nested", +"fields": [ + {"name": "ambigArray", "type": { + "type": "array", "items": { + "name": "r1", "type": "record", "fields": [ + {"name": "s2", "type": + {"name": "r1s2", "type": "record", "fields": [ + {"name": "f21", "type": ["int", "null"]}, + {"name": "f22", "type": ["int", "null"]} + ]} + }, + {"name": "F11", "type": ["int", "null"]}, + {"name": "F12", "type": ["int", "null"]} + ] + } + }} +]} http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList.json ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList.json b/testdata/parquet_nested_types_encodings/AmbiguousList.json new file mode 100644 index 0000000..04733bc --- /dev/null +++ b/testdata/parquet_nested_types_encodings/AmbiguousList.json @@ -0,0 +1,6 @@ +[{ + "ambigArray": [ + {"s2": {"f21": 21, "f22": 22}, "F11": 11, "F12": 12}, + {"s2": {"f21": 210, "f22": 220}, "F11": 110, "F12": 120} + ] +}] http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet b/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet new file mode 100644 index 0000000..fa6574e Binary files /dev/null and 
b/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet differ http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet b/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet new file mode 100644 index 0000000..211a500 Binary files /dev/null and b/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet differ http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/README ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/README b/testdata/parquet_nested_types_encodings/README new file mode 100644 index 0000000..3b46f1d --- /dev/null +++ b/testdata/parquet_nested_types_encodings/README @@ -0,0 +1,29 @@ +The Parquet files AmbiguousList_Modern.parquet and AmbiguousList_Legacy.parquet were generated +using the kite script located here: +testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java + +The Parquet files can be regenerated by running the following commands in the testdata +directory: + +mvn package + +mvn exec:java \ + -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \ + -Dexec.args="--legacy_collection_format + data/parquet_nested_types_encodings/AmbiguousList.avsc.avsc + data/parquet_nested_types_encodings/AmbiguousList.avsc.json + data/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet" + +mvn exec:java \ + -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \ + -Dexec.args=" + data/parquet_nested_types_encodings/AmbiguousList.avsc.avsc + data/parquet_nested_types_encodings/AmbiguousList.avsc.json + data/parquet_nested_types_encodings/AmbiguousList_Modern.parquet" + +The script 
takes an Avro schema and a JSON file with data and creates a Parquet file. +The --legacy_collection_format flag makes the script output a Parquet file that uses the +legacy two-level format for nested types, rather than the modern three-level format. + +More information about the Parquet nested types format can be found here: +https://github.com/apache/parquet-format/blob/master/LogicalTypes.md http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test ---------------------------------------------------------------------- diff --git a/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test new file mode 100644 index 0000000..2ae67bc --- /dev/null +++ b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test @@ -0,0 +1,71 @@ +==== +---- QUERY +# Everything resolves correctly assuming the 2-level encoding because the +# data file indeed uses the 2-level encoding. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Same as above but with name-based resolution. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly with TWO_LEVEL and POSITION because the +# data file indeed uses the 2-level encoding. 
+set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly with TWO_LEVEL and NAME because the +# data file indeed uses the 2-level encoding. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=two_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Incorrect field resolutions with THREE_LEVEL and POSITION +# because the data file uses the 2-level encoding. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +22,NULL,NULL,NULL +220,NULL,NULL,NULL +---- TYPES +int,int,int,int +==== +---- QUERY +# All fields are interpreted as missing with THREE_LEVEL and NAME. 
+set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +NULL,NULL,NULL,NULL +NULL,NULL,NULL,NULL +---- TYPES +int,int,int,int +==== http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test ---------------------------------------------------------------------- diff --git a/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test new file mode 100644 index 0000000..79a9ad5 --- /dev/null +++ b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test @@ -0,0 +1,96 @@ +==== +---- QUERY +# The reference 's2.f22' is resolved assuming the 2-level encoding, and incorrectly +# returns the data of 'f11' because the data file uses the 3-level encoding. +# The other references are only resolvable assuming the 3-level encoding. +# This demonstrates the 2/3-level ambiguity with index-based resolution. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,11 +110,120,210,110 +---- TYPES +int,int,int,int +==== +---- QUERY +# 's2.f22' incorrectly returns the data of 'f11'. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11 +110 +---- TYPES +int +==== +---- QUERY +# 'f21' does not resolve with the 2-level encoding because it matches +# a Parquet group in the schema. 
+set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select s2.f21 from ambig_modern.ambigarray; +---- RESULTS +---- CATCH +has an incompatible Parquet schema +---- TYPES +int +==== +---- QUERY +# 'f11' and 'f12' are interpreted as missing fields. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select f11, f12 from ambig_modern.ambigarray; +---- RESULTS +NULL,NULL +NULL,NULL +---- TYPES +int,int +==== +---- QUERY +# Everything resolves correctly using the THREE_LEVEL array resolution policy, +# because the data file uses the 3-level encoding. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly using the THREE_LEVEL array resolution policy, +# combined with resolution by name. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly using the TWO_LEVEL_THEN_THREE_LEVEL array resolution +# policy, combined with resolution by name. The problematic 's2.f22' reference +# works because the 2-level resolution fails due to a field-name mismatch, +# but the 3-level resolution succeeds. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Test setting a bad value for PARQUET_ARRAY_RESOLUTION. 
+set parquet_array_resolution=bad_value; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- CATCH +Invalid PARQUET_ARRAY_RESOLUTION option: 'bad_value'. Valid options are 'THREE_LEVEL', 'TWO_LEVEL' and 'TWO_LEVEL_THEN_THREE_LEVEL'. +---- TYPES +int,int,int,int +==== http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/tests/query_test/test_nested_types.py ---------------------------------------------------------------------- diff --git a/tests/query_test/test_nested_types.py b/tests/query_test/test_nested_types.py index 1c65181..ffed8b5 100644 --- a/tests/query_test/test_nested_types.py +++ b/tests/query_test/test_nested_types.py @@ -436,6 +436,89 @@ class TestParquetArrayEncodings(ImpalaTestSuite): assert len(result.data) == 1 assert result.data == ['2'] + # $ parquet-tools schema AmbiguousList_Modern.parquet + # message org.apache.impala.nested { + # required group ambigArray (LIST) { + # repeated group list { + # required group element { + # required group s2 { + # optional int32 f21; + # optional int32 f22; + # } + # optional int32 F11; + # optional int32 F12; + # } + # } + # } + # } + # $ parquet-tools cat AmbiguousList_Modern.parquet + # ambigArray: + # .list: + # ..element: + # ...s2: + # ....f21 = 21 + # ....f22 = 22 + # ...F11 = 11 + # ...F12 = 12 + # .list: + # ..element: + # ...s2: + # ....f21 = 210 + # ....f22 = 220 + # ...F11 = 110 + # ...F12 = 120 + # + # $ parquet-tools schema AmbiguousList_Legacy.parquet + # message org.apache.impala.nested { + # required group ambigArray (LIST) { + # repeated group array { + # required group s2 { + # optional int32 f21; + # optional int32 f22; + # } + # optional int32 F11; + # optional int32 F12; + # } + # } + # } + # $ parquet-tools cat AmbiguousList_Legacy.parquet + # ambigArray: + # .array: + # ..s2: + # ...f21 = 21 + # ...f22 = 22 + # ..F11 = 11 + # ..F12 = 12 + # .array: + # ..s2: + # ...f21 = 210 + # ...f22 = 220 + # ..F11 = 110 + # ..F12 = 120 + def 
test_ambiguous_list(self, vector, unique_database): + """IMPALA-4725: Tests the schema-resolution behavior with different values for the + PARQUET_ARRAY_RESOLUTION and PARQUET_FALLBACK_SCHEMA_RESOLUTION query options. + The schema of the Parquet test files is constructed to induce incorrect results + with index-based resolution and the default TWO_LEVEL_THEN_THREE_LEVEL array + resolution policy. Regardless of whether the Parquet data files use the 2-level or + 3-level encoding, incorrect results may be returned if the array resolution does + not exactly match the data files'. The name-based policy generally does not have + this problem because it avoids traversing incorrect schema paths. + """ + ambig_modern_tbl = "ambig_modern" + self._create_test_table(unique_database, ambig_modern_tbl, + "AmbiguousList_Modern.parquet", + "ambigarray array<struct<s2:struct<f21:int,f22:int>,f11:int,f12:int>>") + self.run_test_case('QueryTest/parquet-ambiguous-list-modern', + vector, unique_database) + + ambig_legacy_tbl = "ambig_legacy" + self._create_test_table(unique_database, ambig_legacy_tbl, + "AmbiguousList_Legacy.parquet", + "ambigarray array<struct<s2:struct<f21:int,f22:int>,f11:int,f12:int>>") + self.run_test_case('QueryTest/parquet-ambiguous-list-legacy', + vector, unique_database) + def _create_test_table(self, dbname, tablename, filename, columns): """Creates a table in the given database with the given name and columns. Copies the file with the given name from TESTFILE_DIR into the table."""
