Repository: incubator-impala Updated Branches: refs/heads/master cc8a11983 -> 0467b0d54
IMPALA-4725: Query option to control Parquet array resolution.

Summary of changes:
Introduces a new query option PARQUET_ARRAY_RESOLUTION to control the
path-resolution behavior for Parquet files with nested arrays. The values are:

- THREE_LEVEL: Assumes arrays are encoded with the 3-level representation.
  Also resolves arrays encoded with a single level. Does not attempt a
  2-level resolution.
- TWO_LEVEL: Assumes arrays are encoded with the 2-level representation.
  Also resolves arrays encoded with a single level. Does not attempt a
  3-level resolution.
- TWO_LEVEL_THEN_THREE_LEVEL: First tries to resolve assuming the 2-level
  representation, and if unsuccessful, tries the 3-level representation.
  Also resolves arrays encoded with a single level. This is the current
  Impala behavior and is used as the default value for compatibility.

Note that 'failure' to resolve a schema path with a given array-resolution
policy does not necessarily mean a warning or error is returned by the query.
A mismatch might be treated like a missing field, which is necessary to
support schema evolution. There is no way to reliably distinguish the
'bad resolution' and 'legitimately missing field' cases.

The new query option is independent of and can be combined with the existing
PARQUET_FALLBACK_SCHEMA_RESOLUTION.

Background:
Arrays can be represented in several ways in Parquet:
- Three Level Encoding (standard)
- Two Level Encoding (legacy)
- One Level Encoding (legacy)

More details are in the "Lists" section of the spec:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Unfortunately, there is no reliable metadata within Parquet files to indicate
which encoding was used. There is even the possibility of having mixed
encodings within the same file if there are multiple arrays. As a result,
Impala currently tries to auto-detect the file encoding when resolving a
schema path in a Parquet file using the TWO_LEVEL_THEN_THREE_LEVEL policy.
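The policy-to-encoding fallback described above can be sketched in a few lines. This is a hypothetical Python model of the behavior (the commit implements it in C++ via ParquetSchemaResolver::ORDERED_ARRAY_ENCODINGS); the `try_encoding` callable is a stand-in for the real path-resolution helper, not an Impala API:

```python
# Minimal sketch (not Impala's C++ implementation) of how each
# PARQUET_ARRAY_RESOLUTION policy maps to an ordered list of array
# encodings that are attempted in turn during path resolution.
ONE_LEVEL, TWO_LEVEL, THREE_LEVEL = "one_level", "two_level", "three_level"

ORDERED_ARRAY_ENCODINGS = {
    "THREE_LEVEL": [THREE_LEVEL, ONE_LEVEL],
    "TWO_LEVEL": [TWO_LEVEL, ONE_LEVEL],
    "TWO_LEVEL_THEN_THREE_LEVEL": [TWO_LEVEL, THREE_LEVEL, ONE_LEVEL],
}

def resolve_path(policy, try_encoding):
    """Return the node found under the first encoding that succeeds.

    'try_encoding' is a hypothetical callable standing in for Impala's
    ResolvePathHelper(): it returns a resolved node, or None on failure.
    """
    for encoding in ORDERED_ARRAY_ENCODINGS[policy]:
        node = try_encoding(encoding)
        if node is not None:
            return node
    return None  # treated as a missing field or an error upstream
```

ONE_LEVEL appears last in every ordering because, per the change, a one-level match is unambiguous with respect to the other encodings, so there is no harm in always trying it.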
However, regardless of whether a Parquet data file uses the 2-level or 3-level
encoding, the index-based resolution may return incorrect results if the
representation in the Parquet file does not exactly match the attempted
array-resolution policy. Intuitively, when attempting a 2-level resolution on
a 3-level file, the matched schema node may not be deep enough in the schema
tree, but could still be a scalar node with the expected type. Similarly, when
attempting a 3-level resolution on a 2-level file, a level may be incorrectly
skipped. The name-based policy generally does not have this problem because it
avoids traversing incorrect schema paths. However, the index-based resolution
allows a different set of schema-evolution operations, so just using
name-based resolution is not an acceptable workaround in all cases.

Testing:
- Added new Parquet data files that show how incorrect results can be
  returned with a mismatched file encoding and resolution policy. Added both
  2-level and 3-level versions of the data.
- Added a new test in test_nested_types.py that shows the behavior with the
  new PARQUET_ARRAY_RESOLUTION query option.
- Locally ran test_scanners.py and test_nested_types.py on core.
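The 2/3-level ambiguity described above can be made concrete with a small sketch. The schema trees below are hypothetical simplified tuples (not Impala's SchemaNode type) modeling the AmbiguousList test data, and the index walk shows how 's2.f22' lands on 'F11' when a 2-level resolution is attempted against a 3-level file:

```python
# Hypothetical, simplified schema trees: (name, [children]) tuples standing in
# for Parquet schema nodes; leaves are (name,) singletons.
three_level = ("ambigArray", [          # LIST-annotated group
    ("list", [                          # repeated group (3-level encoding)
        ("element", [                   # the array element struct
            ("s2", [("f21",), ("f22",)]),
            ("F11",), ("F12",),
        ]),
    ]),
])

two_level = ("ambigArray", [            # LIST-annotated group
    ("array", [                         # repeated group IS the element (2-level)
        ("s2", [("f21",), ("f22",)]),
        ("F11",), ("F12",),
    ]),
])

def child(node, i):
    """One index-based step into a schema node's children."""
    return node[1][i]

# Correct 3-level walk for s2.f22: repeated group -> element -> s2 -> f22.
element = child(child(three_level, 0), 0)
assert child(child(element, 0), 1) == ("f22",)

# 2-level assumption applied to the 3-level file: the repeated group itself is
# taken to be the element struct, so the walk stops one level too shallow and
# 's2.f22' (child 0, then child 1) resolves to the scalar 'F11' instead --
# a type-compatible but wrong node, silently returning wrong data.
element_guess = child(three_level, 0)   # actually the 'list' wrapper group
wrong = child(child(element_guess, 0), 1)
assert wrong == ("F11",)
```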
Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907 Reviewed-on: http://gerrit.cloudera.org:8080/6250 Reviewed-by: Alex Behm <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/7d8acee8 Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/7d8acee8 Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/7d8acee8 Branch: refs/heads/master Commit: 7d8acee81437c040255c3219bcd3e72f10b3b772 Parents: cc8a119 Author: Alex Behm <[email protected]> Authored: Fri Feb 24 19:04:59 2017 -0800 Committer: Impala Public Jenkins <[email protected]> Committed: Thu Mar 9 05:07:44 2017 +0000 ---------------------------------------------------------------------- be/src/exec/hdfs-parquet-scanner.cc | 3 +- be/src/exec/parquet-metadata-utils.cc | 62 +++++++----- be/src/exec/parquet-metadata-utils.h | 36 ++++--- be/src/service/query-options.cc | 20 ++++ be/src/service/query-options.h | 3 +- common/thrift/ImpalaInternalService.thrift | 12 +++ common/thrift/ImpalaService.thrift | 12 +++ .../AmbiguousList.avsc | 19 ++++ .../AmbiguousList.json | 6 ++ .../AmbiguousList_Legacy.parquet | Bin 0 -> 1089 bytes .../AmbiguousList_Modern.parquet | Bin 0 -> 1130 bytes testdata/parquet_nested_types_encodings/README | 29 ++++++ .../parquet-ambiguous-list-legacy.test | 71 ++++++++++++++ .../parquet-ambiguous-list-modern.test | 96 +++++++++++++++++++ tests/query_test/test_nested_types.py | 83 ++++++++++++++++ 15 files changed, 412 insertions(+), 40 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/exec/hdfs-parquet-scanner.cc ---------------------------------------------------------------------- diff --git a/be/src/exec/hdfs-parquet-scanner.cc b/be/src/exec/hdfs-parquet-scanner.cc index bad6b33..f21ccff 100644 --- 
a/be/src/exec/hdfs-parquet-scanner.cc +++ b/be/src/exec/hdfs-parquet-scanner.cc @@ -246,7 +246,8 @@ Status HdfsParquetScanner::Open(ScannerContext* context) { // Parse the file schema into an internal representation for schema resolution. schema_resolver_.reset(new ParquetSchemaResolver(*scan_node_->hdfs_table(), - state_->query_options().parquet_fallback_schema_resolution)); + state_->query_options().parquet_fallback_schema_resolution, + state_->query_options().parquet_array_resolution)); RETURN_IF_ERROR(schema_resolver_->Init(&file_metadata_, filename())); // We've processed the metadata and there are columns that need to be materialized. http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/exec/parquet-metadata-utils.cc ---------------------------------------------------------------------- diff --git a/be/src/exec/parquet-metadata-utils.cc b/be/src/exec/parquet-metadata-utils.cc index 2527219..931161a 100644 --- a/be/src/exec/parquet-metadata-utils.cc +++ b/be/src/exec/parquet-metadata-utils.cc @@ -40,6 +40,14 @@ using boost::algorithm::token_compress_on; namespace impala { +// Needs to be in sync with the order of enum values declared in TParquetArrayResolution. 
+const std::vector<ParquetSchemaResolver::ArrayEncoding> + ParquetSchemaResolver::ORDERED_ARRAY_ENCODINGS[] = + {{ParquetSchemaResolver::THREE_LEVEL, ParquetSchemaResolver::ONE_LEVEL}, + {ParquetSchemaResolver::TWO_LEVEL, ParquetSchemaResolver::ONE_LEVEL}, + {ParquetSchemaResolver::TWO_LEVEL, ParquetSchemaResolver::THREE_LEVEL, + ParquetSchemaResolver::ONE_LEVEL}}; + Status ParquetMetadataUtils::ValidateFileVersion( const parquet::FileMetaData& file_metadata, const char* filename) { if (file_metadata.version > PARQUET_CURRENT_VERSION) { @@ -384,39 +392,41 @@ Status ParquetSchemaResolver::CreateSchemaTree( Status ParquetSchemaResolver::ResolvePath(const SchemaPath& path, SchemaNode** node, bool* pos_field, bool* missing_field) const { *missing_field = false; - // First try two-level array encoding. - bool missing_field_two_level; - Status status_two_level = - ResolvePathHelper(TWO_LEVEL, path, node, pos_field, &missing_field_two_level); - if (missing_field_two_level) DCHECK(status_two_level.ok()); - if (status_two_level.ok() && !missing_field_two_level) return Status::OK(); - // The two-level resolution failed or reported a missing field, try three-level array - // encoding. - bool missing_field_three_level; - Status status_three_level = - ResolvePathHelper(THREE_LEVEL, path, node, pos_field, &missing_field_three_level); - if (missing_field_three_level) DCHECK(status_three_level.ok()); - if (status_three_level.ok() && !missing_field_three_level) return Status::OK(); - // The three-level resolution failed or reported a missing field, try one-level array - // encoding. 
- bool missing_field_one_level; - Status status_one_level = - ResolvePathHelper(ONE_LEVEL, path, node, pos_field, &missing_field_one_level); - if (missing_field_one_level) DCHECK(status_one_level.ok()); - if (status_one_level.ok() && !missing_field_one_level) return Status::OK(); + const vector<ArrayEncoding>& ordered_array_encodings = + ORDERED_ARRAY_ENCODINGS[array_resolution_]; + + bool any_missing_field = false; + Status statuses[NUM_ARRAY_ENCODINGS]; + for (const auto& array_encoding: ordered_array_encodings) { + bool current_missing_field; + statuses[array_encoding] = ResolvePathHelper( + array_encoding, path, node, pos_field, &current_missing_field); + if (current_missing_field) DCHECK(statuses[array_encoding].ok()); + if (statuses[array_encoding].ok() && !current_missing_field) return Status::OK(); + any_missing_field = any_missing_field || current_missing_field; + } // None of the resolutions yielded a node. Set *missing_field to true if any of the // resolutions reported a missing field. - if (missing_field_one_level || missing_field_two_level || missing_field_three_level) { + if (any_missing_field) { *node = NULL; *missing_field = true; return Status::OK(); } - // All resolutions failed. Log and return the status from the three-level resolution - // (which is technically the standard). - DCHECK(!status_one_level.ok() && !status_two_level.ok() && !status_three_level.ok()); + + // All resolutions failed. Log and return the most relevant status. The three-level + // encoding is the Parquet standard, so always prefer that. Prefer the two-level over + // the one-level because the two-level can be specifically selected via a query option. 
+ Status error_status = Status::OK(); + for (int i = THREE_LEVEL; i >= ONE_LEVEL; --i) { + if (!statuses[i].ok()) { + error_status = statuses[i]; + break; + } + } + DCHECK(!error_status.ok()); *node = NULL; - VLOG_QUERY << status_three_level.msg().msg() << "\n" << GetStackTrace(); - return status_three_level; + VLOG_QUERY << error_status.msg().msg() << "\n" << GetStackTrace(); + return error_status; } Status ParquetSchemaResolver::ResolvePathHelper(ArrayEncoding array_encoding, http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/exec/parquet-metadata-utils.h ---------------------------------------------------------------------- diff --git a/be/src/exec/parquet-metadata-utils.h b/be/src/exec/parquet-metadata-utils.h index fdfc463..5e655b6 100644 --- a/be/src/exec/parquet-metadata-utils.h +++ b/be/src/exec/parquet-metadata-utils.h @@ -121,12 +121,16 @@ struct SchemaNode { /// Utility class to resolve SchemaPaths (e.g., from a table descriptor) against a /// Parquet file schema. Supports resolution by field index or by field name. +/// Supports different policies for resolving nested arrays based on the modern +/// three-level encoding or the legacy encodings (one and two level). class ParquetSchemaResolver { public: ParquetSchemaResolver(const HdfsTableDescriptor& tbl_desc, - TParquetFallbackSchemaResolution::type fallback_schema_resolution) + TParquetFallbackSchemaResolution::type fallback_schema_resolution, + TParquetArrayResolution::type array_resolution) : tbl_desc_(tbl_desc), fallback_schema_resolution_(fallback_schema_resolution), + array_resolution_(array_resolution), filename_(NULL) { } @@ -146,16 +150,32 @@ class ParquetSchemaResolver { /// 'pos_field' is set to false. Returns a non-OK status if 'path' cannot be resolved /// against the file's schema (e.g., unrecognized collection schema). /// - /// Tries to resolve assuming either two- or three-level array encoding in - /// 'schema_'. 
Returns a bad status if resolution fails in both cases. + /// Tries to resolve fields within lists according to the 'ordered_array_encodings_'. + /// Returns a bad status if resolution fails for all attempted array encodings. Status ResolvePath(const SchemaPath& path, SchemaNode** node, bool* pos_field, bool* missing_field) const; private: + /// The 'array_encoding' parameter determines whether to assume one-, two-, or + /// three-level array encoding. The returned status is not logged (i.e. it's an expected + /// error). + enum ArrayEncoding { + ONE_LEVEL, + TWO_LEVEL, + THREE_LEVEL, + NUM_ARRAY_ENCODINGS + }; + /// An arbitrary limit on the number of children per schema node we support. /// Used to sanity-check Parquet schemas. static const int SCHEMA_NODE_CHILDREN_SANITY_LIMIT = 64 * 1024; + /// Maps from the array-resolution policy to the ordered array encodings that should + /// be tried during path resolution. All entries have the ONE_LEVEL encoding at the end + /// because there is no ambiguity between the one-level and the other encodings (there + /// is no harm in trying it). + static const std::vector<ArrayEncoding> ORDERED_ARRAY_ENCODINGS[]; + /// Unflattens the schema metadata from a Parquet file metadata and converts it to our /// SchemaNode representation. Returns the result in 'node' unless an error status is /// returned. Does not set the slot_desc field of any SchemaNode. @@ -167,15 +187,6 @@ class ParquetSchemaResolver { int max_def_level, int max_rep_level, int ira_def_level, int* idx, int* col_idx, SchemaNode* node) const; - /// The 'array_encoding' parameter determines whether to assume one-, two-, or - /// three-level array encoding. The returned status is not logged (i.e. it's an expected - /// error). 
- enum ArrayEncoding { - ONE_LEVEL, - TWO_LEVEL, - THREE_LEVEL - }; - Status ResolvePathHelper(ArrayEncoding array_encoding, const SchemaPath& path, SchemaNode** node, bool* pos_field, bool* missing_field) const; @@ -208,6 +219,7 @@ class ParquetSchemaResolver { const HdfsTableDescriptor& tbl_desc_; const TParquetFallbackSchemaResolution::type fallback_schema_resolution_; + const TParquetArrayResolution::type array_resolution_; const char* filename_; /// Root node of our internal schema representation populated in Init(). http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/service/query-options.cc ---------------------------------------------------------------------- diff --git a/be/src/service/query-options.cc b/be/src/service/query-options.cc index 494d98e..1df0aae 100644 --- a/be/src/service/query-options.cc +++ b/be/src/service/query-options.cc @@ -396,6 +396,26 @@ Status impala::SetQueryOption(const string& key, const string& value, } break; } + case TImpalaQueryOptions::PARQUET_ARRAY_RESOLUTION: { + if (iequals(value, "three_level") || + value == to_string(TParquetArrayResolution::THREE_LEVEL)) { + query_options->__set_parquet_array_resolution( + TParquetArrayResolution::THREE_LEVEL); + } else if (iequals(value, "two_level") || + value == to_string(TParquetArrayResolution::TWO_LEVEL)) { + query_options->__set_parquet_array_resolution( + TParquetArrayResolution::TWO_LEVEL); + } else if (iequals(value, "two_level_then_three_level") || + value == to_string(TParquetArrayResolution::TWO_LEVEL_THEN_THREE_LEVEL)) { + query_options->__set_parquet_array_resolution( + TParquetArrayResolution::TWO_LEVEL_THEN_THREE_LEVEL); + } else { + return Status(Substitute("Invalid PARQUET_ARRAY_RESOLUTION option: '$0'. 
" + "Valid options are 'THREE_LEVEL', 'TWO_LEVEL' and " + "'TWO_LEVEL_THEN_THREE_LEVEL'.", value)); + } + break; + } case TImpalaQueryOptions::MT_DOP: { StringParser::ParseResult result; const int32_t dop = http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/be/src/service/query-options.h ---------------------------------------------------------------------- diff --git a/be/src/service/query-options.h b/be/src/service/query-options.h index f5220ea..ddb1b23 100644 --- a/be/src/service/query-options.h +++ b/be/src/service/query-options.h @@ -35,7 +35,7 @@ class TQueryOptions; // the DCHECK. #define QUERY_OPTS_TABLE\ DCHECK_EQ(_TImpalaQueryOptions_VALUES_TO_NAMES.size(),\ - TImpalaQueryOptions::PARQUET_DICTIONARY_FILTERING + 1);\ + TImpalaQueryOptions::PARQUET_ARRAY_RESOLUTION + 1);\ QUERY_OPT_FN(abort_on_default_limit_exceeded, ABORT_ON_DEFAULT_LIMIT_EXCEEDED)\ QUERY_OPT_FN(abort_on_error, ABORT_ON_ERROR)\ QUERY_OPT_FN(allow_unsupported_formats, ALLOW_UNSUPPORTED_FORMATS)\ @@ -89,6 +89,7 @@ class TQueryOptions; QUERY_OPT_FN(enable_expr_rewrites, ENABLE_EXPR_REWRITES)\ QUERY_OPT_FN(decimal_v2, DECIMAL_V2)\ QUERY_OPT_FN(parquet_dictionary_filtering, PARQUET_DICTIONARY_FILTERING)\ + QUERY_OPT_FN(parquet_array_resolution, PARQUET_ARRAY_RESOLUTION)\ ; http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/common/thrift/ImpalaInternalService.thrift ---------------------------------------------------------------------- diff --git a/common/thrift/ImpalaInternalService.thrift b/common/thrift/ImpalaInternalService.thrift index ddf137a..433f3b4 100644 --- a/common/thrift/ImpalaInternalService.thrift +++ b/common/thrift/ImpalaInternalService.thrift @@ -49,6 +49,14 @@ enum TParquetFallbackSchemaResolution { NAME } +// The order of the enum values needs to be kepy in sync with +// ParquetMetadataUtils::ORDERED_ARRAY_ENCODINGS in parquet-metadata-utils.cc. 
+enum TParquetArrayResolution { + THREE_LEVEL, + TWO_LEVEL, + TWO_LEVEL_THEN_THREE_LEVEL +} + // Query options that correspond to ImpalaService.ImpalaQueryOptions, with their // respective defaults. Query options can be set in the following ways: // @@ -224,6 +232,10 @@ struct TQueryOptions { // Indicates whether to use dictionary filtering for Parquet files 53: optional bool parquet_dictionary_filtering = true + + // Policy for resolving nested array fields in Parquet files. + 54: optional TParquetArrayResolution parquet_array_resolution = + TParquetArrayResolution.TWO_LEVEL_THEN_THREE_LEVEL } // Impala currently has two types of sessions: Beeswax and HiveServer2 http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/common/thrift/ImpalaService.thrift ---------------------------------------------------------------------- diff --git a/common/thrift/ImpalaService.thrift b/common/thrift/ImpalaService.thrift index 28ee3df..87832ad 100644 --- a/common/thrift/ImpalaService.thrift +++ b/common/thrift/ImpalaService.thrift @@ -254,6 +254,18 @@ enum TImpalaQueryOptions { // Indicates whether to use dictionary filtering for Parquet files PARQUET_DICTIONARY_FILTERING, + + // Policy for resolving nested array fields in Parquet files. + // An Impala array type can have several different representations in + // a Parquet schema (three, two, or one level). There is fundamental ambiguity + // between the two and three level encodings with index-based field resolution. + // The ambiguity can manually be resolved using this query option, or by using + // PARQUET_FALLBACK_SCHEMA_RESOLUTION=name. + // The value TWO_LEVEL_THEN_THREE_LEVEL was the default mode since Impala 2.3. + // It is preserved as the default for compatibility. + // TODO: Remove the TWO_LEVEL_THEN_THREE_LEVEL mode completely or at least make + // it non-default in a compatibility breaking release. + PARQUET_ARRAY_RESOLUTION } // The summary of a DML statement. 
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList.avsc ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList.avsc b/testdata/parquet_nested_types_encodings/AmbiguousList.avsc new file mode 100644 index 0000000..d53089b --- /dev/null +++ b/testdata/parquet_nested_types_encodings/AmbiguousList.avsc @@ -0,0 +1,19 @@ +{"type": "record", +"namespace": "org.apache.impala", +"name": "nested", +"fields": [ + {"name": "ambigArray", "type": { + "type": "array", "items": { + "name": "r1", "type": "record", "fields": [ + {"name": "s2", "type": + {"name": "r1s2", "type": "record", "fields": [ + {"name": "f21", "type": ["int", "null"]}, + {"name": "f22", "type": ["int", "null"]} + ]} + }, + {"name": "F11", "type": ["int", "null"]}, + {"name": "F12", "type": ["int", "null"]} + ] + } + }} +]} http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList.json ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList.json b/testdata/parquet_nested_types_encodings/AmbiguousList.json new file mode 100644 index 0000000..04733bc --- /dev/null +++ b/testdata/parquet_nested_types_encodings/AmbiguousList.json @@ -0,0 +1,6 @@ +[{ + "ambigArray": [ + {"s2": {"f21": 21, "f22": 22}, "F11": 11, "F12": 12}, + {"s2": {"f21": 210, "f22": 220}, "F11": 110, "F12": 120} + ] +}] http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet b/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet new file mode 100644 index 0000000..fa6574e Binary files /dev/null and 
b/testdata/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet differ http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet b/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet new file mode 100644 index 0000000..211a500 Binary files /dev/null and b/testdata/parquet_nested_types_encodings/AmbiguousList_Modern.parquet differ http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/parquet_nested_types_encodings/README ---------------------------------------------------------------------- diff --git a/testdata/parquet_nested_types_encodings/README b/testdata/parquet_nested_types_encodings/README new file mode 100644 index 0000000..3b46f1d --- /dev/null +++ b/testdata/parquet_nested_types_encodings/README @@ -0,0 +1,29 @@ +The Parquet files AmbiguousList_Modern.parquet and AmbiguousList_Legacy.parquet were generated +using the kite script located here: +testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java + +The Parquet files can be regenerated by running the following commands in the testdata +directory: + +mvn package + +mvn exec:java \ + -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \ + -Dexec.args="--legacy_collection_format + data/parquet_nested_types_encodings/AmbiguousList.avsc.avsc + data/parquet_nested_types_encodings/AmbiguousList.avsc.json + data/parquet_nested_types_encodings/AmbiguousList_Legacy.parquet" + +mvn exec:java \ + -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \ + -Dexec.args=" + data/parquet_nested_types_encodings/AmbiguousList.avsc.avsc + data/parquet_nested_types_encodings/AmbiguousList.avsc.json + data/parquet_nested_types_encodings/AmbiguousList_Modern.parquet" + +The script 
takes an Avro schema and a JSON file with data and creates a Parquet file. +The --legacy_collection_format flag makes the script output a Parquet file that uses the +legacy two-level format for nested types, rather than the modern three-level format. + +More information about the Parquet nested types format can be found here: +https://github.com/apache/parquet-format/blob/master/LogicalTypes.md http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test ---------------------------------------------------------------------- diff --git a/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test new file mode 100644 index 0000000..2ae67bc --- /dev/null +++ b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test @@ -0,0 +1,71 @@ +==== +---- QUERY +# Everything resolves correctly assuming the 2-level encoding because the +# data file indeed uses the 2-level encoding. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Same as above but with name-based resolution. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly with TWO_LEVEL and POSITION because the +# data file indeed uses the 2-level encoding. 
+set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly with TWO_LEVEL and NAME because the +# data file indeed uses the 2-level encoding. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=two_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Incorrect field resolutions with THREE_LEVEL and POSITION +# because the data file uses the 2-level encoding. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +22,NULL,NULL,NULL +220,NULL,NULL,NULL +---- TYPES +int,int,int,int +==== +---- QUERY +# All fields are interpreted as missing with THREE_LEVEL and NAME. 
+set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_legacy.ambigarray; +---- RESULTS +NULL,NULL,NULL,NULL +NULL,NULL,NULL,NULL +---- TYPES +int,int,int,int +==== http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test ---------------------------------------------------------------------- diff --git a/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test new file mode 100644 index 0000000..79a9ad5 --- /dev/null +++ b/testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-modern.test @@ -0,0 +1,96 @@ +==== +---- QUERY +# The reference 's2.f22' is resolved assuming the 2-level encoding, and incorrectly +# returns the data of 'f11' because the data file uses the 3-level encoding. +# The other references are only resolvable assuming the 3-level encoding. +# This demonstrates the 2/3-level ambiguity with index-based resolution. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,11 +110,120,210,110 +---- TYPES +int,int,int,int +==== +---- QUERY +# 's2.f22' incorrectly returns the data of 'f11'. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11 +110 +---- TYPES +int +==== +---- QUERY +# 'f21' does not resolve with the 2-level encoding because it matches +# a Parquet group in the schema. 
+set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select s2.f21 from ambig_modern.ambigarray; +---- RESULTS +---- CATCH +has an incompatible Parquet schema +---- TYPES +int +==== +---- QUERY +# 'f11' and 'f12' are interpreted as missing fields. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=two_level; +select f11, f12 from ambig_modern.ambigarray; +---- RESULTS +NULL,NULL +NULL,NULL +---- TYPES +int,int +==== +---- QUERY +# Everything resolves correctly using the THREE_LEVEL array resolution policy, +# because the data file uses the 3-level encoding. +set parquet_fallback_schema_resolution=position; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly using the THREE_LEVEL array resolution policy, +# combined with resolution by name. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Everything resolves correctly using the TWO_LEVEL_THEN_THREE_LEVEL array resolution +# policy, combined with resolution by name. The problematic 's2.f22' reference +# works because the 2-level resolution fails due to a field-name mismatch, +# but the 3-level resolution succeeds. +set parquet_fallback_schema_resolution=name; +set parquet_array_resolution=two_level_then_three_level; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- RESULTS +11,12,21,22 +110,120,210,220 +---- TYPES +int,int,int,int +==== +---- QUERY +# Test setting a bad value for PARQUET_ARRAY_RESOLUTION. 
+set parquet_array_resolution=bad_value; +select f11, f12, s2.f21, s2.f22 from ambig_modern.ambigarray; +---- CATCH +Invalid PARQUET_ARRAY_RESOLUTION option: 'bad_value'. Valid options are 'THREE_LEVEL', 'TWO_LEVEL' and 'TWO_LEVEL_THEN_THREE_LEVEL'. +---- TYPES +int,int,int,int +==== http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/7d8acee8/tests/query_test/test_nested_types.py ---------------------------------------------------------------------- diff --git a/tests/query_test/test_nested_types.py b/tests/query_test/test_nested_types.py index 1c65181..ffed8b5 100644 --- a/tests/query_test/test_nested_types.py +++ b/tests/query_test/test_nested_types.py @@ -436,6 +436,89 @@ class TestParquetArrayEncodings(ImpalaTestSuite): assert len(result.data) == 1 assert result.data == ['2'] + # $ parquet-tools schema AmbiguousList_Modern.parquet + # message org.apache.impala.nested { + # required group ambigArray (LIST) { + # repeated group list { + # required group element { + # required group s2 { + # optional int32 f21; + # optional int32 f22; + # } + # optional int32 F11; + # optional int32 F12; + # } + # } + # } + # } + # $ parquet-tools cat AmbiguousList_Modern.parquet + # ambigArray: + # .list: + # ..element: + # ...s2: + # ....f21 = 21 + # ....f22 = 22 + # ...F11 = 11 + # ...F12 = 12 + # .list: + # ..element: + # ...s2: + # ....f21 = 210 + # ....f22 = 220 + # ...F11 = 110 + # ...F12 = 120 + # + # $ parquet-tools schema AmbiguousList_Legacy.parquet + # message org.apache.impala.nested { + # required group ambigArray (LIST) { + # repeated group array { + # required group s2 { + # optional int32 f21; + # optional int32 f22; + # } + # optional int32 F11; + # optional int32 F12; + # } + # } + # } + # $ parquet-tools cat AmbiguousList_Legacy.parquet + # ambigArray: + # .array: + # ..s2: + # ...f21 = 21 + # ...f22 = 22 + # ..F11 = 11 + # ..F12 = 12 + # .array: + # ..s2: + # ...f21 = 210 + # ...f22 = 220 + # ..F11 = 110 + # ..F12 = 120 + def 
test_ambiguous_list(self, vector, unique_database): + """IMPALA-4725: Tests the schema-resolution behavior with different values for the + PARQUET_ARRAY_RESOLUTION and PARQUET_FALLBACK_SCHEMA_RESOLUTION query options. + The schema of the Parquet test files is constructed to induce incorrect results + with index-based resolution and the default TWO_LEVEL_THEN_THREE_LEVEL array + resolution policy. Regardless of whether the Parquet data files use the 2-level or + 3-level encoding, incorrect results may be returned if the array resolution does + not exactly match the data files'. The name-based policy generally does not have + this problem because it avoids traversing incorrect schema paths. + """ + ambig_modern_tbl = "ambig_modern" + self._create_test_table(unique_database, ambig_modern_tbl, + "AmbiguousList_Modern.parquet", + "ambigarray array<struct<s2:struct<f21:int,f22:int>,f11:int,f12:int>>") + self.run_test_case('QueryTest/parquet-ambiguous-list-modern', + vector, unique_database) + + ambig_legacy_tbl = "ambig_legacy" + self._create_test_table(unique_database, ambig_legacy_tbl, + "AmbiguousList_Legacy.parquet", + "ambigarray array<struct<s2:struct<f21:int,f22:int>,f11:int,f12:int>>") + self.run_test_case('QueryTest/parquet-ambiguous-list-legacy', + vector, unique_database) + def _create_test_table(self, dbname, tablename, filename, columns): """Creates a table in the given database with the given name and columns. Copies the file with the given name from TESTFILE_DIR into the table."""
