[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-13 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r784505712



##
File path: c++/test/TestReader.cc
##
@@ -137,4 +139,89 @@ namespace orc {
 CheckFileWithSargs("bad_bloom_filter_1.6.11.orc", "ORC C++ 1.6.11");
 CheckFileWithSargs("bad_bloom_filter_1.6.0.orc", "ORC C++");
   }
+
+  /**
+   * Read complextypes_iceberg.orc and verify the resolved selections.
+   *
+   * The ORC file has the following schema:
+   *   struct<
+   * id:bigint,
+   * int_array:array,
+   * int_array_array:array>,
+   * int_map:map,
+   * int_map_array:array>,
+   * nested_struct:struct<
+   *   a:int,
+   *   b:array,
+   *   c:struct<
+   * d:array>>
+   *   >,
+   *   g:map
+   * >
+   *   >>
+   * >
+   *   >
+   *
+   * @param readIntents TypeReadIntents describing the section.
+   * @param expectedSelection expected TypeIds that will be selected from given
+   * readIntents.
+   */
+  void verifySelection(const RowReaderOptions::TypeReadIntents ,
+   const std::vector ) {
+std::string fileName = "complextypes_iceberg.orc";
+std::stringstream ss;
+if (const char* example_dir = std::getenv("ORC_EXAMPLE_DIR")) {
+  ss << example_dir;
+} else {
+  ss << "../../../examples";
+}
+ss << "/" << fileName;
+ReaderOptions readerOpts;
+std::unique_ptr reader =
+createReader(readLocalFile(ss.str().c_str()), readerOpts);
+
+RowReaderOptions rowReaderOpts;
+rowReaderOpts.includeTypesWithIntents(readIntents);
+std::unique_ptr rowReader =
+reader->createRowReader(rowReaderOpts);
+std::vector expected(reader->getType().getMaximumColumnId() + 1,
+   false);
+for (auto id : expectedSelection) {
+  expected[id] = true;
+}
+ASSERT_THAT(rowReader->getSelectedColumns(), ElementsAreArray(expected));
+  }
+
+  TEST(TestReadIntent, testListAll) {
+// select all of int_array_array.
+verifySelection({{4, ReadIntent_ALL}}, {0, 4, 5, 6});
+  }
+
+  TEST(TestReadIntent, testListOffsets) {
+// select only the offsets of int_array_array.
+verifySelection({{4, ReadIntent_OFFSETS}}, {0, 4});

Review comment:
   Added in TestReadIntent.testRowBatchContent

##
File path: c++/test/TestReader.cc
##
@@ -137,4 +139,89 @@ namespace orc {
 CheckFileWithSargs("bad_bloom_filter_1.6.11.orc", "ORC C++ 1.6.11");
 CheckFileWithSargs("bad_bloom_filter_1.6.0.orc", "ORC C++");
   }
+
+  /**
+   * Read complextypes_iceberg.orc and verify the resolved selections.
+   *
+   * The ORC file has the following schema:
+   *   struct<
+   * id:bigint,
+   * int_array:array,
+   * int_array_array:array>,
+   * int_map:map,
+   * int_map_array:array>,
+   * nested_struct:struct<
+   *   a:int,
+   *   b:array,
+   *   c:struct<
+   * d:array>>
+   *   >,
+   *   g:map
+   * >
+   *   >>
+   * >
+   *   >
+   *
+   * @param readIntents TypeReadIntents describing the section.
+   * @param expectedSelection expected TypeIds that will be selected from given
+   * readIntents.
+   */
+  void verifySelection(const RowReaderOptions::TypeReadIntents ,
+   const std::vector ) {
+std::string fileName = "complextypes_iceberg.orc";
+std::stringstream ss;
+if (const char* example_dir = std::getenv("ORC_EXAMPLE_DIR")) {
+  ss << example_dir;
+} else {
+  ss << "../../../examples";
+}
+ss << "/" << fileName;
+ReaderOptions readerOpts;
+std::unique_ptr reader =
+createReader(readLocalFile(ss.str().c_str()), readerOpts);
+
+RowReaderOptions rowReaderOpts;
+rowReaderOpts.includeTypesWithIntents(readIntents);
+std::unique_ptr rowReader =
+reader->createRowReader(rowReaderOpts);
+std::vector expected(reader->getType().getMaximumColumnId() + 1,
+   false);
+for (auto id : expectedSelection) {
+  expected[id] = true;
+}
+ASSERT_THAT(rowReader->getSelectedColumns(), ElementsAreArray(expected));
+  }
+
+  TEST(TestReadIntent, testListAll) {
+// select all of int_array_array.
+verifySelection({{4, ReadIntent_ALL}}, {0, 4, 5, 6});
+  }
+
+  TEST(TestReadIntent, testListOffsets) {
+// select only the offsets of int_array_array.
+verifySelection({{4, ReadIntent_OFFSETS}}, {0, 4});
+
+// select only the offsets of int_array_array.item.
+verifySelection({{4, ReadIntent_OFFSETS}, {5, ReadIntent_OFFSETS}},
+{0, 4, 5});

Review comment:
   Added as the 3rd test in TestReadIntent.testListOffsets




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this 

[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-12 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r783509278



##
File path: c++/test/TestReader.cc
##
@@ -137,4 +139,89 @@ namespace orc {
 CheckFileWithSargs("bad_bloom_filter_1.6.11.orc", "ORC C++ 1.6.11");
 CheckFileWithSargs("bad_bloom_filter_1.6.0.orc", "ORC C++");
   }
+
+  /**
+   * Read complextypes_iceberg.orc and verify the resolved selections.
+   *
+   * The ORC file has the following schema:
+   *   struct<
+   * id:bigint,
+   * int_array:array,
+   * int_array_array:array>,
+   * int_map:map,
+   * int_map_array:array>,
+   * nested_struct:struct<
+   *   a:int,
+   *   b:array,
+   *   c:struct<
+   * d:array>>
+   *   >,
+   *   g:map
+   * >
+   *   >>
+   * >
+   *   >
+   *
+   * @param readIntents TypeReadIntents describing the section.

Review comment:
   I believe this has been fixed at commit 
[11d66a2](https://github.com/apache/orc/pull/990/commits/11d66a2035f1d91346e1e44986e7ffd4e20e5b0f)

##
File path: c++/test/TestReader.cc
##
@@ -137,4 +139,89 @@ namespace orc {
 CheckFileWithSargs("bad_bloom_filter_1.6.11.orc", "ORC C++ 1.6.11");
 CheckFileWithSargs("bad_bloom_filter_1.6.0.orc", "ORC C++");
   }
+
+  /**
+   * Read complextypes_iceberg.orc and verify the resolved selections.
+   *
+   * The ORC file has the following schema:
+   *   struct<
+   * id:bigint,
+   * int_array:array,
+   * int_array_array:array>,
+   * int_map:map,
+   * int_map_array:array>,
+   * nested_struct:struct<
+   *   a:int,
+   *   b:array,
+   *   c:struct<
+   * d:array>>
+   *   >,
+   *   g:map
+   * >
+   *   >>
+   * >
+   *   >
+   *
+   * @param readIntents TypeReadIntents describing the section.
+   * @param expectedSelection expected TypeIds that will be selected from given
+   * readIntents.
+   */
+  void verifySelection(const RowReaderOptions::TypeReadIntents ,
+   const std::vector ) {
+std::string fileName = "complextypes_iceberg.orc";
+std::stringstream ss;
+if (const char* example_dir = std::getenv("ORC_EXAMPLE_DIR")) {
+  ss << example_dir;
+} else {
+  ss << "../../../examples";
+}
+ss << "/" << fileName;
+ReaderOptions readerOpts;
+std::unique_ptr reader =
+createReader(readLocalFile(ss.str().c_str()), readerOpts);
+
+RowReaderOptions rowReaderOpts;
+rowReaderOpts.includeTypesWithIntents(readIntents);
+std::unique_ptr rowReader =
+reader->createRowReader(rowReaderOpts);
+std::vector expected(reader->getType().getMaximumColumnId() + 1,
+   false);
+for (auto id : expectedSelection) {
+  expected[id] = true;
+}
+ASSERT_THAT(rowReader->getSelectedColumns(), ElementsAreArray(expected));
+  }
+
+  TEST(TestReadIntent, testListAll) {
+// select all of int_array_array.
+verifySelection({{4, ReadIntent_ALL}}, {0, 4, 5, 6});
+  }
+
+  TEST(TestReadIntent, testListOffsets) {
+// select only the offsets of int_array_array.
+verifySelection({{4, ReadIntent_OFFSETS}}, {0, 4});

Review comment:
   OK. Will add such test in my next commit.

##
File path: c++/test/TestReader.cc
##
@@ -137,4 +139,89 @@ namespace orc {
 CheckFileWithSargs("bad_bloom_filter_1.6.11.orc", "ORC C++ 1.6.11");
 CheckFileWithSargs("bad_bloom_filter_1.6.0.orc", "ORC C++");
   }
+
+  /**
+   * Read complextypes_iceberg.orc and verify the resolved selections.
+   *
+   * The ORC file has the following schema:
+   *   struct<
+   * id:bigint,
+   * int_array:array,
+   * int_array_array:array>,
+   * int_map:map,
+   * int_map_array:array>,
+   * nested_struct:struct<
+   *   a:int,
+   *   b:array,
+   *   c:struct<
+   * d:array>>
+   *   >,
+   *   g:map
+   * >
+   *   >>
+   * >
+   *   >
+   *
+   * @param readIntents TypeReadIntents describing the section.
+   * @param expectedSelection expected TypeIds that will be selected from given
+   * readIntents.
+   */
+  void verifySelection(const RowReaderOptions::TypeReadIntents ,
+   const std::vector ) {
+std::string fileName = "complextypes_iceberg.orc";
+std::stringstream ss;
+if (const char* example_dir = std::getenv("ORC_EXAMPLE_DIR")) {
+  ss << example_dir;
+} else {
+  ss << "../../../examples";
+}
+ss << "/" << fileName;
+ReaderOptions readerOpts;
+std::unique_ptr reader =
+createReader(readLocalFile(ss.str().c_str()), readerOpts);
+
+RowReaderOptions rowReaderOpts;
+rowReaderOpts.includeTypesWithIntents(readIntents);
+std::unique_ptr rowReader =
+reader->createRowReader(rowReaderOpts);
+std::vector expected(reader->getType().getMaximumColumnId() + 1,
+   

[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-11 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r782589414



##
File path: c++/src/Reader.hh
##
@@ -88,10 +88,12 @@ namespace orc {
 // Select a field by id
 void updateSelectedByFieldId(std::vector& selectedColumns, uint64_t 
fieldId);
 // Select a type by id
-void updateSelectedByTypeId(std::vector& selectedColumns, uint64_t 
typeId);
+void updateSelectedByTypeId(std::vector& selectedColumns, uint64_t 
typeId,
+const RowReaderOptions::IdReadIntentMap& 
idReadIntentMap);
 
 // Select all of the recursive children of the given type.
-void selectChildren(std::vector& selectedColumns, const Type& type);
+void selectChildren(std::vector& selectedColumns, const Type& type,
+const RowReaderOptions::IdReadIntentMap& 
idReadIntentMap);

Review comment:
   Clarified the comment as follow
   ```
   // Select a type id of the given type.
   // This function may also select all of the recursive children of the 
given type
   // depending on the read intent of that type in idReadIntentMap.
   void selectChildren(std::vector& selectedColumns, const Type& type,
   const RowReaderOptions::IdReadIntentMap& 
idReadIntentMap);
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-11 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r782582966



##
File path: c++/src/Reader.cc
##
@@ -41,6 +41,8 @@ namespace orc {
 "1.6.0", "1.6.1", "1.6.2", "1.6.3", "1.6.4", "1.6.5", "1.6.6", "1.6.7", 
"1.6.8",
 "1.6.9", "1.6.10", "1.6.11", "1.7.0"};
 
+  static const RowReaderOptions::IdReadIntentMap EMPTY_ID_READ_INTENT_MAP;

Review comment:
   I replaced it with function `emptyIdReadIntentMap()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-11 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r782518019



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,25 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map type of .
+ */
+typedef std::map TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are

Review comment:
   Removed the sentece.

##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,25 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map type of .
+ */
+typedef std::map TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are
+ * labeled in a preorder traversal of the tree. The parent types are

Review comment:
   Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-11 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r782517663



##
File path: c++/include/orc/Common.hh
##
@@ -122,6 +122,19 @@ namespace orc {
 StreamKind_BLOOM_FILTER_UTF8 = 8
   };
 
+  /**
+   * Specific read intention when selecting a certain TypeId.
+   * This enum currently only being utilized by LIST type selection.
+   *
+   * TODO: extend this to support MAP and UNION type.

Review comment:
   Removed the TODO. I will file a follow-up JIRA after this PR merged.

##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,25 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map type of .
+ */
+typedef std::map TypeReadIntents;

Review comment:
   Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-08 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780619426



##
File path: c++/include/orc/Common.hh
##
@@ -122,6 +122,11 @@ namespace orc {
 StreamKind_BLOOM_FILTER_UTF8 = 8
   };
 
+  enum ArrayReadIntent {

Review comment:
   Renamed back to `ReadIntent`, gave description comments, and TODO for 
MAP and UNION type.

##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,26 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of  used as a parameter in
+ * RowReaderOptions::includeTypesWithIntents.

Review comment:
   Deleted in latest commit.

##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,26 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of  used as a parameter in
+ * RowReaderOptions::includeTypesWithIntents.
+ */
+typedef std::map TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ArrayReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are
+ * labeled in a preorder traversal of the tree. The parent types are
+ * automatically selected, but the children are not.
+ *
+ * This option clears any previous setting of the selected columns or
+ * types.
+ * @param typesWithIntents a map of .

Review comment:
   Deleted in latest commit.

##
File path: c++/include/orc/Reader.hh
##
@@ -266,6 +286,12 @@ namespace orc {
  * Get desired timezone to return data of timestamp type
  */
 const std::string& getTimezoneName() const;
+
+/**
+ * Get the  map that was supplied through

Review comment:
   Deleted in latest commit.

##
File path: c++/include/orc/Reader.hh
##
@@ -565,6 +591,15 @@ namespace orc {
  */
 virtual void seekToRow(uint64_t rowNumber) = 0;
 
+/**
+ * Get an ArrayReadIntent for a given typeId.
+ * @param typeId the type id to look up.
+ * @return an ArrayReadIntent that was specified for given typeId through
+ *RowReaderOptions::includeTypesWithIntents. If no ArrayReadIntent was
+ *specified for typeId, return ArrayReadIntent_ALL.
+ */
+virtual ArrayReadIntent getReadIntent(uint64_t typeId) const = 0;

Review comment:
   Deleted in latest commit.

##
File path: c++/src/ColumnReader.cc
##
@@ -993,6 +993,9 @@ namespace orc {
 void nextInternal(ColumnVectorBatch& rowBatch,
   uint64_t numValues,
   char *notNull);
+
+void readArrayIndices(const int64_t *arrayOffsets, uint64_t arrayCount,

Review comment:
   Deleted in latest commit.

##
File path: c++/src/Reader.cc
##
@@ -290,6 +291,15 @@ namespace orc {
 return selectedColumns;
   }
 
+  ArrayReadIntent RowReaderImpl::getReadIntent(uint64_t typeId) const {
+auto elem = readIntents.find(typeId);
+if (elem == readIntents.end()) {
+  return ArrayReadIntent_ALL;
+} else {
+  return elem->second;

Review comment:
   Created 4 new test:
   
   - TestReadIntent.testListAll
   - TestReadIntent.testListOffsets
   - TestReadIntent.testListAllAndOffsets
   - TestReadIntent.testListConflictingIntent
   

##
File path: c++/src/ColumnReader.cc
##
@@ -1006,9 +1009,12 @@ namespace orc {
 if (stream == nullptr)
   throw ParseError("LENGTH stream not found in List column");
 rle = createRleDecoder(std::move(stream), false, vers, memoryPool);
-const Type& childType = *type.getSubtype(0);
-if (selectedColumns[static_cast(childType.getColumnId())]) {
-  child = buildReader(childType, stripe);
+ArrayReadIntent intent = stripe.getReadIntent(columnId);
+if (intent == ArrayReadIntent_ALL) {
+  const Type  = *type.getSubtype(0);
+  if (selectedColumns[static_cast(childType.getColumnId())]) {
+child = buildReader(childType, stripe);
+  }

Review comment:
   I replace this with 4 new tests at `c++/test/TestReader.cc`.
   However, they stop at column selection verification and do not read a row 
batch.

##
File path: c++/src/Options.hh
##
@@ -174,20 +175,36 @@ namespace orc {
 privateBits->selection = ColumnSelection_FIELD_IDS;
 privateBits->includedColumnIndexes.assign(include.begin(), include.end());
 privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::include(const std::list& 
include) {
 privateBits->selection = ColumnSelection_NAMES;
 privateBits->includedColumnNames.assign(include.begin(), include.end());
 privateBits->includedColumnIndexes.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::includeTypes(const std::list& 
types) 

[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780620496



##
File path: c++/src/Vector.cc
##
@@ -242,7 +242,6 @@ namespace orc {
 }
 return false;
   }
-

Review comment:
   Reverted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780620362



##
File path: c++/src/Options.hh
##
@@ -174,20 +175,36 @@ namespace orc {
 privateBits->selection = ColumnSelection_FIELD_IDS;
 privateBits->includedColumnIndexes.assign(include.begin(), include.end());
 privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::include(const std::list& 
include) {
 privateBits->selection = ColumnSelection_NAMES;
 privateBits->includedColumnNames.assign(include.begin(), include.end());
 privateBits->includedColumnIndexes.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::includeTypes(const std::list& 
types) {
 privateBits->selection = ColumnSelection_TYPE_IDS;
 privateBits->includedColumnIndexes.assign(types.begin(), types.end());
 privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();
+return *this;
+  }
+
+  RowReaderOptions&
+  RowReaderOptions::includeTypesWithIntents(const TypeReadIntents& 
typesAndIntents) {
+privateBits->selection = ColumnSelection_TYPE_IDS;
+privateBits->includedColumnIndexes.clear();
+privateBits->readIntents.clear();
+for (const auto& typeIntentPair : typesAndIntents ) {
+  privateBits->readIntents[typeIntentPair.first] = typeIntentPair.second;

Review comment:
   It is now resolved by relying on `ColumnSelector::selectParents()` to 
override the decision.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780620218



##
File path: c++/src/ColumnReader.cc
##
@@ -1006,9 +1009,12 @@ namespace orc {
 if (stream == nullptr)
   throw ParseError("LENGTH stream not found in List column");
 rle = createRleDecoder(std::move(stream), false, vers, memoryPool);
-const Type& childType = *type.getSubtype(0);
-if (selectedColumns[static_cast(childType.getColumnId())]) {
-  child = buildReader(childType, stripe);
+ArrayReadIntent intent = stripe.getReadIntent(columnId);
+if (intent == ArrayReadIntent_ALL) {
+  const Type  = *type.getSubtype(0);
+  if (selectedColumns[static_cast(childType.getColumnId())]) {
+child = buildReader(childType, stripe);
+  }

Review comment:
   I replace this with 4 new tests at `c++/test/TestReader.cc`.
   However, they stop at column selection verification and do not read a row 
batch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780620049



##
File path: c++/src/Reader.cc
##
@@ -290,6 +291,15 @@ namespace orc {
 return selectedColumns;
   }
 
+  ArrayReadIntent RowReaderImpl::getReadIntent(uint64_t typeId) const {
+auto elem = readIntents.find(typeId);
+if (elem == readIntents.end()) {
+  return ArrayReadIntent_ALL;
+} else {
+  return elem->second;

Review comment:
   Created 4 new test:
   
   - TestReadIntent.testListAll
   - TestReadIntent.testListOffsets
   - TestReadIntent.testListAllAndOffsets
   - TestReadIntent.testListConflictingIntent
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780619777



##
File path: c++/include/orc/Reader.hh
##
@@ -565,6 +591,15 @@ namespace orc {
  */
 virtual void seekToRow(uint64_t rowNumber) = 0;
 
+/**
+ * Get an ArrayReadIntent for a given typeId.
+ * @param typeId the type id to look up.
+ * @return an ArrayReadIntent that was specified for given typeId through
+ *RowReaderOptions::includeTypesWithIntents. If no ArrayReadIntent was
+ *specified for typeId, return ArrayReadIntent_ALL.
+ */
+virtual ArrayReadIntent getReadIntent(uint64_t typeId) const = 0;

Review comment:
   Deleted in latest commit.

##
File path: c++/src/ColumnReader.cc
##
@@ -993,6 +993,9 @@ namespace orc {
 void nextInternal(ColumnVectorBatch& rowBatch,
   uint64_t numValues,
   char *notNull);
+
+void readArrayIndices(const int64_t *arrayOffsets, uint64_t arrayCount,

Review comment:
   Deleted in latest commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780619681



##
File path: c++/include/orc/Reader.hh
##
@@ -266,6 +286,12 @@ namespace orc {
  * Get desired timezone to return data of timestamp type
  */
 const std::string& getTimezoneName() const;
+
+/**
+ * Get the  map that was supplied through

Review comment:
   Deleted in latest commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780619653



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,26 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of  used as a parameter in
+ * RowReaderOptions::includeTypesWithIntents.
+ */
+typedef std::map TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ArrayReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are
+ * labeled in a preorder traversal of the tree. The parent types are
+ * automatically selected, but the children are not.
+ *
+ * This option clears any previous setting of the selected columns or
+ * types.
+ * @param typesWithIntents a map of .

Review comment:
   Deleted in latest commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780619470



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,26 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of  used as a parameter in
+ * RowReaderOptions::includeTypesWithIntents.

Review comment:
   Deleted in latest commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-07 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r780619426



##
File path: c++/include/orc/Common.hh
##
@@ -122,6 +122,11 @@ namespace orc {
 StreamKind_BLOOM_FILTER_UTF8 = 8
   };
 
+  enum ArrayReadIntent {

Review comment:
   Renamed back to `ReadIntent`, gave description comments, and TODO for 
MAP and UNION type.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-05 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r779228415



##
File path: c++/src/Options.hh
##
@@ -174,20 +175,36 @@ namespace orc {
 privateBits->selection = ColumnSelection_FIELD_IDS;
 privateBits->includedColumnIndexes.assign(include.begin(), include.end());
 privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::include(const std::list& 
include) {
 privateBits->selection = ColumnSelection_NAMES;
 privateBits->includedColumnNames.assign(include.begin(), include.end());
 privateBits->includedColumnIndexes.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::includeTypes(const std::list& 
types) {
 privateBits->selection = ColumnSelection_TYPE_IDS;
 privateBits->includedColumnIndexes.assign(types.begin(), types.end());
 privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();
+return *this;
+  }
+
+  RowReaderOptions&
+  RowReaderOptions::includeTypesWithIntents(const TypeReadIntents& 
typesAndIntents) {
+privateBits->selection = ColumnSelection_TYPE_IDS;
+privateBits->includedColumnIndexes.clear();
+privateBits->readIntents.clear();
+for (const auto& typeIntentPair : typesAndIntents ) {
+  privateBits->readIntents[typeIntentPair.first] = typeIntentPair.second;

Review comment:
   I think I missed a case here.
   Suppose we select two nested list columns, both with 
`ArrayReadIntent_OFFSETS`. We need to override the parent list intent to 
`ArrayReadIntent_ALL` in `ColumnSelector::selectParents()`. Otherwise, the 
children list will not be read.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-05 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r779226390



##
File path: c++/include/orc/Reader.hh
##
@@ -565,6 +591,15 @@ namespace orc {
  */
 virtual void seekToRow(uint64_t rowNumber) = 0;
 
+/**
+ * Get an ArrayReadIntent for a given typeId.
+ * @param typeId the type id to look up.
+ * @return an ArrayReadIntent that was specified for given typeId through
+ *RowReaderOptions::includeTypesWithIntents. If no ArrayReadIntent was
+ *specified for typeId, return ArrayReadIntent_ALL.
+ */
+virtual ArrayReadIntent getReadIntent(uint64_t typeId) const = 0;

Review comment:
   My plan is to return `ArrayReadIntent_ALL` if typeId is not selected or 
selected without any specific read intent (in case 
`RowReaderOptions::includeTypes()` is being used). I understand the naming is a 
bit confusing. Will think more about this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-05 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r779222655



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,26 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of  used as a parameter in
+ * RowReaderOptions::includeTypesWithIntents.
+ */
+typedef std::map TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ArrayReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are
+ * labeled in a preorder traversal of the tree. The parent types are
+ * automatically selected, but the children are not.
+ *
+ * This option clears any previous setting of the selected columns or
+ * types.
+ * @param typesWithIntents a map of .
+ * @return this
+ */
+RowReaderOptions&
+includeTypesWithIntents(const TypeReadIntents& typesWithIntents);

Review comment:
   Suppose a customer table as described at 
https://impala.apache.org/docs/build/html/topics/impala_complex_types.html#complex_sample_schema
   
   And a query
   ```
   SELECT
 c.c_name,
 o.o_orderkey,
 l.pos
   FROM customer c
 INNER JOIN c.c_orders o
 INNER JOIN o.o_lineitems l
   LIMIT 5
   ```
   We need to materialize `c.c_orders` (read all) to read `o.o_orderkey` but 
can omit materialization of `o.o_lineitems` elements (read offsets only) since 
we only need `o.o_lineitems.pos`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-05 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r779218710



##
File path: c++/src/Reader.cc
##
@@ -290,6 +291,15 @@ namespace orc {
 return selectedColumns;
   }
 
+  ArrayReadIntent RowReaderImpl::getReadIntent(uint64_t typeId) const {
+auto elem = readIntents.find(typeId);
+if (elem == readIntents.end()) {
+  return ArrayReadIntent_ALL;
+} else {
+  return elem->second;

Review comment:
   The TestColumnReaderEncoded.testListOffsetsOnly that I add only mock up 
to StripeStreamsImpl::getReadIntent(), and I believe it didn't touch this part.
   
   I will try to add more comprehensive tests. Do you know any existing unit 
tests I can refer to?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-05 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r779216881



##
File path: c++/src/ColumnReader.cc
##
@@ -1006,9 +1009,12 @@ namespace orc {
 if (stream == nullptr)
   throw ParseError("LENGTH stream not found in List column");
 rle = createRleDecoder(std::move(stream), false, vers, memoryPool);
-const Type& childType = *type.getSubtype(0);
-if (selectedColumns[static_cast(childType.getColumnId())]) {
-  child = buildReader(childType, stripe);
+ArrayReadIntent intent = stripe.getReadIntent(columnId);
+if (intent == ArrayReadIntent_ALL) {
+  const Type  = *type.getSubtype(0);
+  if (selectedColumns[static_cast(childType.getColumnId())]) {
+child = buildReader(childType, stripe);
+  }

Review comment:
   It is tested at 
[TestColumnReaderEncoded.testListOffsetsOnly](https://github.com/apache/orc/pull/990/files#diff-e0eca596eb78b0688bdb1d04e6a23d5b7c101ff686e99050dbe15147d275ec47R1474),
 which is modified from TestColumnReaderEncoded.testList.
   
   The assertion is
   ```
   ASSERT_EQ(0, longs->numElements);
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-05 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r779214232



##
File path: c++/include/orc/Common.hh
##
@@ -122,6 +122,11 @@ namespace orc {
 StreamKind_BLOOM_FILTER_UTF8 = 8
   };
 
+  enum ArrayReadIntent {

Review comment:
   Yes, we can extend this enum to support MAP and UNION type. However, I'd 
like to focus on LIST first in this PR, if it is OK. I'd be happy to extend it 
in the future JIRA.
   
   Renaming this enum back to 'ReadIntent' sounds better if we anticipate 
extending it for MAP and UNION.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-04 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r778486850



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,28 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of > used as a parameter in
+ * RowReaderOptions::includeTypesAndIntents.
+ */
+typedef std::map> TypeReadIntents;

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-04 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r778486781



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,28 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of > used as a parameter in
+ * RowReaderOptions::includeTypesAndIntents.
+ */
+typedef std::map> TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are
+ * labeled in a preorder traversal of the tree. The parent types are
+ * automatically selected, but the children are not. If a type id is
+ * selected with an empty ReadIntent set, the empty set will be replaced
+ * by {ReadIntent_DATA}.
+ *
+ * This option clears any previous setting of the selected columns or
+ * types.
+ * @param typesAndIntents a map of >.
+ * @return this
+ */
+RowReaderOptions&
+includeTypesAndIntents(const TypeReadIntents& typesAndIntents);

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-04 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r778486717



##
File path: c++/include/orc/Reader.hh
##
@@ -149,6 +149,28 @@ namespace orc {
  */
 RowReaderOptions& includeTypes(const std::list& types);
 
+/**
+ * A map of > used as a parameter in
+ * RowReaderOptions::includeTypesAndIntents.
+ */
+typedef std::map> TypeReadIntents;
+
+/**
+ * Selects which type ids to read and specific ReadIntents for each
+ * type id. The root type is always 0 and the rest of the types are
+ * labeled in a preorder traversal of the tree. The parent types are
+ * automatically selected, but the children are not. If a type id is
+ * selected with an empty ReadIntent set, the empty set will be replaced
+ * by {ReadIntent_DATA}.

Review comment:
   Now we default to ArrayReadIntent_ALL




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-04 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r778486531



##
File path: c++/src/Options.hh
##
@@ -174,19 +175,39 @@ namespace orc {
 privateBits->selection = ColumnSelection_FIELD_IDS;
 privateBits->includedColumnIndexes.assign(include.begin(), include.end());
 privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::include(const std::list& 
include) {
 privateBits->selection = ColumnSelection_NAMES;
 privateBits->includedColumnNames.assign(include.begin(), include.end());
 privateBits->includedColumnIndexes.clear();
+privateBits->readIntents.clear();
 return *this;
   }
 
   RowReaderOptions& RowReaderOptions::includeTypes(const std::list& 
types) {
 privateBits->selection = ColumnSelection_TYPE_IDS;
 privateBits->includedColumnIndexes.assign(types.begin(), types.end());
+privateBits->readIntents.clear();
+privateBits->includedColumnNames.clear();
+privateBits->readIntents.clear();

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-04 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r778441116



##
File path: c++/include/orc/Common.hh
##
@@ -122,6 +122,11 @@ namespace orc {
 StreamKind_BLOOM_FILTER_UTF8 = 8
   };
 
+  enum ReadIntent {

Review comment:
   Following the discussion in 
[Vector.hh](https://github.com/apache/orc/pull/990/files/155cf1a171bcb498731bc95c4a939a445fa23114#r777800353),
 I am thinking to rewrite this enum as follow
   
   ```
 enum ArrayReadIntent {
   ArrayReadIntent_ALL = 0,
   ArrayReadIntent_OFFSETS = 1
 };
   ```
   
   So the choice is either to read all (data+offsets) or offsets only (we can 
not read data without reading offsets).
   
   Other than ListColumnReader, it looks like MapColumnReader and 
UnionColumnReader can benefit from this selective read intent, such as when 
dealing with "select count(*)" kind of query where no materialization is 
required. This is out of the scope of ORC-450 though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [orc] rizaon commented on a change in pull request #990: ORC-450: Add option to avoid materialization of array elements

2022-01-04 Thread GitBox


rizaon commented on a change in pull request #990:
URL: https://github.com/apache/orc/pull/990#discussion_r778437357



##
File path: c++/include/orc/Vector.hh
##
@@ -182,12 +182,20 @@ namespace orc {
 uint64_t getMemoryUsage();
 bool hasVariableLength();
 
+bool hasElements = false;
+bool hasPositions = false;
+
 /**
  * The offset of the first element of each list.
  * The length of list i is offsets[i+1] - offsets[i].
  */
 DataBuffer offsets;
 
+/**
+ * Position of each element in their respective array.
+ */
+DataBuffer pos;

Review comment:
   Agree, ListColumnReader can just populate the offsets and leave the pos 
calculation done by the client. Will remove this field.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org