[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

GitBox Mon, 11 Jan 2021 07:46:09 -0800


pitrou commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555126282




##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -230,9 +236,9 @@ std::vector<KeyValuePartitioning::Key> 
DirectoryPartitioning::ParseKeys(
   return keys;
 }
 
-inline util::optional<int> NextValid(const std::vector<Scalar*>& values, int 
first_null) {
+inline util::optional<int> NextValid(const ScalarVector& values, int 
first_null) {
   auto it = std::find_if(values.begin() + first_null + 1, values.end(),
-                         [](Scalar* v) { return v != nullptr; });
+                         [](const std::shared_ptr<Scalar> v) { return v != 
nullptr; });

Review comment:
       `const&`?

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
       Can you comment on this? It's not obvious why we're limited by the size 
of an `int16_t`.

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -288,16 +288,58 @@ class ARROW_DS_EXPORT PartitioningOrFactory {
 /// \brief Assemble lists of indices of identical rows.
 ///
 /// \param[in] by A StructArray whose columns will be used as grouping 
criteria.
-/// \return A StructArray mapping unique rows (in field "values", represented 
as a
-///         StructArray with the same fields as `by`) to lists of indices where
-///         that row appears (in field "groupings").
+///               Top level nulls are invalid, as are empty criteria (no 
grouping
+///               columns).
+/// \return A array of type `struct<values: by.type, groupings: list<int64>>`,
+///         which is a mapping from unique rows (field "values") to lists of
+///         indices into `by` where that row appears (field "groupings").
+///
+/// For example,
+///   MakeGroupings([
+///       {"a": "ex",  "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "why", "b": 0},
+///       {"a": "ex",  "b": 0},
+///       {"a": "why", "b": 1}
+///   ]) == [
+///       {"values": {"a": "ex",  "b": 0}, "groupings": [0, 1, 4]},
+///       {"values": {"a": "why", "b": 0}, "groupings": [2, 3]},
+///       {"values": {"a": "why", "b": 1}, "groupings": [5]}
+///   ]
 ARROW_DS_EXPORT
 Result<std::shared_ptr<StructArray>> MakeGroupings(const StructArray& by);
 
-/// \brief Produce slices of an Array which correspond to the provided 
groupings.
+/// \brief Produce a ListArray whose slots are slices of `array` which 
correspond to the

Review comment:
       Perhaps unimportant, but at least in Python a slice is a contiguous 
subsequence. That said, the example dissipates the ambiguity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] pitrou commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

Reply via email to