[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

GitBox Mon, 01 Feb 2021 17:52:10 -0800


westonpace commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r568266959




##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -625,9 +621,17 @@ class StructDictionary {
 
  private:
   Status AddOne(Datum column, std::shared_ptr<Int32Array>* fused_indices) {
-    if (column.type()->id() != Type::DICTIONARY) {
-      ARROW_ASSIGN_OR_RAISE(column, 
compute::DictionaryEncode(std::move(column)));
+    if (column.type()->id() == Type::DICTIONARY) {
+      // compute::DictionaryEncode doesn't support dictionary and, even if it 
did, it
+      // would be a null op and return a flat dictionary.  In order to group 
by dictionary
+      // we would need to be able to create a nested dictionary.
+      return Status::NotImplemented(
+          "Cannot use column of type dictionary as grouping criteria");

Review comment:
       Yes, I keep thinking of dictionary arrays as arrays of dictionaries 
[{"a": 1, "b": 2}, ...] and not simply a different encoding.  My brain was off 
on a tangent when I wrote this column.
   
   I added the logic to encode nulls by decoding (casting) the dictionary first 
and then re-encoding.  I also added a test case for this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

Reply via email to