Re: [PR] Support Partitioning Data by Dictionary Encoded String Array Types [arrow-datafusion]

via GitHub Mon, 23 Oct 2023 13:44:11 -0700


alamb commented on code in PR #7896:
URL: https://github.com/apache/arrow-datafusion/pull/7896#discussion_r1369238266



##########
datafusion/sqllogictest/test_files/insert_to_external.slt:
##########
@@ -40,8 +40,73 @@ STORED AS CSV
 WITH HEADER ROW
 LOCATION '../../testing/data/csv/aggregate_test_100.csv'
 
-# test_insert_into
 
+statement ok
+CREATE EXTERNAL TABLE dictionary_encoded_parquet

Review Comment:
   I don't see any "dictionary" here -- why is it called 
`dictionary_encoded_parquet`?



##########
datafusion/sqllogictest/test_files/insert_to_external.slt:
##########
@@ -40,8 +40,73 @@ STORED AS CSV
 WITH HEADER ROW
 LOCATION '../../testing/data/csv/aggregate_test_100.csv'
 
-# test_insert_into
 
+statement ok
+CREATE EXTERNAL TABLE dictionary_encoded_parquet
+STORED AS parquet
+LOCATION '../../parquet-testing/data/alltypes_dictionary.parquet';
+
+query TTT
+describe dictionary_encoded_parquet;
+----
+id Int32 YES
+bool_col Boolean YES
+tinyint_col Int32 YES
+smallint_col Int32 YES
+int_col Int32 YES
+bigint_col Int64 YES
+float_col Float32 YES
+double_col Float64 YES
+date_string_col Binary YES
+string_col Binary YES
+timestamp_col Timestamp(Nanosecond, None) YES
+
+statement ok
+CREATE EXTERNAL TABLE dictionary_encoded_parquet_partitioned(
+  id int,
+  bool_col bool,
+  tinyint_col int,
+  smallint_col int,
+  int_col int,
+  bigint_col bigint,
+  float_col float,
+  double_col double,
+  date_string_col varchar,
+  string_col varchar,
+  timestamp_col varchar
+) 
+STORED AS parquet
+LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned'
+PARTITIONED BY (date_string_col, string_col, timestamp_col)
+OPTIONS(
+create_local_path 'true',
+insert_mode 'append_new_files',
+);
+
+query IBIIIIRRTTT
+insert into dictionary_encoded_parquet_partitioned 
+select 
+  id, 
+  bool_col, 
+  tinyint_col,
+  smallint_col,
+  int_col,
+  bigint_col,
+  float_col,
+  double_col,
+  date_string_col,
+  string_col,
+  timestamp_col::TEXT from dictionary_encoded_parquet;
+----
+2
+
+query IBIIIIRRTTT
+select * from dictionary_encoded_parquet_partitioned order by (id);
+----
+0 true 0 0 0 0 0 0 01%2F01%2F09 0 2009-01-01T00:00:00
+1 false 1 1 1 10 1.1 10.1 01%2F01%2F09 1 2009-01-01T00:01:00

Review Comment:
   This might be an easier way to create a data with dictionary encoded data:
   
   ```
   ❯ create table test as values (arrow_cast('foo', 'Dictionary(Int32, 
Utf8)')), (arrow_cast('bar', 'Dictionary(Int32, Utf8)'));
   0 rows in set. Query took 0.002 seconds.
   
   ❯ select arrow_typeof(column1) from test;
   +----------------------------+
   | arrow_typeof(test.column1) |
   +----------------------------+
   | Dictionary(Int32, Utf8)    |
   | Dictionary(Int32, Utf8)    |
   +----------------------------+
   2 rows in set. Query took 0.001 seconds.
   
   ❯ select * from test;
   +---------+
   | column1 |
   +---------+
   | foo     |
   | bar     |
   +---------+
   2 rows in set. Query took 0.000 seconds.
   ```



##########
datafusion/sqllogictest/test_files/insert_to_external.slt:
##########
@@ -40,8 +40,73 @@ STORED AS CSV
 WITH HEADER ROW
 LOCATION '../../testing/data/csv/aggregate_test_100.csv'
 
-# test_insert_into
 
+statement ok
+CREATE EXTERNAL TABLE dictionary_encoded_parquet
+STORED AS parquet
+LOCATION '../../parquet-testing/data/alltypes_dictionary.parquet';
+
+query TTT
+describe dictionary_encoded_parquet;
+----
+id Int32 YES
+bool_col Boolean YES
+tinyint_col Int32 YES
+smallint_col Int32 YES
+int_col Int32 YES
+bigint_col Int64 YES
+float_col Float32 YES
+double_col Float64 YES
+date_string_col Binary YES
+string_col Binary YES
+timestamp_col Timestamp(Nanosecond, None) YES
+
+statement ok
+CREATE EXTERNAL TABLE dictionary_encoded_parquet_partitioned(
+  id int,
+  bool_col bool,
+  tinyint_col int,
+  smallint_col int,
+  int_col int,
+  bigint_col bigint,
+  float_col float,
+  double_col double,
+  date_string_col varchar,
+  string_col varchar,
+  timestamp_col varchar
+) 
+STORED AS parquet
+LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned'
+PARTITIONED BY (date_string_col, string_col, timestamp_col)
+OPTIONS(
+create_local_path 'true',
+insert_mode 'append_new_files',
+);
+
+query IBIIIIRRTTT
+insert into dictionary_encoded_parquet_partitioned 
+select 
+  id, 
+  bool_col, 
+  tinyint_col,
+  smallint_col,
+  int_col,
+  bigint_col,
+  float_col,
+  double_col,
+  date_string_col,
+  string_col,
+  timestamp_col::TEXT from dictionary_encoded_parquet;
+----
+2
+
+query IBIIIIRRTTT
+select * from dictionary_encoded_parquet_partitioned order by (id);
+----
+0 true 0 0 0 0 0 0 01%2F01%2F09 0 2009-01-01T00:00:00
+1 false 1 1 1 10 1.1 10.1 01%2F01%2F09 1 2009-01-01T00:01:00

Review Comment:
   This might be an easier way to create a data with dictionary encoded data:
   
   ```
   ❯ create table test as values (arrow_cast('foo', 'Dictionary(Int32, 
Utf8)')), (arrow_cast('bar', 'Dictionary(Int32, Utf8)'));
   0 rows in set. Query took 0.002 seconds.
   
   ❯ select arrow_typeof(column1) from test;
   +----------------------------+
   | arrow_typeof(test.column1) |
   +----------------------------+
   | Dictionary(Int32, Utf8)    |
   | Dictionary(Int32, Utf8)    |
   +----------------------------+
   2 rows in set. Query took 0.001 seconds.
   
   ❯ select * from test;
   +---------+
   | column1 |
   +---------+
   | foo     |
   | bar     |
   +---------+
   2 rows in set. Query took 0.000 seconds.
   ```



##########
datafusion/common/src/dfschema.rs:
##########
@@ -420,6 +420,11 @@ impl DFSchema {
                 Self::datatype_is_semantically_equal(k1.as_ref(), k2.as_ref())
                     && Self::datatype_is_semantically_equal(v1.as_ref(), 
v2.as_ref())
             }
+            // The next two cases allow for the possibility that one schema 
has a dictionary encoded array

Review Comment:
   I am somewhat worried about this change (as it effectively changes the 
meaning of the function to also include logical types).
   
   I think we should either  update the comments to reflect this change, or (my 
preference) make a new function that is explicit. Perhaps something like :
   
   ```rust
      ///  Returns true of two [`DataType`]s have the same logical data type:
      /// Either the same name and type, or if one is dictionary encoded, then 
      /// the same name and type of the values.
      /// E.g. Dictionary(_, Utf8) is semantically equivalent to Utf8 since 
both represent an array of strings
     fn datatype_has_same_logical_type(
           match (dt1, dt2) {
               (DataType::Dictionary(_, v1), othertype) => v1.as_ref() == 
othertype,
               (othertype, DataType::Dictionary(_, v1)) => v1.as_ref() == 
othertype,
                _ => Self::datatype_is_semantically_equal(dt1, dt2)
           }
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Support Partitioning Data by Dictionary Encoded String Array Types [arrow-datafusion]

Reply via email to