[ https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Lamb reassigned ARROW-10159:
-----------------------------------

    Assignee: Andrew Lamb

> [Rust][DataFusion] Add support for Dictionary types in data fusion
> ------------------------------------------------------------------
>
>                 Key: ARROW-10159
>                 URL: https://issues.apache.org/jira/browse/ARROW-10159
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Andrew Lamb
>            Assignee: Andrew Lamb
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have a system that needs to process low-cardinality string data (that is,
> there are only a few distinct values, but many millions of rows).
> Using a `StringArray` is very expensive because the same string value is copied
> over and over again. The `DictionaryArray` was designed for exactly this
> situation: rather than repeating each string, it stores integer indexes into a
> dictionary, so only the small integer keys are repeated.
> Sadly, DataFusion does not support processing `DictionaryArray` types, for
> several reasons.
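
To make the encoding described above concrete, here is a minimal standard-library sketch of a dictionary-encoded string column. It does not use the Arrow crate: `DictColumn`, `encode`, and `decode` are hypothetical names chosen for illustration, and Arrow's actual `DictionaryArray` layout differs in detail (for example, in how nulls are tracked).

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded string column (hypothetical, not the Arrow API).
#[derive(Debug)]
pub struct DictColumn {
    /// Each distinct string is stored exactly once.
    pub values: Vec<String>,
    /// One small integer key per row; `None` represents NULL.
    pub keys: Vec<Option<i32>>,
}

/// Build the dictionary and the per-row keys from plain string rows.
pub fn encode(rows: &[Option<&str>]) -> DictColumn {
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, i32> = HashMap::new();
    let mut keys: Vec<Option<i32>> = Vec::with_capacity(rows.len());

    for &row in rows {
        let key = row.map(|s| {
            // Reuse the existing key, or append a new dictionary entry.
            *index.entry(s.to_string()).or_insert_with(|| {
                values.push(s.to_string());
                (values.len() - 1) as i32
            })
        });
        keys.push(key);
    }

    DictColumn { values, keys }
}

/// Expand the keys back into one string reference per row.
pub fn decode(col: &DictColumn) -> Vec<Option<&str>> {
    col.keys
        .iter()
        .map(|key| key.map(|i| col.values[i as usize].as_str()))
        .collect()
}
```

With this sketch, a column holding millions of rows but only a handful of distinct strings stores each string once in `values` and repeats only 4-byte keys in `keys`, which is the saving the description refers to.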
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I
> would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
>     // ensure that DataFusion can operate on dictionary types
>     // Use StringDictionary (32 bit indexes = keys)
>     let field_type = DataType::Dictionary(
>         Box::new(DataType::Int32),
>         Box::new(DataType::Utf8),
>     );
>     let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));
>
>     let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
>     let values_builder = StringBuilder::new(10);
>     let mut builder = StringDictionaryBuilder::new(
>         keys_builder, values_builder
>     );
>     builder.append("one")?;
>     builder.append_null()?;
>     builder.append("three")?;
>     let array = Arc::new(builder.finish());
>
>     let data = RecordBatch::try_new(
>         schema.clone(),
>         vec![array],
>     )?;
>
>     let table = MemTable::new(schema, vec![vec![data]])?;
>     let mut ctx = ExecutionContext::new();
>     ctx.register_table("test", Box::new(table));
>
>     // Basic SELECT
>     let sql = "SELECT * FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\nNULL\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>
>     // basic filtering
>     let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>
>     // filtering with constant
>     let sql = "SELECT * FROM test WHERE d1 = 'three'";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"three\"".to_string();
>     assert_eq!(expected, actual);
>
>     // Expression evaluation
>     let sql = "SELECT concat(d1, '-foo') FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
>     assert_eq!(expected, actual);
>
>     // aggregation
>     let sql = "SELECT COUNT(d1) FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "2".to_string();
>     assert_eq!(expected, actual);
>
>     Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I
> will break the work down into several smaller subtasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)