alamb commented on code in PR #7287: URL: https://github.com/apache/arrow-datafusion/pull/7287#discussion_r1294997037
########## docs/source/library-user-guide/custom-table-providers.md: ########## @@ -0,0 +1,159 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# Custom Table Provider + +Like other areas of DataFusion, you extend DataFusion's functionality by implementing a trait. The `TableProvider` and associated traits have methods that allow you to implement a custom table provider, i.e. use DataFusion's other functionality with your custom data source. + +This section will also touch on how to have DataFusion use the new `TableProvider` implementation. + +## Table Provider and Scan + +The `scan` method on `TableProvider` is arguably its most important. It returns an execution plan that DataFusion will use as part of the query plan to execute the query. + +### Scan + +As mentioned, `scan` returns an execution plan, in particular a `Result<Arc<dyn ExecutionPlan>>`. The core of this is returning something that can be dynamically dispatched to an `ExecutionPlan`. Following the general DataFusion pattern, we'll need to implement it ourselves. + +#### Execution Plan + +The `ExecutionPlan` trait at its core is a way to get a stream of batches. 
The aptly-named `execute` method returns a `Result<SendableRecordBatchStream>`: a stream of `RecordBatch`es that can be sent across threads and whose schema matches the data contained in those batches. + +Looking at the [example in this repo][ex], the `execute` method: + +```rust +struct CustomExec { + db: CustomDataSource, + projected_schema: SchemaRef, +} + +impl ExecutionPlan for CustomExec { + fn execute( + &self, + _partition: usize, + _context: Arc<TaskContext>, + ) -> Result<SendableRecordBatchStream> { + let users: Vec<User> = { + let db = self.db.inner.lock().unwrap(); + db.data.values().cloned().collect() + }; + + let mut id_array = UInt8Builder::with_capacity(users.len()); + let mut account_array = UInt64Builder::with_capacity(users.len()); + + for user in users { + id_array.append_value(user.id); + account_array.append_value(user.bank_account); + } + + Ok(Box::pin(MemoryStream::try_new( + vec![RecordBatch::try_new( + self.projected_schema.clone(), + vec![ + Arc::new(id_array.finish()), + Arc::new(account_array.finish()), + ], + )?], + self.schema(), + None, + )?)) + } +} +``` + +This: + +1. Gets the users from the database +2. Constructs the individual arrays Review Comment: ```suggestion 2. Constructs the individual output arrays (columns) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
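The quoted doc discusses `scan` but only shows `execute`. For context, a minimal sketch of what the `scan` side might look like for the same types (assuming the `CustomDataSource`/`CustomExec` names from the example above and the `TableProvider` signature of DataFusion releases circa this PR; check the exact signature against the version in use):

```rust
// Hypothetical sketch, not the PR's actual code: a `scan` that applies the
// requested projection and hands back the custom execution plan.
#[async_trait]
impl TableProvider for CustomDataSource {
    async fn scan(
        &self,
        _state: &SessionState,
        projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Narrow the table's full schema down to the requested columns, if any.
        let projected_schema = match projection {
            Some(indices) => Arc::new(self.schema().project(indices)?),
            None => self.schema(),
        };
        Ok(Arc::new(CustomExec {
            db: self.clone(),
            projected_schema,
        }))
    }
}
```

The key design point the doc is driving at: `scan` is where the provider translates planner-level information (projection, filters, limit) into a concrete `ExecutionPlan` whose `execute` then produces the record batches.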
