Anonyfox edited a comment on issue #9420: URL: https://github.com/apache/arrow/issues/9420#issuecomment-773991855
Maybe I just took a wrong turn somewhere, so I'll outline what I want to achieve with some snippets. This is roughly my basic struct:

```
pub struct Book {
    /// NOT unique, but known to be ASCII with a fixed length of 10
    pub code: String,
    pub authors: u8,
    pub reviewers: u8,
    pub title: String,
    pub keywords: Vec<String>,
    pub price: f32,
    pub is_english: bool,
    // ... more fields of mostly Strings
}
```

I have a few million of them, as a `Vec<Book>`. To illustrate the use case a bit more, I have a Library like this:

```
pub struct Library {
    // will be updated when purchasing/... happens
    local_books: Vec<Book>,
    // will be updated periodically from a remote source
    external_books: Vec<Book>,
}
```

In my application there is a lazy static `RwLock<Library>` which receives many read queries per second (_"how many books written by 3 authors do you have?"_). For each query I have to look in `local_books` and, if nothing is found there, perform the same lookup in `external_books`, returning the final result. This works okayish, but it consumes quite a bit of RAM and I have to build "queries" manually using chains of `iter()`/`filter()`.

My thought was that I could leverage Arrow and its friends to make this more efficient, or at least not slower, and get dynamic SQL-ish queries for easier development.
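The manual lookup described above can be sketched roughly as follows. This is a minimal illustration with a trimmed-down `Book`; the helpers `count_by_authors` and `book` are hypothetical names made up for the example, and the real code may well differ:

```rust
use std::sync::RwLock;

// Trimmed-down version of the `Book` struct; only the fields needed here.
pub struct Book {
    pub code: String,
    pub authors: u8,
}

pub struct Library {
    local_books: Vec<Book>,
    external_books: Vec<Book>,
}

impl Library {
    // Hypothetical helper: count books with `n` authors, preferring local data
    // and falling back to the external list only when nothing matches locally.
    pub fn count_by_authors(&self, n: u8) -> usize {
        let local = self.local_books.iter().filter(|b| b.authors == n).count();
        if local > 0 {
            return local;
        }
        self.external_books.iter().filter(|b| b.authors == n).count()
    }
}

// Hypothetical constructor helper to keep the example short.
fn book(code: &str, authors: u8) -> Book {
    Book { code: code.to_string(), authors }
}

fn main() {
    // The real application keeps this in a lazy static; a local is enough here.
    let library = RwLock::new(Library {
        local_books: vec![book("AAAAAAAAAA", 1), book("BBBBBBBBBB", 3)],
        external_books: vec![book("CCCCCCCCCC", 2), book("DDDDDDDDDD", 2)],
    });

    let lib = library.read().unwrap();
    println!("{}", lib.count_by_authors(3)); // match found in local_books
    println!("{}", lib.count_by_authors(2)); // no local match, falls back to external_books
}
```

Every new kind of question needs another hand-written method like this one, which is the part a SQL-ish query layer would replace.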
After playing around some more, especially with datafusion, I got this into a working state (using RecordBatches as you hinted):

```
pub struct Library {
    local_books: MemTable,
    external_books: MemTable,
}
```

where the table schema looks like this for both MemTables:

```
vec![
    Field::new("code", DataType::Utf8, false), // fixed size binary seems to not work yet in datafusion
    Field::new("authors", DataType::Int8, false), // u8 -> i8 cast is safe in my case
    Field::new("reviewers", DataType::Int8, false), // u8 -> i8 cast is safe in my case
    Field::new("title", DataType::Utf8, false),
    Field::new("keywords", DataType::Utf8, true), // joined the Vec<String> with `','`, couldn't get ListTable to work
    // ...
]
```

Now the hard wall I hit: actually querying the table! I tried something like this:

```
impl Library {
    pub async fn count(&self) -> Result<usize> {
        let mut ctx = ExecutionContext::new();
        ctx.register_table("local_books", Box::new(self.local_books)); // doesn't work, attempting move
        let sql = "SELECT COUNT(*) FROM local_books";
        let results = ctx.sql(&sql)?.collect().await?;
        // ... returning the count
        todo!()
    }
}
```

The problems I ran into:

- the `ExecutionContext` needs to be mutable, so I must create a new one for each call (or at least will have to once I have dynamically parameterized calls)
- `register_table()` only accepts an owned `dyn TableProvider`, not a reference or an `Arc` or anything else that would avoid deep cloning
- my tables are quite large, so I really want to reuse them once generated, and they might be replaced (outer RwLock) while the application is running
- caching things on disk is trivial with serde for `Vec<Book>` but requires a lot of disk space, and while _writing_ is allowed to take a few seconds, _reading_ the dump must be really fast
- I found literally zero hints in the docs/examples on how to serialize/deserialize the MemTable to/from parquet
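The sharing requirement in the list above (reuse a large table across queries without deep cloning, while still allowing it to be replaced at runtime) can be illustrated without Arrow at all. A minimal sketch, assuming a placeholder `Table` type standing in for a `MemTable`; `Shared`, `snapshot` and `replace` are hypothetical names, and nothing here reflects what datafusion's API actually accepts:

```rust
use std::sync::{Arc, RwLock};

// Placeholder for a large, expensive-to-build table (hypothetical type).
pub struct Table {
    pub rows: Vec<u8>,
}

// Holds the current table behind an Arc so readers can share it cheaply.
pub struct Shared {
    current: RwLock<Arc<Table>>,
}

impl Shared {
    pub fn new(table: Table) -> Self {
        Shared { current: RwLock::new(Arc::new(table)) }
    }

    // Cloning the Arc only bumps a reference count; the rows are not copied.
    pub fn snapshot(&self) -> Arc<Table> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    // Swap in a freshly built table; existing snapshots keep the old one alive.
    pub fn replace(&self, table: Table) {
        *self.current.write().unwrap() = Arc::new(table);
    }
}

fn main() {
    let shared = Shared::new(Table { rows: vec![1, 2, 3] });
    let before = shared.snapshot();
    shared.replace(Table { rows: vec![4, 5] });
    let after = shared.snapshot();
    println!("{} {}", before.rows.len(), after.rows.len()); // prints "3 2"
}
```

This is the ownership model I would like to be able to hand to a query context: each query takes a cheap snapshot, and the periodic refresh swaps the table without invalidating queries already in flight.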