Anonyfox edited a comment on issue #9420:
URL: https://github.com/apache/arrow/issues/9420#issuecomment-773991855


   Maybe I just took a wrong turn somewhere, so I'll outline what I want to 
achieve with some snippets. This is roughly my basic struct: 
   
   ```
   pub struct Book {
       /// NOT unique, but known to be ascii with fixed length of 10
       pub code: String,
       pub authors: u8,
       pub reviewers: u8,
       pub title: String,
       pub keywords: Vec<String>,
       pub price: f32,
       pub is_english: bool,
       // ... more fields of mostly Strings
   }
   ```
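Since `code` is known to be ASCII with a fixed length of 10, one possible way to cut RAM for that field is to store it inline instead of as a `String` (which costs three pointer-sized words plus a heap allocation per value). The `Code` type below is hypothetical and not part of my actual struct; it is only a minimal sketch of the idea:

```rust
use std::convert::TryInto;

/// Hypothetical compact replacement for `Book::code`:
/// exactly 10 ASCII bytes stored inline, no heap allocation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Code([u8; 10]);

impl Code {
    /// Returns `None` unless `s` is pure ASCII and exactly 10 bytes long.
    pub fn parse(s: &str) -> Option<Code> {
        if !s.is_ascii() {
            return None;
        }
        // Fails (and yields `None`) when the slice length is not 10.
        let bytes: [u8; 10] = s.as_bytes().try_into().ok()?;
        Some(Code(bytes))
    }

    /// Borrows the code as a string slice.
    pub fn as_str(&self) -> &str {
        // `parse` only accepts ASCII, and ASCII is always valid UTF-8.
        std::str::from_utf8(&self.0).expect("ASCII is valid UTF-8")
    }
}
```

`Code::parse("ABCDE12345")` succeeds, while a string of any other length or with non-ASCII characters is rejected.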
   
   And I have a few million of them, as `Vec<Book>`. To illustrate the use 
case a bit more, I have a Library like this: 
   
   ```
   pub struct Library {
       // will be updated when purchasing/... happens
       local_books: Vec<Book>,
       // will be updated periodically from a remote source
       external_books: Vec<Book>,
   }
   ```
   
   In my application there is a lazy static `RwLock<Library>`, which 
receives many read queries per second (_"how many books written by 3 authors 
do you have?"_). For each query I have to look in `local_books` and, if nothing 
is found there, perform the same lookup in `external_books`, returning the 
final result. 
   
   Works okayish, but consumes quite a bit of RAM and I have to build "queries" 
manually using chains of `iter()`/`filter()`. 
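For reference, the current approach looks roughly like the sketch below. `Book` is trimmed to the single field this query needs, and `Library::new`, `count_by_authors`, and `query` are names made up for the illustration. I am also assuming that "nothing is found" means a local count of zero.

```rust
use std::sync::RwLock;

// Trimmed-down stand-in for `Book`; only the queried field is kept.
pub struct Book {
    pub authors: u8,
}

pub struct Library {
    local_books: Vec<Book>,
    external_books: Vec<Book>,
}

// One hand-written "query": an `iter()`/`filter()` chain over a slice.
fn count_in(books: &[Book], authors: u8) -> usize {
    books.iter().filter(|b| b.authors == authors).count()
}

impl Library {
    /// Hypothetical constructor for this sketch.
    pub fn new(local_books: Vec<Book>, external_books: Vec<Book>) -> Self {
        Library { local_books, external_books }
    }

    /// Counts books with the given number of authors in `local_books`,
    /// falling back to `external_books` only when the local count is zero.
    pub fn count_by_authors(&self, authors: u8) -> usize {
        match count_in(&self.local_books, authors) {
            0 => count_in(&self.external_books, authors),
            local => local,
        }
    }
}

/// Holds the read lock for the duration of one query.
pub fn query(library: &RwLock<Library>, authors: u8) -> usize {
    library
        .read()
        .expect("lock poisoned")
        .count_by_authors(authors)
}
```

Every new kind of question needs another hand-written chain like `count_in`, which is the part I would like to replace with SQL-ish queries.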
   
   My thought now was that I can leverage Arrow and its friends to make this 
thing more efficient, or at least no slower, and have dynamic SQL-ish queries 
for easier development. 
   
   Now after playing around even more, especially with datafusion, I got a 
working state of this (leveraging RecordBatches as you hinted): 
   
   ```
   pub struct Library {
       local_books: MemTable,
       external_books: MemTable,
   }
   ```
   
   where the table schema looks like this for both MemTables: 
   
   ```
   vec![
       // fixed size binary seems to not work yet in datafusion
       Field::new("code", DataType::Utf8, false),
       // u8 -> i8 cast is safe in my case
       Field::new("authors", DataType::Int8, false),
       // u8 -> i8 cast is safe in my case
       Field::new("reviewers", DataType::Int8, false),
       Field::new("title", DataType::Utf8, false),
       // joined the Vec<String> with `','`, couldn't get ListTable to work
       Field::new("keywords", DataType::Utf8, true),
       // ...
   ]
   ```
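The two conversions noted in the schema comments could be sketched like this. All three helper names are hypothetical, and mapping an empty keyword list to `None` (to match the nullable column) is an assumption for the sketch, not necessarily what the real code does:

```rust
use std::convert::TryFrom;

/// Checked `u8` -> `i8` conversion: `None` for values above `i8::MAX` (127),
/// instead of silently wrapping like `value as i8` would.
pub fn count_to_i8(value: u8) -> Option<i8> {
    i8::try_from(value).ok()
}

/// Joins the keywords into one comma-separated string.
/// Assumption: an empty list is stored as NULL, hence `None`.
pub fn join_keywords(keywords: &[String]) -> Option<String> {
    if keywords.is_empty() {
        None
    } else {
        Some(keywords.join(","))
    }
}

/// Reverses `join_keywords` for a non-null value.
pub fn split_keywords(joined: &str) -> Vec<String> {
    joined.split(',').map(str::to_owned).collect()
}
```

The join is only reversible as long as no keyword contains a comma itself, which is one reason a proper list column would be preferable.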
   
   Now the hard wall I faced: actually querying the table! I tried something 
like this: 
   
   ```
   impl Library {
       pub async fn count(&self) -> Result<usize> {
           let mut ctx = ExecutionContext::new();
           // doesn't work: this attempts to move the table out of `&self`
           ctx.register_table("local_books", Box::new(self.local_books));
           let sql = "SELECT COUNT(*) FROM local_books";
           let results = ctx.sql(sql)?.collect().await?;
           // ... returning the count
           todo!()
       }
   }
   ```
   
   - the `ExecutionContext` needs to be mutable, so I must create a new one 
for each call (or at least will have to once the calls are dynamically 
parameterized)
   - `register_table()` only accepts `dyn TableProvider`, not a reference or an 
`Arc` or anything else that would not need deep cloning
   - my tables are quite large, so I really want to reuse them once generated, 
and they might be replaced (outer RwLock) while the application is running
   - caching things on disk is trivial with serde for `Vec<Book>` but requires 
much disk space, and while _writing_ is allowed to take a few seconds without 
issue, _reading_ the dump must be really fast
   - I found literally zero hints in the docs/examples on how to 
serialize/deserialize the MemTable to/from Parquet
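
What I am after with the "reuse without deep cloning" and "replace while running" points is roughly the plain-std pattern below. `Table` and `SharedTable` are hypothetical placeholders, and the sketch says nothing about what `register_table()` accepts; it only shows the sharing semantics I would like to keep:

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical placeholder for a large, expensive-to-build table.
pub struct Table {
    pub rows: Vec<u8>,
}

/// Holds the current table behind an outer lock so it can be swapped.
pub struct SharedTable {
    current: RwLock<Arc<Table>>,
}

impl SharedTable {
    pub fn new(table: Table) -> Self {
        SharedTable { current: RwLock::new(Arc::new(table)) }
    }

    /// Cheap for readers: bumps the reference count, copies no rows.
    /// The read lock is only held while the pointer is cloned.
    pub fn snapshot(&self) -> Arc<Table> {
        let guard = self.current.read().expect("lock poisoned");
        Arc::clone(&*guard)
    }

    /// Swaps in a freshly built table. Snapshots taken earlier keep the
    /// old table alive until the last of them is dropped.
    pub fn replace(&self, table: Table) {
        *self.current.write().expect("lock poisoned") = Arc::new(table);
    }
}
```

A query would call `snapshot()` once and work on that `Arc<Table>`, so a periodic `replace()` from the remote source never blocks on long-running reads.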

