[GitHub] [arrow-datafusion] Cheappie commented on issue #4533: FileStream requires fake ObjectStore when ParquetFileReaderFactory is used

GitBox Tue, 13 Dec 2022 04:37:24 -0800


Cheappie commented on issue #4533:
URL: https://github.com/apache/arrow-datafusion/issues/4533#issuecomment-1348463841


   > Would it work to just inline the `ObjectStore` and not the `FileMeta`?
   
   Unfortunately not: `FileOpener` requires an `ObjectStore` as an argument to 
its `open` fn.
   
   There is yet another way, a bit more intrusive but possibly worthwhile in 
the long term. DataFusion could have its own trait for `ObjectStore` read (get) 
operations that exposes a low-level interface and simplifies integration. I had 
to rewrite a few utility fns that work only with `ObjectStore`. However, such 
an interface should allow us to wrap a high-level `ObjectStore` and proxy get 
operations.
   
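   ```
   use std::sync::Arc;
   use object_store::ObjectStore;

   // There is surely a better name for this trait,
   // e.g. `AsyncFileReaderFactory` :)
   trait DFGetObjectStore {
       // ...get operations
   }

   // Wraps a high-level `ObjectStore` (a trait, hence the `Arc<dyn ...>`).
   struct DFGetObjectStoreProxyImpl(Arc<dyn ObjectStore>);

   impl DFGetObjectStore for DFGetObjectStoreProxyImpl {
       // ...proxy get operations
   }
   ```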
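
   To make the shape of that proposal concrete, here is a minimal, 
standard-library-only sketch. Everything in it is hypothetical: `HighLevelStore` 
is only a stand-in for `ObjectStore` (it is not the `object_store` API), 
`GetStore` and `GetStoreProxy` are illustrative names, and the methods are 
synchronous for brevity whereas a real version would presumably be async.

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::sync::Arc;

/// Stand-in for a high-level store (hypothetical; NOT the `object_store` API).
trait HighLevelStore {
    fn get(&self, path: &str) -> Option<Vec<u8>>;
}

/// Narrow, read-only interface the engine would depend on (hypothetical name).
trait GetStore {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>>;
    fn size(&self, path: &str) -> Option<usize>;
}

/// Proxy that adapts any high-level store to the narrow interface.
struct GetStoreProxy(Arc<dyn HighLevelStore>);

impl GetStore for GetStoreProxy {
    fn get_range(&self, path: &str, range: Range<usize>) -> Option<Vec<u8>> {
        let bytes = self.0.get(path)?;
        // `slice::get` returns `None` when the range is out of bounds.
        bytes.get(range).map(|s| s.to_vec())
    }

    fn size(&self, path: &str) -> Option<usize> {
        self.0.get(path).map(|b| b.len())
    }
}

/// In-memory store used only to exercise the proxy in this example.
struct InMemory(HashMap<String, Vec<u8>>);

impl HighLevelStore for InMemory {
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.0.get(path).cloned()
    }
}
```

   With this split, code that only reads depends on `GetStore`, and a custom 
reader can implement that trait directly without providing a fake full store.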
   
   The reason I prefer a factory of `FileReader` over an `ObjectStore` whose 
get operations produce bytes is that I can hold the file handle for as long as 
the file is being read. Otherwise, in my case, a new version of the file might 
arrive, and the parquet metadata that was read a moment ago would no longer be 
valid.
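
   The race described here can be illustrated with a toy, in-memory model. 
This is only a sketch under stated assumptions: `VersionedStore` and 
`PinnedReader` are hypothetical names, an `Arc` snapshot plays the role of an 
open file handle, and nothing here is DataFusion or `object_store` API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Toy store in which a path can be replaced by a new version at any time.
#[derive(Default)]
struct VersionedStore {
    files: RwLock<HashMap<String, Arc<Vec<u8>>>>,
}

impl VersionedStore {
    fn put(&self, path: &str, bytes: Vec<u8>) {
        self.files
            .write()
            .unwrap()
            .insert(path.to_string(), Arc::new(bytes));
    }

    /// Path-based get: every call resolves the path again, so two calls
    /// may observe different versions of the file.
    fn get(&self, path: &str) -> Option<Arc<Vec<u8>>> {
        self.files.read().unwrap().get(path).cloned()
    }

    /// Factory-style open: resolve the path once and pin that version.
    fn open(&self, path: &str) -> Option<PinnedReader> {
        self.get(path).map(|data| PinnedReader { data })
    }
}

/// Plays the role of an open file handle: all reads see the same version.
struct PinnedReader {
    data: Arc<Vec<u8>>,
}

impl PinnedReader {
    fn len(&self) -> usize {
        self.data.len()
    }

    /// Returns `None` if the requested span is out of bounds.
    fn read(&self, offset: usize, len: usize) -> Option<&[u8]> {
        self.data.get(offset..offset.checked_add(len)?)
    }
}
```

   A reader obtained from `open` keeps returning bytes from the version it 
pinned even after `put` replaces the path, whereas a length taken from one 
`get` call may no longer match the bytes returned by the next.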
   
   Could you tell me what the best practices are for interacting with 
`SessionContext`? Should it be created per query? I ask because there is no 
per-request (query) id, so from the `TableProvider` scan operation I cannot 
tell which request a query belongs to. There is a session_id, however, so it 
should be possible if I created a new session for each query, but that 
requires cloning `SessionState`.
   
   What do you think about moving schema inference into scan and removing it 
from the `TableProvider` trait?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
