Cheappie commented on issue #4533: URL: https://github.com/apache/arrow-datafusion/issues/4533#issuecomment-1348552107
> But we could change that? All the FileOpener are constructed in a context that could resolve the object store if it wanted to. All that would change is the logic would move out of FileStream::new.

It should be possible if we made `FileOpener` accept `open(..., ctx: Arc<TaskContext>, object_store_url: ..., file_meta: FileMeta)`. Then each FileOpener could fetch the ObjectStore on its own (a rough sketch of that signature is at the end of this comment).

> > DataFusion could have its own trait for ObjectStore read (get) operations, one that exposes a low-level interface and simplifies integration.
>
> No objection in principle, but I'm sceptical that introducing more indirection is either necessary or desirable. We already have far more factories, providers, etc. than I think is strictly necessary and it makes reasoning about the code incredibly hard.

That route should hopefully unify `ParquetFileReaderFactory` and `ObjectStore` while enabling both low-level and high-level setups. I mean that all usages of both would be superseded by an `AsyncFileReaderFactory` in DataFusion's internals, but on the surface it would still be possible to pass either an `AsyncFileReaderFactory` or an `ObjectStore`, with the latter wrapped by some proxy impl of `AsyncFileReaderFactory` (see the second sketch at the end of this comment). The only place where `ObjectStore` would still be necessary is for spills (write ops), am I right?

> > Could you tell me what are the best practices for interacting with SessionContext?
>
> I don't honestly know, I believe @alamb is currently working on making the state/config slightly less impenetrable.

Is there an open issue/PR for that work? Maybe I could suggest adding a unique id to identify requests in TableProvider scan operations.

> > What do you think about moving schema inference into scan and removing it from the TableProvider trait?
>
> I don't think this is possible, as planning needs to know the schema. In general though, performing schema inference per query is very expensive, especially for non-parquet data. I strongly recommend investing in some sort of catalog to store this data.

I thought that maybe the schema could be kept (cached) in TableProvider implementations but exposed only through the scan operation. For example, FileScanConfig holds a reference to the schema.
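For reference, a minimal sketch of the `FileOpener::open` shape described above. Every type here other than the standard library ones is a simplified stand-in (not the real DataFusion `TaskContext`, `ObjectStoreUrl`, `FileMeta`, or `FileOpenFuture`), so this only illustrates the proposed signature, not an actual implementation:

```rust
use std::sync::Arc;

// Simplified stand-ins for the DataFusion types mentioned above; the real
// definitions live in datafusion / object_store and look different.
struct TaskContext;
struct ObjectStoreUrl(String);
struct FileMeta;
struct FileOpenFuture; // in DataFusion this would resolve to a record batch stream

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Proposed shape: `open` receives the task context and the object store URL,
/// so each opener can resolve its own `ObjectStore` instead of relying on
/// `FileStream::new` to do the lookup up front.
trait FileOpener: Send + Sync {
    fn open(
        &self,
        ctx: Arc<TaskContext>,
        object_store_url: ObjectStoreUrl,
        file_meta: FileMeta,
    ) -> Result<FileOpenFuture>;
}
```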

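And a rough sketch of what the hypothetical `AsyncFileReaderFactory` could look like, with a proxy impl that wraps an `ObjectStore` so the high-level setup keeps working. `AsyncFileReaderFactory`, `AsyncFileReader`, `create_reader`, and `ObjectStoreReaderFactory` are made-up names for illustration; only `ObjectStore`, `Path`, and `get_range` come from the `object_store` crate (this assumes the `object_store`, `bytes`, and `async-trait` crates):

```rust
use std::sync::Arc;

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Low-level read interface handed to scans: just ranged gets, nothing else.
#[async_trait::async_trait]
trait AsyncFileReader: Send + Sync {
    async fn get_range(&self, range: std::ops::Range<usize>) -> Result<Bytes>;
}

/// Hypothetical factory DataFusion internals would depend on for reads,
/// in place of direct `ObjectStore` lookups.
trait AsyncFileReaderFactory: Send + Sync {
    fn create_reader(&self, location: &Path) -> Result<Box<dyn AsyncFileReader>>;
}

/// Proxy impl for the high-level setup: users who only have an `ObjectStore`
/// wrap it, while low-level users implement the factory directly.
struct ObjectStoreReaderFactory {
    store: Arc<dyn ObjectStore>,
}

struct ObjectStoreReader {
    store: Arc<dyn ObjectStore>,
    location: Path,
}

#[async_trait::async_trait]
impl AsyncFileReader for ObjectStoreReader {
    async fn get_range(&self, range: std::ops::Range<usize>) -> Result<Bytes> {
        // Delegates to the real `ObjectStore::get_range`.
        Ok(self.store.get_range(&self.location, range).await?)
    }
}

impl AsyncFileReaderFactory for ObjectStoreReaderFactory {
    fn create_reader(&self, location: &Path) -> Result<Box<dyn AsyncFileReader>> {
        Ok(Box::new(ObjectStoreReader {
            store: Arc::clone(&self.store),
            location: location.clone(),
        }))
    }
}
```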