xloya commented on issue #5226: URL: https://github.com/apache/gravitino/issues/5226#issuecomment-2434980084
> > If users want to read a specific file format that is not officially supported by Spark under Fileset through SQL, there is currently no way.
> >
> > If the file format does not support schema inference, additional schema metadata is required.
>
> Based on the description, is this more of a file-format problem than a fileset problem? The current fileset addresses the file path and read/write problem, but it doesn't know how to interpret the IO stream, which is why we need different "formats" to interpret the IO stream.
>
> The problem here is more like adding custom "format" support, am I right? If so, what I'm thinking is:
>
> 1. Spark can support custom file formats beyond the built-in ones using DSv2 (for example, sequence files), so we can add customized DSv2 implementations to support different formats.
> 2. Some file formats are not self-describing, so where to define and store the schema is a question.
> 3. A fileset is more like a "container": it may hold files with different formats, so if we want to store a "schema" in a fileset and use it, we have to make sure that all files follow the same schema/format, otherwise reads will fail.
>
> So I was thinking of building an abstraction on top of fileset, like @iodone mentioned: a `Dataset` with "schemas" that engines can directly read/write by explicitly/implicitly inferring the "schema". This seems to make more sense than directly using a "fileset" with a schema enforced.

Using `Dataset` to unify the concept sounds like a good idea. If I understand correctly, a `Dataset` here can be bound to a specific schema / file format / serialization info, so we can implement different forms of the API on top of this concept to read the resource types it supports. The API implementation of `Dataset` then depends entirely on the specific resource type and engine interface. For example, if we support schema-on-read for Fileset in Spark, we can implement a file data source based on `Dataset`; if we support schema-on-read for Table in Spark, we can implement a table data source based on `Dataset`. I don't know if I understand this correctly.
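To make the schema-on-read idea concrete, here is a minimal, hypothetical sketch of what a DSv2 data source for such a `Dataset` could look like in Spark. Only the Spark DSv2 interfaces (`TableProvider`, `SupportsRead`, `TableCapability`) are real; the class names `FilesetDatasetProvider`/`FilesetDatasetTable` and the idea of fetching a schema from Dataset metadata are assumptions for illustration, not existing Gravitino APIs. The point is where an externally supplied schema plugs in when the file format cannot describe itself:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical DSv2 provider for a fileset-backed "Dataset": the schema is not
// inferred from the files (the format is not self-describing) but supplied
// externally -- either by the user via .schema(...), or, in a fuller version,
// fetched from the Dataset's metadata in Gravitino.
class FilesetDatasetProvider extends TableProvider {

  // Declaring external-metadata support lets Spark pass a user-provided
  // schema straight into getTable() instead of calling inferSchema().
  override def supportsExternalMetadata(): Boolean = true

  // Called only when no schema was supplied; a real implementation would
  // look the schema up from the Dataset metadata here.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException(
      "Format is not self-describing; supply a schema or store one in the Dataset metadata")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    new FilesetDatasetTable(schema, properties)
}

class FilesetDatasetTable(
    datasetSchema: StructType,
    properties: util.Map[String, String]) extends Table with SupportsRead {

  override def name(): String = properties.getOrDefault("path", "fileset-dataset")

  // The schema Spark plans queries against: whatever the Dataset is bound to.
  override def schema(): StructType = datasetSchema

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  // The ScanBuilder / PartitionReader pair is where the custom IO-stream
  // decoding (e.g. SequenceFile records -> InternalRow) would live; omitted here.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    throw new UnsupportedOperationException("Scan not implemented in this sketch")
}
```

A user-facing read could then look like the following (the provider class and the `gvfs://` fileset path are illustrative):

```scala
val df = spark.read
  .format("com.example.FilesetDatasetProvider")  // hypothetical provider class
  .schema("id INT, payload STRING")              // schema-on-read, supplied explicitly
  .load("gvfs://fileset/my_catalog/my_schema/my_fileset")
```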
