jerryshao commented on issue #5226:
URL: https://github.com/apache/gravitino/issues/5226#issuecomment-2434854889

   > If users want to read a specific file format that is not officially supported by Spark under Fileset through SQL, there is currently no way. If the file format does not support schema inference, additional schema metadata is required.
   
   Based on the description, this sounds more like a file-format problem than a fileset problem. The current fileset abstraction addresses file paths and read/write access, but it does not know how to interpret the IO stream; that is why we need different "formats" to interpret the stream.
   
   The problem here is more like adding custom "format" support, am I right? If so, what I'm thinking is:
   
   1. Spark can support custom file formats beyond the built-in ones through DataSource V2 (DSv2), for example sequence files, so we can add customized DSv2 implementations to support different formats.
   2. Some file formats are not self-describing, so where to define and store the schema is an open question.
   3. A fileset is more like a "container": one fileset may hold files of different formats. If we want to store a "schema" in the fileset and use it, we have to ensure that all files follow the same schema/format; otherwise reads will fail.
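To make point 2 concrete, here is a minimal Python sketch of one option: keeping the schema as sidecar metadata next to the data files and using it to parse a format that cannot infer its own schema. All names here are hypothetical for illustration, not Gravitino or Spark APIs:

```python
import csv
import io
import json

# Hypothetical sidecar schema stored alongside the data files,
# since the file format itself carries no schema information.
SCHEMA_JSON = json.dumps({
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ]
})

# Map declared type names to Python casts.
CASTS = {"int": int, "string": str}

def read_with_schema(raw_text, schema_json):
    """Parse a headerless delimited file using an external schema."""
    schema = json.loads(schema_json)["fields"]
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({
            f["name"]: CASTS[f["type"]](value)
            for f, value in zip(schema, record)
        })
    return rows

print(read_with_schema("1,alice\n2,bob\n", SCHEMA_JSON))
# → [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]
```

The same sidecar idea only works if every file in the container actually conforms to the stored schema, which is exactly the constraint point 3 raises.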
   
   So I was thinking of building an abstraction on top of fileset, like @iodone mentioned: a `Dataset` with a "schema", which engines can read/write directly by explicitly or implicitly inferring the schema. That seems to make more sense than directly using a "fileset" with an enforced schema.
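As a rough illustration of that layering, here is a minimal Python sketch (hypothetical names, not the actual Gravitino API) of a `Dataset` wrapping a schema-less fileset container and enforcing a declared schema on records:

```python
from dataclasses import dataclass

@dataclass
class Fileset:
    # A fileset is just a "container": a location plus files, no schema.
    location: str
    files: list

@dataclass
class Dataset:
    # Hypothetical abstraction on top of a fileset: the same files,
    # plus an explicit format and schema that engines can rely on.
    fileset: Fileset
    fmt: str
    schema: dict  # field name -> Python type

    def validate(self, record: dict) -> dict:
        # Enforce the declared schema before handing records to an engine.
        missing = set(self.schema) - set(record)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return {name: typ(record[name]) for name, typ in self.schema.items()}

fs = Fileset("fileset://catalog/db/raw_events", ["part-0.seq"])
ds = Dataset(fs, fmt="sequence", schema={"id": int, "name": str})
print(ds.validate({"id": "7", "name": "alice"}))
# → {'id': 7, 'name': 'alice'}
```

The design point is that the schema lives on the `Dataset`, not on the fileset itself, so the fileset stays a plain container and only datasets carry the "all files share one schema" guarantee.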


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
