xloya commented on issue #5226: URL: https://github.com/apache/gravitino/issues/5226#issuecomment-2434980084
> > If users want to read a specific file format that is not officially supported by Spark under Fileset through SQL, there is currently no way.
> >
> > If the file format does not support schema inference, additional schema metadata is required.
>
> Based on the description, is this more of a file-format problem than a fileset problem? The current fileset addresses the file path and read/write problem, but it doesn't know how to interpret the IO stream, which is why we need different "formats" to interpret the IO stream.
>
> The problem here is more like adding custom "format" support, am I right? If so, what I'm thinking is:
>
> 1. Spark can support custom file formats beyond the built-in ones using DSv2 (for example, sequence files), so we can add customized DSv2 implementations to support different formats.
> 2. Some file formats are not self-describing, so where to define and store the schema is a question.
> 3. A fileset is more like a "container": it may hold files with different formats, so if we want to store a "schema" in a fileset and use it, we have to make sure that all files follow the same schema/format, otherwise reads will fail.
>
> So I was thinking of building an abstraction on top of fileset, like @iodone mentioned: a `Dataset` with "schemas" that engines can directly read/write by explicitly/implicitly inferring the "schema". This seems to make more sense than directly using a "fileset" with a schema enforced.

Using `Dataset` to unify the concept sounds like a good idea. If I understand correctly, a `Dataset` here can be bound to a specific schema / file format / serialization info, so we can implement different forms of the API on top of this concept to read the resource types it supports. The API implementation of `Dataset` then depends entirely on the specific resource type and engine interface. For example, if we support schema-on-read for Fileset in Spark, we can implement a file data source based on `Dataset`; if we support schema-on-read for Table in Spark, we can implement a table data source based on `Dataset`. I don't know if I understand this correctly.
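To make the schema-on-read idea concrete, here is a minimal, hypothetical sketch of what a DSv2 data source for such a `Dataset` could look like in Spark. Only the Spark DSv2 interfaces (`TableProvider`, `SupportsRead`, `TableCapability`) are real; the class names `FilesetDatasetProvider`/`FilesetDatasetTable` and the idea of fetching a schema from Dataset metadata are assumptions for illustration, not existing Gravitino APIs. The point is where an externally supplied schema plugs in when the file format cannot describe itself:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical DSv2 provider for a fileset-backed "Dataset": the schema is not
// inferred from the files (the format is not self-describing) but supplied
// externally -- either by the user via .schema(...), or, in a fuller version,
// fetched from the Dataset's metadata in Gravitino.
class FilesetDatasetProvider extends TableProvider {

  // Declaring external-metadata support lets Spark pass a user-provided
  // schema straight into getTable() instead of calling inferSchema().
  override def supportsExternalMetadata(): Boolean = true

  // Called only when no schema was supplied; a real implementation would
  // look the schema up from the Dataset metadata here.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException(
      "Format is not self-describing; supply a schema or store one in the Dataset metadata")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    new FilesetDatasetTable(schema, properties)
}

class FilesetDatasetTable(
    datasetSchema: StructType,
    properties: util.Map[String, String]) extends Table with SupportsRead {

  override def name(): String = properties.getOrDefault("path", "fileset-dataset")

  // The schema Spark plans queries against: whatever the Dataset is bound to.
  override def schema(): StructType = datasetSchema

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  // The ScanBuilder / PartitionReader pair is where the custom IO-stream
  // decoding (e.g. SequenceFile records -> InternalRow) would live; omitted here.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    throw new UnsupportedOperationException("Scan not implemented in this sketch")
}
```

A user-facing read could then look like the following (the provider class and the `gvfs://` fileset path are illustrative):

```scala
val df = spark.read
  .format("com.example.FilesetDatasetProvider")  // hypothetical provider class
  .schema("id INT, payload STRING")              // schema-on-read, supplied explicitly
  .load("gvfs://fileset/my_catalog/my_schema/my_fileset")
```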
