Re: [I] [FEATURE] Support query the unified Fileset Data Source in Spark [gravitino]

via GitHub Wed, 23 Oct 2024 20:12:04 -0700


xloya commented on issue #5226:
URL: https://github.com/apache/gravitino/issues/5226#issuecomment-2434155379


   > @xloya @FANNG1 What about `G-sequence`, `G-tfrecord`, `G-parquet`? Using
   > 
   > ```
   > select * from `G-sequence`.`gvfs://`
   > ```
   > 
   > to read the data.
   > 
   > There is something I want to make clear: why do we need `G-parquet` rather 
than plain `parquet`? `G-parquet` binds to a table schema, whereas `parquet` 
infers the schema. Users write data with this table schema, so we can ensure 
data compatibility, on the assumption that the table schema evolves correctly. 
Another question is: if we have a table schema, why not write to a table? 
Because we need a file format, not a table. For example, in machine 
learning, we always use a file format, not a table format.
   
   Yes, I think we need to distinguish our formats from the default file format 
implementations that Spark supports. Users can of course continue to use Spark's 
default implementations, but our data source can provide enhanced 
capabilities. As for the name `G-parquet`, I think it is a little unclear: 
currently we only provide it for filesets, and I am not sure whether it 
will be used for other resources in the future. If not, we had better tie the 
name to fileset.
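
   To make the distinction between a schema-bound format and an inferring one 
concrete, here is a rough sketch in plain Python. It is only an illustration of 
the idea: `BoundSchema`, `infer_schema`, and `write_bound` are hypothetical 
names, not Gravitino or Spark APIs, and a real data source would work very 
differently.

```python
# Hypothetical sketch: none of these names are Gravitino or Spark APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundSchema:
    """A fixed column -> type mapping, standing in for a table schema."""
    columns: dict

    def validate(self, record: dict) -> None:
        # Reject records whose columns or value types differ from the bound schema.
        if set(record) != set(self.columns):
            raise ValueError(
                f"columns {sorted(record)} do not match {sorted(self.columns)}"
            )
        for name, expected in self.columns.items():
            if not isinstance(record[name], expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")


def infer_schema(records: list) -> dict:
    """Mimic schema inference: derive column types from the first record."""
    return {name: type(value) for name, value in records[0].items()}


def write_bound(schema: BoundSchema, records: list) -> list:
    """Validate every record before 'writing' (here: just returning) it."""
    for record in records:
        schema.validate(record)
    return list(records)
```

With an inferring format, whatever the first file happens to contain decides 
the schema; with a bound schema, an incompatible record is rejected at write 
time, which is the compatibility guarantee described in the quoted comment.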


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
