jiwq opened a new issue, #4104: URL: https://github.com/apache/gravitino/issues/4104
### Describe the proposal ## What’s Dataset? A dataset is a collection of data. In the case of tabular data, a dataset is a generic dataset used to describe any data stored in rows and columns, where the rows represent an example and the columns represent a feature (can be continuous or categorical). For training a deep learning model, the dataset maybe should be split to train and test. In general, the train dataset used in the training stage and the test dataset used in eval stage. ## Dataset Object There are two types of dataset objects, a regular Dataset and then an IterableDataset. A Dataset provides fast random access to the rows, and memory-mapping so that loading even large datasets only uses a relatively small amount of device memory. But for really, really big datasets that won’t even fit on disk or in memory, an IterableDataset allows you to access and use the dataset without waiting for it to download completely! Split dataset represents a dictionary, key is the split name and value is Dataset object. ### Task list - [ ] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
