xunliu opened a new issue, #4218: URL: https://github.com/apache/gravitino/issues/4218
### Describe the proposal

Datasets is a library for easily accessing and sharing tabular structured data as well as non-tabular datasets for audio, computer vision, and natural language processing (NLP) tasks. When training a deep learning model, a dataset is usually split into train and test sets. In general, the training set is used in the training stage and the test set is used in the evaluation stage.

## 1.1 Dataset Object

There are two types of dataset objects: a regular Dataset and an IterableDataset. A Dataset provides fast random access to rows and uses memory mapping, so loading even a large dataset uses only a relatively small amount of device memory. For datasets too large to fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely.

A split dataset is represented as a dictionary: the key is the split name and the value is the Dataset object.

## 1.2 Split

As described above, datasets are typically split into sub-datasets that are used at different stages of model training, such as training, testing, and evaluation.

# 2. Create Dataset

Before supporting these features, Gravitino should support metadata management for model training and access control. The following feature design is based on these assumptions.

<img width="936" alt="image" src="https://github.com/user-attachments/assets/57dc6922-67ee-44ed-aa51-bd3cbf0e8193">

# 3. Load Dataset

Wherever a dataset is stored, the Gravitino Datasets library should help the user load it through Apache Gravitino. We therefore propose the following architecture for loading datasets in the Gravitino Datasets library:

<img width="929" alt="image" src="https://github.com/user-attachments/assets/ed8f40fe-8a6f-4268-bfa4-a3d0186f89aa">

## 3.1 Catalog

Loading a dataset from Gravitino should use the granted token. The Gravitino Datasets library gets the metadata from Gravitino and generates the sub-datasets for the user.
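
As a rough illustration of the flow described in this proposal (metadata in, one dataset object per split out), here is a minimal Python sketch. It does not use any real Gravitino API: `SplitMetadata`, `MapDataset`, `StreamDataset`, and `build_splits` are hypothetical names, and in-memory rows stand in for the storage locations that real metadata would point to. Token handling is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Sequence, Union


@dataclass(frozen=True)
class SplitMetadata:
    """Hypothetical metadata record for one split (not a Gravitino type)."""
    name: str                # split name, e.g. "train" or "test"
    rows: Sequence[dict]     # stand-in for the data behind the split's location


class MapDataset:
    """Regular dataset: supports len() and random access by row index."""

    def __init__(self, rows: Sequence[dict]) -> None:
        self._rows = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> dict:
        return self._rows[index]


class StreamDataset:
    """Iterable dataset: rows are yielded one at a time, no random access."""

    def __init__(self, rows: Sequence[dict]) -> None:
        self._rows = rows

    def __iter__(self) -> Iterator[dict]:
        for row in self._rows:
            yield row


def build_splits(
    metadata: List[SplitMetadata], streaming: bool = False
) -> Dict[str, Union[MapDataset, StreamDataset]]:
    """Return a dictionary keyed by split name, one dataset object per split."""
    dataset_cls = StreamDataset if streaming else MapDataset
    return {entry.name: dataset_cls(entry.rows) for entry in metadata}


# Usage with made-up metadata for two splits.
example_metadata = [
    SplitMetadata("train", [{"x": 1}, {"x": 2}, {"x": 3}]),
    SplitMetadata("test", [{"x": 4}]),
]
splits = build_splits(example_metadata)
second_train_row = splits["train"][1]   # random access into the train split
```

Passing `streaming=True` returns iterable datasets instead, which mirrors the Dataset versus IterableDataset distinction from section 1.1 without committing to any particular backend.
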
<img width="936" alt="image" src="https://github.com/user-attachments/assets/969aa6ba-d2ef-4ad1-8f05-c723a280132b">

# Design Document

1. https://docs.google.com/document/d/1_gMfkiwc4T56xtE0ZRpla_yD09hqf2MSHKAsWbK-eSc/edit
2. https://docs.google.com/document/d/1NdHc52U6tW9acHNWOfGiCEr08XO6VlcHf1q-n8mD60w/edit

### Task list

- [ ] Dataset Object
- [ ] Load catalog from Gravitino
- [ ] Using Datasets with TensorFlow
- [ ] Use with PyTorch
- [ ] Use with Spark
