[I] [EPIC] Dataset library [gravitino]

via GitHub Sat, 20 Jul 2024 04:46:43 -0700


xunliu opened a new issue, #4218:
URL: https://github.com/apache/gravitino/issues/4218


   ### Describe the proposal
   
   Datasets is a library for easily accessing and sharing tabular structured 
data, as well as non-tabular datasets for audio, computer vision, and natural 
language processing (NLP) tasks.
   
   For training a deep learning model, a dataset is usually split into a 
training set and a test set. In general, the training set is used in the 
training stage and the test set is used in the evaluation stage.
   
   ## 1.1 Dataset Object
   There are two types of dataset objects: a regular Dataset and an 
IterableDataset. A Dataset provides fast random access to its rows and uses 
memory mapping, so loading even a large dataset uses only a relatively small 
amount of device memory. For datasets too large to fit on disk or in memory, 
an IterableDataset lets you access and use the data without waiting for the 
download to complete.
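   
   The difference between the two access patterns can be sketched in plain Python. This is only an illustration: `MapStyleDataset` and `StreamingDataset` are hypothetical names, not classes from the proposed library, and the sketch does not implement memory mapping or remote downloads.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


class MapStyleDataset:
    """Hypothetical stand-in for a regular Dataset: len() and random access."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Row:
        return self._rows[index]


class StreamingDataset:
    """Hypothetical stand-in for an IterableDataset: rows are produced lazily."""

    def __init__(self, num_rows: int) -> None:
        self._num_rows = num_rows

    def __iter__(self) -> Iterator[Row]:
        # A real implementation would read from storage chunk by chunk;
        # here rows are generated on the fly to keep the sketch self-contained.
        for i in range(self._num_rows):
            yield {"id": i, "text": f"row-{i}"}


regular = MapStyleDataset([{"id": i, "text": f"row-{i}"} for i in range(5)])
streamed = StreamingDataset(num_rows=5)

third_row = regular[2]                 # random access by index
first_streamed = next(iter(streamed))  # sequential access only
```

   The streaming variant deliberately has no `__len__` or `__getitem__`, which is the trade-off that lets it start yielding rows before the full dataset is available.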
   
   A split dataset is represented as a dictionary: each key is a split name 
and each value is the corresponding Dataset object.
   
   ## 1.2 Split
   As described above, a dataset is typically split into sub-datasets that are 
used at different stages of model training, such as training, testing, and 
evaluation.
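   
   As a minimal sketch of the dictionary-of-splits idea, the helper below shuffles rows and returns a mapping from split name to sub-dataset. The function name `split_rows`, its parameters, and the use of plain lists in place of Dataset objects are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_rows(
    rows: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Dict[str, List[T]]:
    """Shuffle rows deterministically and return {"train": ..., "test": ...}."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    test_size = int(len(shuffled) * test_fraction)
    return {"train": shuffled[test_size:], "test": shuffled[:test_size]}


splits = split_rows(list(range(10)), test_fraction=0.2, seed=42)
train_rows = splits["train"]  # look up a sub-dataset by its split name
test_rows = splits["test"]
```

   Every row lands in exactly one split, so the sub-datasets never overlap.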
   
   # 2. Create Dataset
   Before supporting these features, Gravitino should support metadata 
management for model training and access control features. The following 
feature design is based on these assumptions.
   <img width="936" alt="image" 
src="https://github.com/user-attachments/assets/57dc6922-67ee-44ed-aa51-bd3cbf0e8193">
   
   # 3. Load Dataset
   Wherever a dataset is stored, the Gravitino Datasets library should help 
the user load it through Apache Gravitino. We therefore propose the 
architecture outlined below for loading datasets:
   
   <img width="929" alt="image" 
src="https://github.com/user-attachments/assets/ed8f40fe-8a6f-4268-bfa4-a3d0186f89aa">
   
   ## 3.1 Catalog
   Loading a dataset from Gravitino should use the granted token. The 
Gravitino Datasets library gets the metadata from Gravitino and generates the 
sub-datasets for the user.
   
   
   <img width="936" alt="image" 
src="https://github.com/user-attachments/assets/969aa6ba-d2ef-4ad1-8f05-c723a280132b">
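   
   The flow described in this section (present a granted token, fetch metadata, derive sub-datasets) might look roughly like the sketch below. Everything here is hypothetical: `InMemoryCatalog`, `DatasetMetadata`, `load_dataset`, the token value, and the file names are stand-ins invented for illustration and are not the Gravitino client API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record: dataset name plus file locations per split."""

    name: str
    split_files: Dict[str, List[str]]


class InMemoryCatalog:
    """In-memory stand-in for a metadata service (assumption, not a real client)."""

    def __init__(
        self, granted_tokens: Set[str], entries: Dict[str, DatasetMetadata]
    ) -> None:
        self._granted_tokens = granted_tokens
        self._entries = entries

    def get_metadata(self, name: str, token: str) -> DatasetMetadata:
        # Access control: only granted tokens may read metadata.
        if token not in self._granted_tokens:
            raise PermissionError("token has not been granted access")
        if name not in self._entries:
            raise KeyError(f"unknown dataset: {name}")
        return self._entries[name]


def load_dataset(
    catalog: InMemoryCatalog, name: str, token: str
) -> Dict[str, List[str]]:
    """Fetch metadata with the token, then build one sub-dataset per split."""
    metadata = catalog.get_metadata(name, token)
    return {split: list(files) for split, files in metadata.split_files.items()}


catalog = InMemoryCatalog(
    granted_tokens={"example-token"},  # made-up token for the sketch
    entries={
        "reviews": DatasetMetadata(
            name="reviews",
            split_files={
                "train": ["reviews/train-0.parquet", "reviews/train-1.parquet"],
                "test": ["reviews/test-0.parquet"],
            },
        )
    },
)

sub_datasets = load_dataset(catalog, "reviews", token="example-token")
```

   In this sketch each sub-dataset is just a list of file names; a real library would wrap those locations in Dataset objects that can actually read the data.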
   
   # Design Document
   1. https://docs.google.com/document/d/1_gMfkiwc4T56xtE0ZRpla_yD09hqf2MSHKAsWbK-eSc/edit
   2. https://docs.google.com/document/d/1NdHc52U6tW9acHNWOfGiCEr08XO6VlcHf1q-n8mD60w/edit
   
   ### Task list
   
   - [ ] Dataset Object
   - [ ] Load catalog from Gravitino
   - [ ] Use with TensorFlow
   - [ ] Use with PyTorch
   - [ ] Use with Spark


