[I] [EPIC] Intro the Datasets library to easily accessing and pushing dataset from Gravitino [gravitino]

via GitHub Mon, 08 Jul 2024 05:09:43 -0700


jiwq opened a new issue, #4104:
URL: https://github.com/apache/gravitino/issues/4104


   ### Describe the proposal
   
   ## What’s a Dataset?
   A dataset is a collection of data. For tabular data, a dataset describes any 
data stored in rows and columns, where each row represents an example and each 
column represents a feature (which can be continuous or categorical).
   
   When training a deep learning model, the dataset is usually split into train 
and test sets. In general, the train dataset is used in the training stage and 
the test dataset is used in the evaluation stage.
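
   As a rough illustration of the row/column view and the train/test split 
described above, here is a minimal plain-Python sketch. The toy rows, the 
`train_test_split` helper, and its `test_fraction` and `seed` parameters are 
hypothetical and are not part of Gravitino or of any proposed API.

```python
import random

# Hypothetical toy table: each dict is one row (an example), each key is a
# column (a feature). "value" is continuous, "label" is categorical.
rows = [
    {"id": i, "value": i * 0.5, "label": "even" if i % 2 == 0 else "odd"}
    for i in range(10)
]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and return (train_rows, test_rows)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = train_test_split(rows, test_fraction=0.2, seed=42)
```

   The split is seeded so that the same rows land in the same split on every 
run, which keeps evaluation results comparable between runs.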
   
   ## Dataset Object
   There are two types of dataset objects: a regular Dataset and an 
IterableDataset. A Dataset provides fast random access to the rows, and uses 
memory-mapping so that loading even large datasets uses only a relatively small 
amount of device memory. For very large datasets that won’t fit on disk or in 
memory, an IterableDataset allows you to access and use the dataset without 
waiting for it to download completely.
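
   The difference in access pattern between the two object types can be 
sketched in plain Python. The class names `InMemoryDataset` and 
`StreamingDataset` and the `generate_rows` source are hypothetical stand-ins 
chosen for illustration. The sketch shows only random access versus lazy 
iteration and does not implement memory-mapping or downloading.

```python
import itertools
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]


class InMemoryDataset:
    """Map-style stand-in: supports len() and random access by row index."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Row:
        return self._rows[index]


class StreamingDataset:
    """Iterable-style stand-in: yields rows lazily, with no len() or indexing."""

    def __init__(self, row_source: Callable[[], Iterator[Row]]) -> None:
        self._row_source = row_source  # called again for each new pass

    def __iter__(self) -> Iterator[Row]:
        return iter(self._row_source())


def generate_rows() -> Iterator[Row]:
    # Stands in for a remote source that is too large to materialize at once.
    for i in range(1_000_000):
        yield {"id": i, "value": i * i}


regular = InMemoryDataset([{"id": i, "value": i * i} for i in range(5)])
streaming = StreamingDataset(generate_rows)

third_row = regular[2]                              # direct access by index
first_three = list(itertools.islice(streaming, 3))  # reads only three rows
```

   Only three rows are produced from the streaming source; the remaining rows 
are never generated, which is the property that makes the iterable form usable 
before a full download finishes.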
   
   A split dataset is represented as a dictionary in which each key is a split 
name and each value is a Dataset object.
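
   A minimal sketch of that dictionary shape follows. Plain lists stand in for 
Dataset objects, and the split names and row contents are made up for 
illustration only.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical split dataset: key = split name, value = the rows of that split.
# Plain lists are used here in place of real Dataset objects.
split_dataset: Dict[str, List[Row]] = {
    "train": [
        {"text": "great product", "label": 1},
        {"text": "did not work", "label": 0},
        {"text": "works as expected", "label": 1},
    ],
    "test": [
        {"text": "would buy again", "label": 1},
    ],
}

# Look up one split by name, then summarize the size of every split.
train_split = split_dataset["train"]
split_sizes = {name: len(rows) for name, rows in split_dataset.items()}
```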
   
   
   ### Task list
   
   - [ ] 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

