cdmikechen commented on PR #989:
URL: https://github.com/apache/submarine/pull/989#issuecomment-1252412078

   @FatalLin 
   After https://github.com/apache/submarine/pull/994 was merged, there is a
file conflict. Once the conflict is resolved, I think this PR can be
merged first.
   
   Meanwhile, after today's meeting I reconsidered, and I think we could adapt
this prehandler operation by adding a `load_dataset`
method to the `submarine-sdk`.
   For example, we could modify quickstart to look like this:
   ```python
   hdfs_config = {'dfs.nameservices': 'example-cluster', 
'dfs.ha.namenodes.example-cluster': 'nn1,nn2', ...}
   dataset = submarine.load_dataset('hdfs', hdfs_config,
'hdfs://warehouse/datasets/***.parquet')
   ```
   
   The underlying implementation of `load_dataset` would register the prehandler
service as a pod, and perform the dataset loading/training steps of the experiment
only after the data has been copied successfully. In distributed mode, we can block
until the pod finishes, so that each worker waits for the data copy to
complete.
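   To make the idea concrete, here is a minimal sketch of the blocking behavior described above. All names here (`register_pod`, `get_pod_phase`, the local path layout) are hypothetical placeholders, not real submarine-sdk or kubernetes-client APIs; in a real implementation they would wrap the Kubernetes API:
   ```python
   import time

   def load_dataset(source_type, config, uri,
                    register_pod, get_pod_phase,
                    poll_interval=1.0, timeout=600.0):
       """Launch a prehandler pod that copies `uri` to local storage,
       then block until the copy finishes so every worker sees the data.

       `register_pod` and `get_pod_phase` are hypothetical hooks standing
       in for the real pod-creation / pod-status calls."""
       pod_name = register_pod(source_type, config, uri)  # create the prehandler pod
       deadline = time.monotonic() + timeout
       while time.monotonic() < deadline:
           phase = get_pod_phase(pod_name)
           if phase == "Succeeded":                 # data copy finished
               return f"/tmp/datasets/{pod_name}"   # assumed local path written by the pod
           if phase == "Failed":
               raise RuntimeError(f"prehandler pod {pod_name} failed")
           time.sleep(poll_interval)                # workers block here until the copy completes
       raise TimeoutError(f"prehandler pod {pod_name} did not finish in time")
   ```
   The point of the sketch is the polling loop: each worker calls `load_dataset` and does not proceed to training until the prehandler pod reports `Succeeded`, which gives us the distributed-mode blocking for free.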
   I will follow up later to see how `kubeflow` does distributed dataset
loading. In the meantime, another project,
[huggingface-datasets](https://github.com/huggingface/datasets), has some good ideas I think we
should learn from (huggingface also seems to download datasets locally first).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
