cdmikechen commented on PR #989: URL: https://github.com/apache/submarine/pull/989#issuecomment-1252412078
@FatalLin After https://github.com/apache/submarine/pull/994 was merged, there is a file conflict. Once the conflict is resolved, I think this PR can be merged first.

Meanwhile, I reconsidered after today's meeting, and I think it should be possible to adapt this prehandler operation by adding a `load_dataset` method to the `submarine-sdk`. For example, we could modify the quickstart to look like this:

```python
hdfs_config = {
    'dfs.nameservices': 'example-cluster',
    'dfs.ha.namenodes.example-cluster': 'nn1,nn2',
    # ...
}
dataset = submarine.load_datasets('hdfs', hdfs_config, 'hdfs://warehouse/datasets/***.parquet')
```

The underlying implementation of `load_dataset` would register the prehandler service as a pod, and perform the dataset loading/training of the experiment once the data copy has succeeded. In distributed mode, we can block until the pod is finished, so that each worker waits for the data copy to complete. I will follow up later to see how `kubeflow` does distributed dataset loading. In the meantime, there are some good ideas in another project, [huggingface-datasets](https://github.com/huggingface/datasets), that I think we should learn from (huggingface also seems to download datasets locally first).
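To make the blocking behavior concrete, here is a minimal sketch of how `load_dataset` could wait on the prehandler pod before returning. This is an illustration only, not the actual `submarine-sdk` API: the names `wait_for_pod` and `get_phase`, and the pod-launch step left as comments, are all assumptions; a real implementation would poll the pod phase via the Kubernetes client.

```python
import time


def wait_for_pod(get_phase, timeout=600, interval=1):
    """Block until the prehandler pod reports 'Succeeded'.

    `get_phase` is a hypothetical callable returning the pod phase
    string ('Pending', 'Running', 'Succeeded', 'Failed'); in a real
    implementation it would query the Kubernetes API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "Succeeded":
            return
        if phase == "Failed":
            raise RuntimeError("prehandler pod failed to copy the dataset")
        time.sleep(interval)
    raise TimeoutError("timed out waiting for the dataset copy to finish")


def load_dataset(source, config, path, get_phase):
    # Step 1 (not shown): register the prehandler service as a pod that
    # copies `path` from `source` to local storage, using `config`.
    # Step 2: every worker blocks here until the copy is done, so in
    # distributed mode all workers start training with the data present.
    wait_for_pod(get_phase, interval=0)
    # Step 3 (not shown): return a handle to the locally copied dataset.
    return {"source": source, "path": path, "ready": True}
```

With this shape, each worker in a distributed experiment calls `load_dataset` and simply waits in `wait_for_pod` until the shared copy pod succeeds, which gives the "blocking before the pod is finished" behavior described above.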
