[ https://issues.apache.org/jira/browse/SUBMARINE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
cdmikechen resolved SUBMARINE-1278.
-----------------------------------
    Resolution: Done

> Fetching data to k8s cluster before experiment's execution
> ----------------------------------------------------------
>
>                 Key: SUBMARINE-1278
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-1278
>             Project: Apache Submarine
>          Issue Type: New Feature
>            Reporter: Yu-Tang Lin
>            Assignee: Yu-Tang Lin
>            Priority: Major
>
> Per the discussion with Didi's users, they think it would be useful if
> Submarine first fetched the data from an external file system (e.g. HDFS,
> S3, etc.) into the cluster, so that the subsequent executions could read
> the data from the local environment.
> After a couple of discussions, we have two proposals for the above scenario.
> # Once the external data source has been set, Submarine launches an
> additional container as the experiment's init container. In this container
> we leverage fsspec to fetch the data and persist it into Apache Arrow, so
> the workers in the execution read the data from Arrow directly. In the
> termination phase, Submarine launches another container to clean up the
> data in Arrow. (A minimal sketch of this flow follows below.)
> # The flow is quite similar to option 1, except that we replace fsspec
> with Alluxio. But since we are not a solution focused on hybrid-cloud
> environments, I think the Alluxio tech stack is too heavy for us, so I
> prefer option 1.
> Regarding external file system integration, we will try to integrate
> HDFS (with Kerberos) as our first step.
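A minimal sketch of the option-1 flow described above, assuming fsspec and
pyarrow are available in the init container image. The HDFS URL, the shared
volume mount point, and the function names are hypothetical placeholders for
illustration, not part of Submarine's actual API or configuration.

import os

import fsspec
import pyarrow.csv as pa_csv
import pyarrow.feather as feather

REMOTE_URL = "hdfs://namenode:8020/datasets/train.csv"  # hypothetical HDFS source
LOCAL_ARROW = "/shared/train.arrow"                      # hypothetical cluster-local volume

def prefetch():
    """Init container: fetch the remote dataset once and persist it as Arrow."""
    # fsspec picks the filesystem implementation from the URL scheme
    # (hdfs://, s3://, ...), so the same code covers several backends.
    with fsspec.open(REMOTE_URL, "rb") as remote:
        table = pa_csv.read_csv(remote)
    # Feather v2 is the Arrow IPC file format; workers can memory-map it.
    feather.write_feather(table, LOCAL_ARROW)

def read_in_worker():
    """Worker: read the prepared data from the local volume instead of HDFS/S3."""
    return feather.read_table(LOCAL_ARROW)

def cleanup():
    """Termination-phase container: remove the persisted Arrow data."""
    if os.path.exists(LOCAL_ARROW):
        os.remove(LOCAL_ARROW)

if __name__ == "__main__":
    prefetch()

Since fsspec resolves the filesystem from the URL scheme, the same prefetch
code should cover S3 and other backends; the Kerberos-secured HDFS case would
presumably also need the usual Hadoop client configuration and a valid ticket
available inside the init container.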