Re: Spark on K8s - using files fetched by init-container?
Yes, you were pointing to HDFS on a loopback address...

From: Jenna Hoole
Sent: Monday, February 26, 2018 1:11:35 PM
To: Yinan Li; user@spark.apache.org
Subject: Re: Spark on K8s - using files fetched by init-container?
Re: Spark on K8s - using files fetched by init-container?
Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running now :) Thank you so much!

Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li wrote:

> OK, it looks like you will need to use
> `file:///var/spark-data/spark-files/flights.csv` instead. The 'file://'
> scheme must be explicitly used as it seems it defaults to 'hdfs' in your
> setup.
>
> On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole wrote:
>
>> Thank you for the quick response! However, I'm still having problems.
>>
>> When I try to look for /var/spark-data/spark-files/flights.csv I get
>> told:
>>
>> Error: Error in loadDF : analysis error - Path does not exist:
>> hdfs://192.168.0.1:8020/var/spark-data/spark-files/flights.csv;
>> Execution halted
>> Exception in thread "main" org.apache.spark.SparkUserAppException: User
>> application exited with 1
>> at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
>> at org.apache.spark.deploy.RRunner.main(RRunner.scala)
>>
>> And when I try to look for
>> local:///var/spark-data/spark-files/flights.csv, I get:
>>
>> Error in file(file, "rt") : cannot open the connection
>> Calls: read.csv -> read.table -> file
>> In addition: Warning message:
>> In file(file, "rt") :
>> cannot open file 'local:///var/spark-data/spark-files/flights.csv': No
>> such file or directory
>> Execution halted
>> Exception in thread "main" org.apache.spark.SparkUserAppException: User
>> application exited with 1
>> at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
>> at org.apache.spark.deploy.RRunner.main(RRunner.scala)
>>
>> I can see from a kubectl describe that the directory is getting mounted.
>>
>> Mounts:
>> /etc/hadoop/conf from hadoop-properties (rw)
>> /var/run/secrets/kubernetes.io/serviceaccount from spark-token-pxz79 (ro)
>> /var/spark-data/spark-files from download-files (rw)
>> /var/spark-data/spark-jars from download-jars-volume (rw)
>> /var/spark/tmp from spark-local-dir-0-tmp (rw)
>>
>> Is there something else I need to be doing in my set up?
>>
>> Thanks,
>> Jenna
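Putting the whole fix together, a minimal sketch of the working submission against Spark 2.3's Kubernetes scheduler (the master URL and container image below are placeholders, not values from this thread):

    bin/spark-submit \
      --master k8s://https://<api-server>:<port> \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=<spark-image> \
      --files hdfs://192.168.0.1:8020/user/jhoole/flights.csv \
      local:///opt/spark/examples/src/main/r/data-manipulation.R \
      file:///var/spark-data/spark-files/flights.csv

The final argument hands the R script the init-container's local copy with an explicit file:// scheme, so SparkR reads it from the container filesystem instead of resolving a bare path against HDFS.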
Re: Spark on K8s - using files fetched by init-container?
The files specified through --files are localized by the init-container to /var/spark-data/spark-files by default. So in your case, the file should be located at /var/spark-data/spark-files/flights.csv locally in the container.
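A quick way to confirm where the --files actually land, assuming kubectl access to the cluster (the driver pod name here is hypothetical; take yours from kubectl get pods):

    # List the default directory the init-container downloads --files into
    kubectl exec <driver-pod-name> -- ls -l /var/spark-data/spark-files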
Spark on K8s - using files fetched by init-container?
This is probably stupid user error, but I can't for the life of me figure out how to access the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R, which requires the path to its data file. I supply the HDFS location via --files and then the full HDFS path as the application argument.

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R hdfs://192.168.0.1:8020/user/jhoole/flights.csv

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file hdfs://192.168.0.1:8020/user/jhoole/flights.csv at hdfs://192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519
18/02/26 18:29:09 INFO util.Utils: Fetching hdfs://192.168.0.1:8020/user/jhoole/flights.csv to /var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv': No such file or directory
Execution halted
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error:

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv

Error: Error in loadDF : analysis error - Path does not exist: hdfs://192.168.0.1:8020/user/root/flights.csv;
Execution halted
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If the path /user/root/flights.csv does exist and I only supply "flights.csv" as the file path, it runs to completion successfully. However, if I provide the file path as "hdfs://192.168.0.1:8020/user/root/flights.csv", I get the same "No such file or directory" error as I do initially.

Since I obviously can't put all my hdfs files under /user/root, how do I get it to use the file that the init-container is fetching?

Thanks,
Jenna
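For context on the /user/root behavior above: Hadoop resolves a bare relative path against fs.defaultFS and the submitting user's HDFS home directory, so "flights.csv" submitted as root becomes hdfs://192.168.0.1:8020/user/root/flights.csv. A rough way to see this from a node with the hdfs CLI on the PATH:

    # The filesystem that scheme-less paths are resolved against
    hdfs getconf -confKey fs.defaultFS
    # Relative paths resolve under the current user's HDFS home directory
    hdfs dfs -ls flights.csv    # same as: hdfs dfs -ls /user/<user>/flights.csv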