[jira] [Commented] (SPARK-15855) dataframe.R example fails with "java.io.IOException: No input paths specified in job"

Shivaram Venkataraman (JIRA) Thu, 09 Jun 2016 17:53:47 -0700

    [ https://issues.apache.org/jira/browse/SPARK-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323667#comment-15323667 ]


Shivaram Venkataraman commented on SPARK-15855:
-----------------------------------------------

For the example to work in a distributed setup, the input file needs to be in 
HDFS or some other distributed storage system that every node can read. As 
written, the example is designed to work out of the box on a single machine. 
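
One way to satisfy this, sketched below, is to copy people.json into HDFS at 
the same path the example builds from SPARK_HOME. This assumes the cluster's 
default filesystem is HDFS (the "Listing hdfs://mycluster/..." line in the 
report suggests the path is resolved there) and that the submitting user may 
write to that HDFS directory. The fallback SPARK_HOME value is only 
illustrative, taken from the "Uploading resource" log line; adjust it for 
your installation.

```shell
#!/usr/bin/env bash
# Sketch: mirror people.json into HDFS under the same path that dataframe.R
# derives from SPARK_HOME. Assumes HDFS is the default filesystem and that
# the current user has write access to the target directory.

# Illustrative fallback only; set SPARK_HOME to match your installation.
SPARK_HOME="${SPARK_HOME:-/usr/hdp/current/spark-client}"
RESOURCE_DIR="$SPARK_HOME/examples/src/main/resources"
LOCAL_JSON="$RESOURCE_DIR/people.json"

if command -v hdfs >/dev/null 2>&1 && [ -f "$LOCAL_JSON" ]; then
  # Create the directory in HDFS, upload the file, then confirm it is there.
  hdfs dfs -mkdir -p "$RESOURCE_DIR" &&
    hdfs dfs -put -f "$LOCAL_JSON" "$RESOURCE_DIR/" &&
    hdfs dfs -ls "$RESOURCE_DIR/people.json"
else
  echo "Skipping upload: hdfs CLI or $LOCAL_JSON not found on this machine." >&2
fi
```

After the upload, rerunning the same sparkR command should let the read find 
the file. If writing under that HDFS path is not permitted, an alternative is 
to upload the file to a directory you own and change the path variable in 
dataframe.R accordingly.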

> dataframe.R example fails with "java.io.IOException: No input paths specified 
> in job"
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-15855
>                 URL: https://issues.apache.org/jira/browse/SPARK-15855
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 1.6.1
>            Reporter: Yesha Vora
>
> Steps:
> * Install R on all nodes
> * Run dataframe.R example.
> The example fails in both yarn-client and yarn-cluster mode with the error 
> message below.
> The application fails to find people.json.  {{path <- 
> file.path(Sys.getenv("SPARK_HOME"), 
> "examples/src/main/resources/people.json")}}
> {code}
> [xxx@xxx qa]$ sparkR --master yarn-client examples/src/main/r/dataframe.R
> Loading required package: methods
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:stats’:
>     cov, filter, lag, na.omit, predict, sd, var
> The following objects are masked from ‘package:base’:
>     colnames, colnames<-, intersect, rank, rbind, sample, subset,
>     summary, table, transform
> 16/05/24 22:08:21 INFO SparkContext: Running Spark version 1.6.1
> 16/05/24 22:08:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/24 22:08:22 INFO SecurityManager: Changing view acls to: hrt_qa
> 16/05/24 22:08:22 INFO SecurityManager: Changing modify acls to: hrt_qa
> 16/05/24 22:08:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hrt_qa); users 
> with modify permissions: Set(hrt_qa)
> 16/05/24 22:08:22 INFO Utils: Successfully started service 'sparkDriver' on 
> port 35792.
> 16/05/24 22:08:23 INFO Slf4jLogger: Slf4jLogger started
> 16/05/24 22:08:23 INFO Remoting: Starting remoting
> 16/05/24 22:08:23 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkdriveractorsys...@xx.xx.xx.xxx:49771]
> 16/05/24 22:08:23 INFO Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 49771.
> 16/05/24 22:08:23 INFO SparkEnv: Registering MapOutputTracker
> 16/05/24 22:08:23 INFO SparkEnv: Registering BlockManagerMaster
> 16/05/24 22:08:23 INFO DiskBlockManager: Created local directory at 
> /tmp/blockmgr-ffed73ad-3e67-4ae5-8734-9338136d3721
> 16/05/24 22:08:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 16/05/24 22:08:24 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/05/24 22:08:24 INFO Server: jetty-8.y.z-SNAPSHOT
> 16/05/24 22:08:24 INFO AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/05/24 22:08:24 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/05/24 22:08:24 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://xx.xx.xx.xxx:4040
> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
> 16/05/24 22:08:25 INFO Client: Requesting a new application from cluster with 
> 6 NodeManagers
> 16/05/24 22:08:25 INFO Client: Verifying our application has not requested 
> more than the maximum memory capability of the cluster (10240 MB per 
> container)
> 16/05/24 22:08:25 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/05/24 22:08:25 INFO Client: Setting up container launch context for our AM
> 16/05/24 22:08:25 INFO Client: Setting up the launch environment for our AM 
> container
> 16/05/24 22:08:26 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 16/05/24 22:08:26 INFO Client: Using the spark assembly jar on HDFS because 
> you are using HDP, 
> defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
> 16/05/24 22:08:26 INFO Client: Preparing resources for our AM container
> 16/05/24 22:08:26 INFO YarnSparkHadoopUtil: getting token for namenode: 
> hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003
> 16/05/24 22:08:26 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 187 for 
> hrt_qa on ha-hdfs:mycluster
> 16/05/24 22:08:28 INFO metastore: Trying to connect to metastore with URI 
> thrift://xxx:9083
> 16/05/24 22:08:28 INFO metastore: Connected to metastore.
> 16/05/24 22:08:28 INFO YarnSparkHadoopUtil: HBase class not found 
> java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
> 16/05/24 22:08:28 INFO Client: Using the spark assembly jar on HDFS because 
> you are using HDP, 
> defaultSparkAssembly:hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
> 16/05/24 22:08:28 INFO Client: Source and destination file systems are the 
> same. Not copying 
> hdfs://mycluster/hdp/apps/2.5.0.0-427/spark/spark-hdp-assembly.jar
> 16/05/24 22:08:29 INFO Client: Uploading resource 
> file:/usr/hdp/current/spark-client/examples/src/main/r/dataframe.R -> 
> hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003/dataframe.R
> 16/05/24 22:08:29 INFO Client: Uploading resource 
> file:/grid/0/spark/R/lib/sparkr.zip#sparkr -> 
> hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003/sparkr.zip
> 16/05/24 22:08:29 INFO Client: Uploading resource 
> file:/tmp/spark-1750e1e9-c468-44dc-9fdc-28b9d1a775c0/__spark_conf__4408641858810811953.zip
>  -> 
> hdfs://mycluster/user/hrt_qa/.sparkStaging/application_1463956206030_0003/__spark_conf__4408641858810811953.zip
> 16/05/24 22:08:29 INFO SecurityManager: Changing view acls to: hrt_qa
> 16/05/24 22:08:29 INFO SecurityManager: Changing modify acls to: hrt_qa
> 16/05/24 22:08:29 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hrt_qa); users 
> with modify permissions: Set(hrt_qa)
> 16/05/24 22:08:29 INFO Client: Submitting application 3 to ResourceManager
> 16/05/24 22:08:30 INFO YarnClientImpl: Submitted application 
> application_1463956206030_0003
> 16/05/24 22:08:30 INFO SchedulerExtensionServices: Starting Yarn extension 
> services with app application_1463956206030_0003 and attemptId None
> 16/05/24 22:08:31 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:31 INFO Client: 
>        client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>        diagnostics: AM container is launched, waiting for AM container to 
> Register with RM
>        ApplicationMaster host: N/A
>        ApplicationMaster RPC port: -1
>        queue: default
>        start time: 1464127709850
>        final status: UNDEFINED
>        tracking URL: http://xxx:8088/proxy/application_1463956206030_0003/
>        user: hrt_qa
> 16/05/24 22:08:32 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:33 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:34 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:35 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:36 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:37 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:38 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:39 INFO Client: Application report for 
> application_1463956206030_0003 (state: ACCEPTED)
> 16/05/24 22:08:39 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: 
> ApplicationMaster registered as NettyRpcEndpointRef(null)
> 16/05/24 22:08:39 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS 
> -> xxx PROXY_URI_BASES -> 
> http://xxx:8088/proxy/application_1463956206030_0003,http://xxx:8088/proxy/application_1463956206030_0003),
>  /proxy/application_1463956206030_0003
> 16/05/24 22:08:39 INFO JettyUtils: Adding filter: 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> 16/05/24 22:08:40 INFO Client: Application report for 
> application_1463956206030_0003 (state: RUNNING)
> 16/05/24 22:08:40 INFO Client: 
>        client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
>        diagnostics: N/A
>        ApplicationMaster host: xxx
>        ApplicationMaster RPC port: 0
>        queue: default
>        start time: 1464127709850
>        final status: UNDEFINED
>        tracking URL: http://xxx:8088/proxy/application_1463956206030_0003/
>        user: hrt_qa
> 16/05/24 22:08:40 INFO YarnClientSchedulerBackend: Application 
> application_1463956206030_0003 has started running.
> 16/05/24 22:08:40 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40317.
> 16/05/24 22:08:40 INFO NettyBlockTransferService: Server created on 40317
> 16/05/24 22:08:40 INFO BlockManagerMaster: Trying to register BlockManager
> 16/05/24 22:08:40 INFO BlockManagerMasterEndpoint: Registering block manager 
> xx.xx.xx.xxx:40317 with 511.1 MB RAM, BlockManagerId(driver, xxx, 40317)
> 16/05/24 22:08:40 INFO BlockManagerMaster: Registered BlockManager
> 16/05/24 22:08:40 INFO EventLoggingListener: Logging events to 
> hdfs:///spark-history/application_1463956206030_0003
> 16/05/24 22:08:47 INFO YarnClientSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (xxx:39482) with ID 2
> 16/05/24 22:08:47 INFO BlockManagerMasterEndpoint: Registering block manager 
> xxx:57829 with 511.1 MB RAM, BlockManagerId(2, xxx, 57829)
> 16/05/24 22:08:48 INFO YarnClientSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (xxx:38913) with ID 1
> 16/05/24 22:08:48 INFO BlockManagerMasterEndpoint: Registering block manager 
> xxx:42642 with 511.1 MB RAM, BlockManagerId(1, xxx, 42642)
> 16/05/24 22:08:48 INFO YarnClientSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/05/24 22:08:49 INFO SparkContext: Starting job: collectPartitions at 
> NativeMethodAccessorImpl.java:-2
> 16/05/24 22:08:49 INFO DAGScheduler: Got job 0 (collectPartitions at 
> NativeMethodAccessorImpl.java:-2) with 1 output partitions
> 16/05/24 22:08:49 INFO DAGScheduler: Final stage: ResultStage 0 
> (collectPartitions at NativeMethodAccessorImpl.java:-2)
> 16/05/24 22:08:49 INFO DAGScheduler: Parents of final stage: List()
> 16/05/24 22:08:49 INFO DAGScheduler: Missing parents: List()
> 16/05/24 22:08:49 INFO DAGScheduler: Submitting ResultStage 0 
> (ParallelCollectionRDD[0] at parallelize at RRDD.scala:460), which has no 
> missing parents
> 16/05/24 22:08:49 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 1344.0 B, free 1344.0 B)
> 16/05/24 22:08:49 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 890.0 B, free 2.2 KB)
> 16/05/24 22:08:49 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on xxx:40317 (size: 890.0 B, free: 511.1 MB)
> 16/05/24 22:08:49 INFO SparkContext: Created broadcast 0 from broadcast at 
> DAGScheduler.scala:1006
> 16/05/24 22:08:49 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 0 (ParallelCollectionRDD[0] at parallelize at RRDD.scala:460)
> 16/05/24 22:08:49 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
> 16/05/24 22:08:49 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> xxx, partition 0,PROCESS_LOCAL, 2230 bytes)
> 16/05/24 22:08:49 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on xxx:57829 (size: 890.0 B, free: 511.1 MB)
> 16/05/24 22:08:50 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
> in 714 ms on xxx (1/1)
> 16/05/24 22:08:50 INFO DAGScheduler: ResultStage 0 (collectPartitions at 
> NativeMethodAccessorImpl.java:-2) finished in 0.729 s
> 16/05/24 22:08:50 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have 
> all completed, from pool 
> 16/05/24 22:08:50 INFO DAGScheduler: Job 0 finished: collectPartitions at 
> NativeMethodAccessorImpl.java:-2, took 1.142339 s
> root
>  |-- name: string (nullable = true)
>  |-- age: double (nullable = true)
> 16/05/24 22:08:51 INFO JSONRelation: Listing 
> hdfs://mycluster/xxx/spark/examples/src/main/resources/people.json on driver
> 16/05/24 22:08:51 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 335.3 KB, free 337.5 KB)
> 16/05/24 22:08:51 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 28.8 KB, free 366.3 KB)
> 16/05/24 22:08:51 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on xxx:40317 (size: 28.8 KB, free: 511.1 MB)
> 16/05/24 22:08:51 INFO SparkContext: Created broadcast 1 from json at 
> NativeMethodAccessorImpl.java:-2
> 16/05/24 22:08:51 ERROR RBackendHandler: json on 15 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.io.IOException: No input paths specified in job
>       at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:202)
>       at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>       at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2
> Calls: read.json -> callJMethod -> invokeJava
> Execution halted
> 16/05/24 22:08:52 INFO SparkContext: Invoking stop() from shutdown hook
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static/sql,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/execution,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/SQL,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/metrics/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/api,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/environment/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/environment,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/storage/rdd,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/storage/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/storage,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/pool,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/stage,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/jobs/job/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/jobs/job,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/jobs/json,null}
> 16/05/24 22:08:52 INFO ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/jobs,null}
> 16/05/24 22:08:52 INFO SparkUI: Stopped Spark web UI at http://xxx:4040
> 16/05/24 22:08:52 INFO YarnClientSchedulerBackend: Interrupting monitor thread
> 16/05/24 22:08:52 INFO YarnClientSchedulerBackend: Shutting down all executors
> 16/05/24 22:08:52 INFO YarnClientSchedulerBackend: Asking each executor to 
> shut down
> 16/05/24 22:08:52 INFO SchedulerExtensionServices: Stopping 
> SchedulerExtensionServices
> (serviceOption=None,
>  services=List(),
>  started=false)
> 16/05/24 22:08:52 INFO YarnClientSchedulerBackend: Stopped
> 16/05/24 22:08:52 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 16/05/24 22:08:52 INFO MemoryStore: MemoryStore cleared
> 16/05/24 22:08:52 INFO BlockManager: BlockManager stopped
> 16/05/24 22:08:52 INFO BlockManagerMaster: BlockManagerMaster stopped
> 16/05/24 22:08:52 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 16/05/24 22:08:52 INFO SparkContext: Successfully stopped SparkContext
> 16/05/24 22:08:52 INFO ShutdownHookManager: Shutdown hook called
> 16/05/24 22:08:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 16/05/24 22:08:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-1750e1e9-c468-44dc-9fdc-28b9d1a775c0
> 16/05/24 22:08:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
