Jai Murugesh Rajasekaran created SPARK-12984:
------------------------------------------------
Summary: Not able to read CSV file using Spark 1.4.0
Key: SPARK-12984
URL: https://issues.apache.org/jira/browse/SPARK-12984
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.0
Environment: Unix
Hadoop 2.7.1.2.3.0.0-2557
R 3.1.1
No internet access on the server
Reporter: Jai Murugesh Rajasekaran
Hi,
We are trying to read a CSV file in SparkR.
We downloaded the following spark-csv artifacts (jar files) and placed them in the local Maven repository:
1. spark-csv_2.10-1.2.0.jar
2. spark-csv_2.10-1.2.0-sources.jar
3. spark-csv_2.10-1.2.0-javadoc.jar
We are trying to execute the following script:
> library(SparkR)
> sc <- sparkR.init(appName="SparkR-DataFrame")
Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or
restart R to create a new Spark Context
> sqlContext <- sparkRSQL.init(sc)
> setwd("/home/sXXXX/")
> getwd()
[1] "/home/sXXXX"
> path <- file.path("Sample.csv")
> Test <- read.df(sqlContext, path)
Note: I am able to read the CSV file using regular R functions, but the same attempt through the SparkR functions ends with an error.
SparkR was initiated with:
$ sh -x sparkR -v --repositories
/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
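For reference, `--repositories` in spark-submit expects Maven repository URLs for resolving `--packages` coordinates, not jar file paths. On a server without internet access, the usual approach (a sketch only, reusing the jar path from this report; the exact runtime dependency set of spark-csv 1.2.0 is an assumption to verify) is to pass the jars directly:

```
# Sketch, not a verified fix: pass the jar itself via --jars instead of --repositories.
# spark-csv 1.2.0 also has its own runtime dependencies (e.g. commons-csv), which
# may need to be appended to the same comma-separated --jars list.
sparkR --jars /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar
```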
Error Messages/Log
$ sh -x sparkR -v --repositories
/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
+++ dirname sparkR
++ cd ./..
++ pwd
+ export SPARK_HOME=/opt/spark-1.4.0
+ SPARK_HOME=/opt/spark-1.4.0
+ source /opt/spark-1.4.0/bin/load-spark-env.sh
++++ dirname sparkR
+++ cd ./..
+++ pwd
++ FWDIR=/opt/spark-1.4.0
++ '[' -z '' ']'
++ export SPARK_ENV_LOADED=1
++ SPARK_ENV_LOADED=1
++++ dirname sparkR
+++ cd ./..
+++ pwd
++ parent_dir=/opt/spark-1.4.0
++ user_conf_dir=/opt/spark-1.4.0/conf
++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
++ set -a
++ . /opt/spark-1.4.0/conf/spark-env.sh
+++ export SPARK_HOME=/opt/spark-1.4.0
+++ SPARK_HOME=/opt/spark-1.4.0
+++ export YARN_CONF_DIR=/etc/hadoop/conf
+++ YARN_CONF_DIR=/etc/hadoop/conf
+++ export HADOOP_CONF_DIR=/etc/hadoop/conf
+++ HADOOP_CONF_DIR=/etc/hadoop/conf
+++ export HADOOP_CONF_DIR=/etc/hadoop/conf
+++ HADOOP_CONF_DIR=/etc/hadoop/conf
++ set +a
++ '[' -z '' ']'
++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
++ export SPARK_SCALA_VERSION=2.10
++ SPARK_SCALA_VERSION=2.10
+ export -f usage
+ [[ -v --repositories
/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
= *--help ]]
+ [[ -v --repositories
/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
= *-h ]]
+ exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories
/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Revolution R Enterprise version 7.3: an enhanced distribution of R
Revolution Analytics packages Copyright (C) 2014 Revolution Analytics, Inc.
Type 'revo()' to visit www.revolutionanalytics.com for the latest
Revolution R news, 'forum()' for the community forum, or 'readme()'
for release notes.
Launching java with spark-submit command /opt/spark-1.4.0/bin/spark-submit
"--verbose" "--repositories"
"/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar"
"sparkr-shell" /tmp/RtmpO12CGx/backend_porteb570d7ca99
Using properties file: /opt/spark-1.4.0/conf/spark-defaults.conf
Adding default property:
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041
Adding default property:
spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041
Parsed arguments:
master local[*]
deployMode null
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile /opt/spark-1.4.0/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath null
driverExtraLibraryPath null
driverExtraJavaOptions -Dhdp.version=2.2.0.0-2041
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass null
primaryResource sparkr-shell
name sparkr-shell
childArgs [/tmp/RtmpO12CGx/backend_porteb570d7ca99]
jars null
packages null
repositories
/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
verbose true
Spark properties used, including those specified through
--conf and those from the properties file
/opt/spark-1.4.0/conf/spark-defaults.conf:
spark.driver.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
Main class:
org.apache.spark.api.r.RBackend
Arguments:
/tmp/RtmpO12CGx/backend_porteb570d7ca99
System properties:
SPARK_SUBMIT -> true
spark.app.name -> sparkr-shell
spark.driver.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
spark.master -> local[*]
Classpath elements:
16/01/21 10:44:34 INFO spark.SparkContext: Running Spark version 1.4.0
16/01/21 10:44:35 INFO spark.SecurityManager: Changing view acls to: sXXXX
16/01/21 10:44:35 INFO spark.SecurityManager: Changing modify acls to: sXXXX
16/01/21 10:44:35 INFO spark.SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(sXXXX); users with
modify permissions: Set(sXXXX)
16/01/21 10:44:36 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/01/21 10:44:36 INFO Remoting: Starting remoting
16/01/21 10:44:36 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://[email protected]:99999]
16/01/21 10:44:36 INFO util.Utils: Successfully started service 'sparkDriver'
on port 99999.
16/01/21 10:44:36 INFO spark.SparkEnv: Registering MapOutputTracker
16/01/21 10:44:36 INFO spark.SparkEnv: Registering BlockManagerMaster
16/01/21 10:44:36 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-522b123c-d80d-4b88-98a7-a251b071704e/blockmgr-8e7084f2-4b1b-465e-8ac1-5b4b3dcf44e5
16/01/21 10:44:36 INFO storage.MemoryStore: MemoryStore started with capacity
265.4 MB
16/01/21 10:44:37 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-522b123c-d80d-4b88-98a7-a251b071704e/httpd-61e30295-e750-4682-9420-37d5162b89c7
16/01/21 10:44:37 INFO spark.HttpServer: Starting HTTP Server
16/01/21 10:44:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/21 10:44:37 INFO server.AbstractConnector: Started
[email protected]:36797
16/01/21 10:44:37 INFO util.Utils: Successfully started service 'HTTP file
server' on port 36797.
16/01/21 10:44:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/01/21 10:44:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/21 10:44:37 INFO server.AbstractConnector: Started
[email protected]:4040
16/01/21 10:44:37 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
16/01/21 10:44:37 INFO ui.SparkUI: Started SparkUI at http://99.99.99.99:4040
16/01/21 10:44:37 INFO executor.Executor: Starting executor ID driver on host
localhost
16/01/21 10:44:37 INFO util.Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 36799.
16/01/21 10:44:37 INFO netty.NettyBlockTransferService: Server created on 36799
16/01/21 10:44:37 INFO storage.BlockManagerMaster: Trying to register
BlockManager
16/01/21 10:44:37 INFO storage.BlockManagerMasterEndpoint: Registering block
manager localhost:36799 with 265.4 MB RAM, BlockManagerId(driver, localhost,
36799)
16/01/21 10:44:37 INFO storage.BlockManagerMaster: Registered BlockManager
Welcome to SparkR!
Spark context is available as sc, SQL context is available as sqlContext
During startup - Warning message:
package ‘SparkR’ was built under R version 3.1.3
> library(SparkR)
> sc <- sparkR.init(appName="SparkR-DataFrame")
Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or
restart R to create a new Spark Context
> sqlContext <- sparkRSQL.init(sc)
> setwd("/home/sXXXX/")
> getwd()
[1] "/home/sXXXX"
> path <- file.path("Sample.csv")
> Test <- read.df(sqlContext, path)
16/01/21 10:46:14 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/01/21 10:46:14 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
16/01/21 10:46:14 ERROR r.RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under .
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:443)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$15.apply(newParquet.scala:385)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$15.apply(newParquet.scala:385)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:385)
at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:193)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:193)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:505)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:504)
at org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
... 25 more
Error: returnStatus == 0 is not TRUE
> path <- read.df(sqlContext, "/home/sXXXX/Sample.csv", source = "com.databricks.spark.csv")
16/01/21 10:46:48 ERROR r.RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
... 25 more
Error: returnStatus == 0 is not TRUE
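The two failures above are consistent with two separate issues (my reading of the traces, not confirmed in this report): the first `read.df` call omits `source`, so Spark 1.4 falls back to its default Parquet data source and fails with "No schema defined, and no Parquet data file or summary file found"; the second call names `com.databricks.spark.csv`, but the class is not on the classpath, because the jar paths were passed to `--repositories`, which expects Maven repository URLs. A SparkR sketch of the usual workaround, reusing the jar path from this report (the `header` option and env-var usage are illustrative, not verified on this cluster):

```
# Sketch under stated assumptions: the jar path is taken from this report;
# SPARKR_SUBMIT_ARGS must end with "sparkr-shell" in Spark 1.4.
Sys.setenv(SPARKR_SUBMIT_ARGS = paste(
  "--jars /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar",
  "sparkr-shell"))
library(SparkR)
sc <- sparkR.init(appName = "SparkR-DataFrame")
sqlContext <- sparkRSQL.init(sc)

# Name the data source explicitly; without `source`, read.df defaults to Parquet.
Test <- read.df(sqlContext, "/home/sXXXX/Sample.csv",
                source = "com.databricks.spark.csv", header = "true")
```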
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)