[
https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128573#comment-15128573
]
Jai Murugesh Rajasekaran commented on SPARK-12984:
--------------------------------------------------
Yes Sun Rui, it needs to be updated.
If the server has no internet access (in most corporate environments it won't,
for various security reasons), it would be good to document the offline
options.
I downloaded "spark-csv_2.10-1.3.0.jar" and "commons-csv-1.2.jar" manually from
the internet on my laptop and uploaded them to the server.
The following command attaches the jar files to sparkR:
$ sparkR --jars /home/sXXXX/spark-csv_2.10-1.3.0.jar,/home/sXXXX/commons-csv-1.2.jar
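For reference, here is a minimal end-to-end sketch of the offline workflow once
the jars are attached. The header option and the head() call are assumptions of
this sketch, not from the issue; the key point is that read.df must be given
source = "com.databricks.spark.csv", since without a source argument Spark 1.4
falls back to the default Parquet reader (the first error quoted below):

library(SparkR)

# Assumes sparkR was launched with --jars as above, so spark-csv and
# commons-csv are already on the classpath.
sc <- sparkR.init(appName = "SparkR-DataFrame")
sqlContext <- sparkRSQL.init(sc)

# Name the spark-csv data source explicitly; header = "true" (an assumption
# about Sample.csv) treats the first row as column names.
Test <- read.df(sqlContext, "/home/sXXXX/Sample.csv",
                source = "com.databricks.spark.csv", header = "true")
head(Test)

On a machine that does have internet access, sparkR --packages
com.databricks:spark-csv_2.10:1.3.0 would resolve the same artifact from Maven
Central instead. Note that --repositories only supplies additional repository
URLs for --packages to search, which is why pointing it at local jar paths (as
in the issue below) leaves spark-csv off the classpath.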
Sure, I will create a pull request.
Thanks
> Not able to read CSV file using Spark 1.4.0
> -------------------------------------------
>
> Key: SPARK-12984
> URL: https://issues.apache.org/jira/browse/SPARK-12984
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.4.0
> Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> Don't have Internet on the server
> Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file.
> We downloaded the following CSV-related packages (jar files) and configured
> them using Maven:
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> We are trying to execute the following script:
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/sXXXX/")
> > getwd()
> [1] "/home/sXXXX"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I
> tried the SparkR functions, it ended with the errors below.
> SparkR was initiated with:
> $ sh -x sparkR -v --repositories
> /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error Messages/Log
> $ sh -x sparkR -v --repositories
> /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
> ++++ dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++++ dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories
> /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> = *--help ]]
> + [[ -v --repositories
> /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories
> /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
> Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Revolution R Enterprise version 7.3: an enhanced distribution of R
> Revolution Analytics packages Copyright (C) 2014 Revolution Analytics, Inc.
> Type 'revo()' to visit www.revolutionanalytics.com for the latest
> Revolution R news, 'forum()' for the community forum, or 'readme()'
> for release notes.
> Launching java with spark-submit command /opt/spark-1.4.0/bin/spark-submit
> "--verbose" "--repositories"
> "/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar"
> "sparkr-shell" /tmp/RtmpO12CGx/backend_porteb570d7ca99
> Using properties file: /opt/spark-1.4.0/conf/spark-defaults.conf
> Adding default property:
> spark.yarn.am.extraJavaOptions=-Dhdp.version=2.2.0.0-2041
> Adding default property:
> spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041
> Parsed arguments:
> master local[*]
> deployMode null
> executorMemory null
> executorCores null
> totalExecutorCores null
> propertiesFile /opt/spark-1.4.0/conf/spark-defaults.conf
> driverMemory null
> driverCores null
> driverExtraClassPath null
> driverExtraLibraryPath null
> driverExtraJavaOptions -Dhdp.version=2.2.0.0-2041
> supervise false
> queue null
> numExecutors null
> files null
> pyFiles null
> archives null
> mainClass null
> primaryResource sparkr-shell
> name sparkr-shell
> childArgs [/tmp/RtmpO12CGx/backend_porteb570d7ca99]
> jars null
> packages null
> repositories
> /home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/sXXXX/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> verbose true
> Spark properties used, including those specified through
> --conf and those from the properties file
> /opt/spark-1.4.0/conf/spark-defaults.conf:
> spark.driver.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
> spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
> Main class:
> org.apache.spark.api.r.RBackend
> Arguments:
> /tmp/RtmpO12CGx/backend_porteb570d7ca99
> System properties:
> SPARK_SUBMIT -> true
> spark.app.name -> sparkr-shell
> spark.driver.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
> spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.2.0.0-2041
> spark.master -> local[*]
> Classpath elements:
> 16/01/21 10:44:34 INFO spark.SparkContext: Running Spark version 1.4.0
> 16/01/21 10:44:35 INFO spark.SecurityManager: Changing view acls to: sXXXX
> 16/01/21 10:44:35 INFO spark.SecurityManager: Changing modify acls to: sXXXX
> 16/01/21 10:44:35 INFO spark.SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(sXXXX); users
> with modify permissions: Set(sXXXX)
> 16/01/21 10:44:36 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/01/21 10:44:36 INFO Remoting: Starting remoting
> 16/01/21 10:44:36 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://[email protected]:99999]
> 16/01/21 10:44:36 INFO util.Utils: Successfully started service 'sparkDriver'
> on port 99999.
> 16/01/21 10:44:36 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/01/21 10:44:36 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/01/21 10:44:36 INFO storage.DiskBlockManager: Created local directory at
> /tmp/spark-522b123c-d80d-4b88-98a7-a251b071704e/blockmgr-8e7084f2-4b1b-465e-8ac1-5b4b3dcf44e5
> 16/01/21 10:44:36 INFO storage.MemoryStore: MemoryStore started with capacity
> 265.4 MB
> 16/01/21 10:44:37 INFO spark.HttpFileServer: HTTP File server directory is
> /tmp/spark-522b123c-d80d-4b88-98a7-a251b071704e/httpd-61e30295-e750-4682-9420-37d5162b89c7
> 16/01/21 10:44:37 INFO spark.HttpServer: Starting HTTP Server
> 16/01/21 10:44:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/01/21 10:44:37 INFO server.AbstractConnector: Started
> [email protected]:36797
> 16/01/21 10:44:37 INFO util.Utils: Successfully started service 'HTTP file
> server' on port 36797.
> 16/01/21 10:44:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/01/21 10:44:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/01/21 10:44:37 INFO server.AbstractConnector: Started
> [email protected]:4040
> 16/01/21 10:44:37 INFO util.Utils: Successfully started service 'SparkUI' on
> port 4040.
> 16/01/21 10:44:37 INFO ui.SparkUI: Started SparkUI at http://99.99.99.99:4040
> 16/01/21 10:44:37 INFO executor.Executor: Starting executor ID driver on host
> localhost
> 16/01/21 10:44:37 INFO util.Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36799.
> 16/01/21 10:44:37 INFO netty.NettyBlockTransferService: Server created on
> 36799
> 16/01/21 10:44:37 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 16/01/21 10:44:37 INFO storage.BlockManagerMasterEndpoint: Registering block
> manager localhost:36799 with 265.4 MB RAM, BlockManagerId(driver, localhost,
> 36799)
> 16/01/21 10:44:37 INFO storage.BlockManagerMaster: Registered BlockManager
> Welcome to SparkR!
> Spark context is available as sc, SQL context is available as sqlContext
> During startup - Warning message:
> package ‘SparkR’ was built under R version 3.1.3
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/sXXXX/")
> > getwd()
> [1] "/home/sXXXX"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> 16/01/21 10:46:14 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/01/21 10:46:14 WARN hdfs.BlockReaderLocal: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
> 16/01/21 10:46:14 ERROR r.RBackendHandler: load on 1 failed
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
> at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
> at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
> at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under .
> at scala.Predef$.assert(Predef.scala:179)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:443)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$15.apply(newParquet.scala:385)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$15.apply(newParquet.scala:385)
> at scala.Option.orElse(Option.scala:257)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:385)
> at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
> at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
> at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:193)
> at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:193)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
> at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:505)
> at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:504)
> at org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:120)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
> ... 25 more
> Error: returnStatus == 0 is not TRUE
> > path <- read.df(sqlContext, "/home/sXXXX/Sample.csv", source = "com.databricks.spark.csv")
> 16/01/21 10:46:48 ERROR r.RBackendHandler: load on 1 failed
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
> at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
> at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
> at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:216)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:229)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1230)
> ... 25 more
> Error: returnStatus == 0 is not TRUE