Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hey Shuporno,

Thanks for the prompt reply, and for noticing the silly mistake. I tried
this out, but I am still getting another error, which seems to be related to
connectivity.

>>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
> >>> a = spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
> header=True)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 28 more



Thanks,
Aakash.

On Fri, Dec 21, 2018 at 12:51 PM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

>
>
> On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <
> shuporno.choudh...@gmail.com> wrote:
>
>> Hi,
>> Your connection config uses 's3n' but your read command uses 's3a'.
>> The config keys for s3a are:
>> spark.hadoop.fs.s3a.access.key
>> spark.hadoop.fs.s3a.secret.key
>>
>> I feel this should solve the problem.
>>
>> On Fri, 21 Dec 2018 at 12:09, Aakash Basu-2 [via Apache Spark User List] <
>> ml+s1001560n34215...@n3.nabble.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to connect to AWS S3 and read a CSV file (running a POC) from
>>> a bucket.
>>>
>>> I have s3cmd set up and am able to run ls and other operations from the CLI.
>>>
>>> *Present Configuration:*
>>> Python 3.7
>>> Spark 2.3.1
>>>
>>> *JARs added:*
>>> hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
>>> aws-java-sdk-1.11.472.jar
>>>
>>> Trying out the following code:
>>>
>>> >>> sc=spark.sparkContext

 >>> hadoop_conf=sc._jsc.hadoopConfiguration()

 >>> hadoop_conf.set("fs.s3n.awsAccessKeyId", "abcd")

 >>> hadoop_conf.set("fs.s3n.awsSecretAccessKey", "xyz123")

 >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
 header=True)

 Traceback (most recent call last):

   File "", line 1, in 

   File
 

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Shuporno Choudhury
On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
> Your connection config uses 's3n' but your read command uses 's3a'.
> The config keys for s3a are:
> spark.hadoop.fs.s3a.access.key
> spark.hadoop.fs.s3a.secret.key
>
> I feel this should solve the problem.
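To make that concrete, a minimal PySpark sketch (the credentials are placeholders; the same keys can also be set on the SparkConf with the spark.hadoop. prefix):

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# equivalent to spark.hadoop.fs.s3a.access.key / spark.hadoop.fs.s3a.secret.key
hadoop_conf.set("fs.s3a.access.key", "<access-key>")
hadoop_conf.set("fs.s3a.secret.key", "<secret-key>")
df = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv", header=True)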
>
> On Fri, 21 Dec 2018 at 12:09, Aakash Basu-2 [via Apache Spark User List] <
> ml+s1001560n34215...@n3.nabble.com> wrote:
>
>> Hi,
>>
>> I am trying to connect to AWS S3 and read a CSV file (running a POC) from a
>> bucket.
>>
>> I have s3cmd set up and am able to run ls and other operations from the CLI.
>>
>> *Present Configuration:*
>> Python 3.7
>> Spark 2.3.1
>>
>> *JARs added:*
>> hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
>> aws-java-sdk-1.11.472.jar
>>
>> Trying out the following code:
>>
>> >>> sc=spark.sparkContext
>>>
>>> >>> hadoop_conf=sc._jsc.hadoopConfiguration()
>>>
>>> >>> hadoop_conf.set("fs.s3n.awsAccessKeyId", "abcd")
>>>
>>> >>> hadoop_conf.set("fs.s3n.awsSecretAccessKey", "xyz123")
>>>
>>> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
>>> header=True)
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "", line 1, in 
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>>> line 441, in csv
>>>
>>> return
>>> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>>> line 1257, in __call__
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
>>> line 63, in deco
>>>
>>> return f(*a, **kw)
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>>> line 328, in get_return_value
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
>>>
>>> : java.lang.NoClassDefFoundError:
>>> com/amazonaws/auth/AWSCredentialsProvider
>>>
>>> at java.lang.Class.forName0(Native Method)
>>>
>>> at java.lang.Class.forName(Class.java:348)
>>>
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>>>
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>>>
>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>>>
>>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>>>
>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>>>
>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>>>
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>>>
>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>>>
>>> at
>>> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>>>
>>> at
>>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>>>
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>>>
>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>>>
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>>
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>>
>>> at py4j.Gateway.invoke(Gateway.java:282)
>>>
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>>
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>
>>> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>>>
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.amazonaws.auth.AWSCredentialsProvider
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> ... 28 more
>>>
>>>
>>> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
>>> header=True)
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "", line 1, in 
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>>> line 441, in csv
>>>
>>> return
>>> 

Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hi,

I am trying to connect to AWS S3 and read a CSV file (running a POC) from a
bucket.

I have s3cmd set up and am able to run ls and other operations from the CLI.

*Present Configuration:*
Python 3.7
Spark 2.3.1

*JARs added:*
hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
aws-java-sdk-1.11.472.jar
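For reference, a hedged sketch of another way to attach these dependencies, usable only before the SparkContext has been created: spark.jars.packages lets Spark resolve hadoop-aws together with the AWS SDK version it was built against. The coordinates below mirror the hadoop-aws version above and are an assumption, not the exact setup used here.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-poc")
         # pulls hadoop-aws plus the aws-java-sdk version it declares as a dependency
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
         .getOrCreate())

This avoids pairing hadoop-aws and aws-java-sdk jars by hand.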

Trying out the following code:

>>> sc=spark.sparkContext
>
> >>> hadoop_conf=sc._jsc.hadoopConfiguration()
>
> >>> hadoop_conf.set("fs.s3n.awsAccessKeyId", "abcd")
>
> >>> hadoop_conf.set("fs.s3n.awsSecretAccessKey", "xyz123")
>
> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
> header=True)
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
>
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
>
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:282)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>
> at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 28 more
>
>
> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
> header=True)
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
>
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o67.csv.
>
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at
> 

Re:running updates using SPARK

2018-12-20 Thread 大啊
I think Spark is a computation engine designed for OLAP or ad-hoc workloads. Spark is not
a traditional relational database; UPDATE needs mandatory constraints such as
transactions and locks.






At 2018-12-21 06:05:54, "Gourav Sengupta"  wrote:

Hi,


Is there any chance that SPARK will support UPDATEs any time soon? Databricks does
provide Delta, which supports UPDATE, but I think that open source SPARK
does not have the UPDATE option.


HIVE has been supporting UPDATEs for a very long time now, and I was
wondering when that would become available to SPARK users. With GDPR this
becomes kind of important.




Regards,
Gourav

Re:Re: [Spark SQL]use zstd, No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD

2018-12-20 Thread 大啊
I think your Hive table is using that CompressionCodecName, but the
parquet-hadoop-bundle.jar on the Spark classpath is not the correct version.







At 2018-12-21 12:07:22, "Jiaan Geng"  wrote:
>I think your Hive table is using that CompressionCodecName, but the
>parquet-hadoop-bundle.jar on the Spark classpath is not the correct version.
>
>
>
>--
>Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>-
>To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Spark not working with Hadoop 4mc compression

2018-12-20 Thread Jiaan Geng
I think com.hadoop.compression.lzo.LzoCodec is not on the Spark classpath; please
put a suitable hadoop-lzo.jar into the directory spark_home/jars/.
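A quick way to sanity-check that from a PySpark shell (a sketch: this only verifies the driver side, and the jar still has to be present on every executor host):

jvm = spark.sparkContext._jvm
try:
    jvm.java.lang.Class.forName("com.hadoop.compression.lzo.LzoCodec")
    print("LzoCodec is visible on the driver classpath")
except Exception as err:
    print("LzoCodec not found:", err)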



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark SQL]use zstd, No enum constant parquet.hadoop.metadata.CompressionCodecName.ZSTD

2018-12-20 Thread Jiaan Geng
I think your Hive table is using that CompressionCodecName, but the
parquet-hadoop-bundle.jar on the Spark classpath is not the correct version.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: running updates using SPARK

2018-12-20 Thread Jiaan Geng
I think Spark is a computation engine designed for OLAP or ad-hoc workloads. Spark is not
a traditional relational database; UPDATE needs mandatory constraints such as
transactions and locks.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Multiple sessions in one application?

2018-12-20 Thread Jiaan Geng
This scenario is rare.
You may need it when you provide a web server on top of Spark, for example (see the sketch below).
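A minimal sketch of that pattern (the per-request framing is hypothetical): spark.newSession() gives each caller isolated SQL configuration and temporary views on top of one shared SparkContext:

from pyspark.sql import SparkSession

base = SparkSession.builder.appName("shared-app").getOrCreate()

# Two isolated sessions sharing the same SparkContext.
session_a = base.newSession()
session_b = base.newSession()

session_a.range(5).createOrReplaceTempView("t")
session_a.sql("SELECT COUNT(*) FROM t").show()
# session_b.sql("SELECT COUNT(*) FROM t") would fail: the temp view is scoped
# to session_a, while both sessions still share the same executors.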



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re:Re: Custom Metric Sink on Executor Always ClassNotFound

2018-12-20 Thread prosp4300



Thanks a lot for the explanation.
Spark declares the Sink trait as package private, which is why the package name looks
weird; the metrics system does not seem intended to be extended:
package org.apache.spark.metrics.sink
private[spark] trait Sink
Making the custom sink class available on every executor's system classpath is what
an application developer wants to avoid, because the sink is only required for a
specific application and it can be difficult to maintain.
If it were possible to get the MetricsSystem at the executor level and register the
custom sink there, the problem could be resolved in a better way, but I am not sure
how to achieve that.
Thanks a lot










At 2018-12-21 05:53:31, "Marcelo Vanzin"  wrote:
>First, it's really weird to use "org.apache.spark" for a class that is
>not in Spark.
>
>For executors, the jar file of the sink needs to be in the system
>classpath; the application jar is not in the system classpath, so that
>does not work. There are different ways for you to get it there, most
>of them manual (YARN is, I think, the only RM supported in Spark where
>the application itself can do it).
>
>On Thu, Dec 20, 2018 at 1:48 PM prosp4300  wrote:
>>
>> Hi, Spark Users
>>
>> I'm playing with Spark metric monitoring and want to add a custom sink, an
>> HttpSink that sends the metrics through a RESTful API.
>> A subclass of Sink, "org.apache.spark.metrics.sink.HttpSink", is created and
>> packaged within the application jar.
>>
>> It works for the driver instance, but once it is enabled for the executor instance,
>> the following ClassNotFoundException is thrown. This seems to be because the
>> MetricsSystem is started very early for the executor, before the application jar is
>> loaded.
>>
>> I wonder whether there is any way or best practice to add a custom sink for executor
>> instances?
>>
>> 18/12/21 04:58:32 ERROR MetricsSystem: Sink class 
>> org.apache.spark.metrics.sink.HttpSink cannot be instantiated
>> 18/12/21 04:58:32 WARN UserGroupInformation: PriviledgedActionException 
>> as:yarn (auth:SIMPLE) cause:java.lang.ClassNotFoundException: 
>> org.apache.spark.metrics.sink.HttpSink
>> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>> at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1933)
>> at 
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>> at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>> at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
>> at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>> Caused by: java.lang.ClassNotFoundException: 
>> org.apache.spark.metrics.sink.HttpSink
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
>> at 
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
>> at 
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
>> at 
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>> at 
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>> at 
>> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
>> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
>> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
>> at 
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
>> at 
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
>> at 
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>> ... 4 more
>> stdout0,*container_e81_1541584460930_3814_01_05
>> spark.log36118/12/21 04:58:00 ERROR 
>> org.apache.spark.metrics.MetricsSystem.logError:70 - Sink class 
>> org.apache.spark.metrics.sink.HttpSink cannot be instantiated
>>
>>
>>
>>
>
>
>
>-- 
>Marcelo


running updates using SPARK

2018-12-20 Thread Gourav Sengupta
Hi,

Is there any chance that SPARK will support UPDATEs any time soon? Databricks does
provide Delta, which supports UPDATE, but I think that open source SPARK
does not have the UPDATE option.

HIVE has been supporting UPDATEs for a very long time now, and I was
wondering when that would become available to SPARK users. With GDPR this
becomes kind of important.


Regards,
Gourav


Re: Custom Metric Sink on Executor Always ClassNotFound

2018-12-20 Thread Marcelo Vanzin
First, it's really weird to use "org.apache.spark" for a class that is
not in Spark.

For executors, the jar file of the sink needs to be in the system
classpath; the application jar is not in the system classpath, so that
does not work. There are different ways for you to get it there, most
of them manual (YARN is, I think, the only RM supported in Spark where
the application itself can do it).
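A hedged sketch of what that can look like once the sink jar has been copied to the same path on every worker host out of band (the /opt path and the sink name "http" are assumptions; the spark.metrics.conf.* properties are an alternative to shipping a metrics.properties file):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # the jar must already exist at this path on every executor host
         .config("spark.executor.extraClassPath", "/opt/spark-extra/http-sink.jar")
         # register the sink for executor instances
         .config("spark.metrics.conf.executor.sink.http.class",
                 "org.apache.spark.metrics.sink.HttpSink")
         .getOrCreate())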

On Thu, Dec 20, 2018 at 1:48 PM prosp4300  wrote:
>
> Hi, Spark Users
>
> I'm playing with Spark metric monitoring and want to add a custom sink, an
> HttpSink that sends the metrics through a RESTful API.
> A subclass of Sink, "org.apache.spark.metrics.sink.HttpSink", is created and
> packaged within the application jar.
>
> It works for the driver instance, but once it is enabled for the executor instance,
> the following ClassNotFoundException is thrown. This seems to be because the
> MetricsSystem is started very early for the executor, before the application jar is
> loaded.
>
> I wonder whether there is any way or best practice to add a custom sink for executor
> instances?
>
> 18/12/21 04:58:32 ERROR MetricsSystem: Sink class 
> org.apache.spark.metrics.sink.HttpSink cannot be instantiated
> 18/12/21 04:58:32 WARN UserGroupInformation: PriviledgedActionException 
> as:yarn (auth:SIMPLE) cause:java.lang.ClassNotFoundException: 
> org.apache.spark.metrics.sink.HttpSink
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1933)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.metrics.sink.HttpSink
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
> at 
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
> at 
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at 
> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
> ... 4 more
> stdout0,*container_e81_1541584460930_3814_01_05
> spark.log36118/12/21 04:58:00 ERROR 
> org.apache.spark.metrics.MetricsSystem.logError:70 - Sink class 
> org.apache.spark.metrics.sink.HttpSink cannot be instantiated
>
>
>
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Custom Metric Sink on Executor Always ClassNotFound

2018-12-20 Thread prosp4300
Hi, Spark Users


I'm playing with Spark metric monitoring and want to add a custom sink, an HttpSink
that sends the metrics through a RESTful API.
A subclass of Sink, "org.apache.spark.metrics.sink.HttpSink", is created and
packaged within the application jar.


It works for the driver instance, but once it is enabled for the executor instance, the
following ClassNotFoundException is thrown. This seems to be because the MetricsSystem
is started very early for the executor, before the application jar is loaded.


I wonder whether there is any way or best practice to add a custom sink for executor
instances?


18/12/21 04:58:32 ERROR MetricsSystem: Sink class 
org.apache.spark.metrics.sink.HttpSink cannot be instantiated
18/12/21 04:58:32 WARN UserGroupInformation: PriviledgedActionException as:yarn 
(auth:SIMPLE) cause:java.lang.ClassNotFoundException: 
org.apache.spark.metrics.sink.HttpSink
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1933)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.metrics.sink.HttpSink
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
... 4 more
stdout0,*container_e81_1541584460930_3814_01_05
spark.log36118/12/21 04:58:00 ERROR 
org.apache.spark.metrics.MetricsSystem.logError:70 - Sink class 
org.apache.spark.metrics.sink.HttpSink cannot be instantiated

[Spark cluster standalone v2.4.0] - problems with reverse proxy functionnality regarding submitted applications in cluster mode and the spark history server ui

2018-12-20 Thread Cheikh_SOW
Hello,

I have many spark clusters in standalone mode with 3 nodes each. One of them
is in HA with 3 masters and 3 workers and everything regarding the HA is
working fine. The second one is not in HA mode and we have one master and 3
workers.

In both of them, I have configured the reverse proxy for the UIs and
everything is working fine except two things:

1 - When I submit an application in cluster mode and the driver is launched
on a worker different from the one from which I run the submit command,
access to the application UI through the master UI via the reverse proxy is
broken and we get this error: *HTTP ERROR 502 \n Problem accessing
/myCustomProxyBase/proxy/app-20181203110740-. Reason: \n \tBad Gateway*
The worker UIs and the page that lists the running executors
(/myCustomProxyBase/app/?appId=app-20181218104130-0006) work fine.
However, when the driver is launched on the same worker from which the
submit command is executed, access to the application through the master
works fine, and the other URIs do too (I mean the problem disappears).

2 - When the reverse proxy is set, I can no longer access the history
server UI (many JS and CSS errors) in either cluster.

Really need help,

Thanks,
Cheikh SOW



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Problem running Spark on Kubernetes: Certificate error

2018-12-20 Thread Steven Stetzler
Hi Matt,

Thank you for the help. This worked for me. For posterity: convert the
data in the certificate-authority-data field of your DigitalOcean
Kubernetes configuration file, which is downloaded from their site, from
base64 to PEM format, save it to a file ca.crt, and then submit with --conf
spark.kubernetes.authenticate.driver.caCertFile=ca.crt.
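A small sketch of that conversion in Python (file names are placeholders; it assumes PyYAML is available to read the kubeconfig):

import base64
import yaml  # PyYAML; any YAML reader would do

# Decode certificate-authority-data from the downloaded kubeconfig into a PEM file.
with open("kubeconfig.yaml") as f:
    kubeconfig = yaml.safe_load(f)

ca_b64 = kubeconfig["clusters"][0]["cluster"]["certificate-authority-data"]
with open("ca.crt", "wb") as out:
    out.write(base64.b64decode(ca_b64))
# then: --conf spark.kubernetes.authenticate.driver.caCertFile=ca.crt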

As an extension to this question, I am also interested in using PySpark on
a JupyterHub deployment to interact with my cluster. The JupyterHub
deployment provides a Python (with PySpark) environment in a pod on a node
in the cluster, and I would like to create a SparkContext for computations
that uses the same cluster. I have attempted to replicate the above
solution there using:

from pyspark import *
> conf = SparkConf()
> conf.setMaster('k8s://')
> conf.set('spark.kubernetes.container.image', '')
> conf.set('spark.kubernetes.authenticate.driver.caCertFile', 'ca.crt')
> sc = SparkContext(conf=conf)


which returns an identical error:

---
> Py4JJavaError Traceback (most recent call last)
>  in 
> 22
> 23
> ---> 24 sc = SparkContext(conf=conf)
> /usr/local/spark/python/pyspark/context.py in __init__(self, master,
> appName, sparkHome, pyFiles, environment, batchSize, serializer, conf,
> gateway, jsc, profiler_cls)
> 116 try:
> 117 self._do_init(master, appName, sparkHome, pyFiles, environment,
> batchSize, serializer,
> --> 118 conf, jsc, profiler_cls) 119 except:
> 120 # If an error occurs, clean up in order to allow future SparkContext
> creation: /usr/local/spark/python/pyspark/context.py in _do_init(self,
> master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
> conf, jsc, profiler_cls)
> 178
> 179 # Create the Java SparkContext through Py4J
> --> 180 self._jsc = jsc or self._initialize_context(self._conf._jconf)
> 181 # Reset the SparkConf to the one actually used by the SparkContext in
> JVM.
> 182 self._conf = SparkConf(_jconf=self._jsc.sc().conf())
> /usr/local/spark/python/pyspark/context.py in _initialize_context(self,
> jconf)
> 286 Initialize SparkContext in function to allow subclass specific
> initialization
> 287 """
> --> 288 return self._jvm.JavaSparkContext(jconf)
> 289
> 290 @classmethod
> /usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in
> __call__(self, *args)
> 1523 answer = self._gateway_client.send_command(command)
> 1524 return_value = get_return_value(
> -> 1525 answer, self._gateway_client, None, self._fqn) 1526
> 1527 for temp_arg in temp_args:
> /usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in
> get_return_value(answer, gateway_client, target_id, name)
> 326 raise Py4JJavaError(
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value) 329 else:
> 330 raise Py4JError( Py4JJavaError: An error occurred while calling
> None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: External scheduler cannot be
> instantiated
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
> at org.apache.spark.SparkContext.(SparkContext.scala:493)
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:238)
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An
> error has occurred.
> at
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
> at
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
> at
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:167)
> at
> org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84)
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:64)
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
> ... 13 more
> Caused by: java.security.cert.CertificateException: Could not parse
> certificate: 

Re: Spark job on dataproc failing with Exception in thread "main" java.lang.NoSuchMethodError: com.googl

2018-12-20 Thread Muthu Jayakumar
The error reads as though Preconditions.checkArgument() is being called with a
parameter signature that does not exist in the Guava version actually on the classpath.
Could you check how many jars (before building the Uber jar) actually
contain this method signature?
I smell a jar version conflict or something similar.
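One hedged way to run that check (the jar directory is a placeholder) is to scan the candidate jars for Guava's Preconditions class before they are merged into the Uber jar:

import glob
import zipfile

target = "com/google/common/base/Preconditions.class"
for jar in glob.glob("/path/to/dependency/jars/*.jar"):
    try:
        with zipfile.ZipFile(jar) as zf:
            if target in zf.namelist():
                print("bundles Guava Preconditions:", jar)
    except zipfile.BadZipFile:
        pass

If more than one jar turns up, or the cluster's own Spark/Hadoop classpath supplies an older Guava, the checkArgument overload compiled against Guava 27 may simply not be the one loaded at runtime.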

Thanks
Muthu

On Thu, Dec 20, 2018, 02:40 Mich Talebzadeh 
wrote:

> Anyone in Spark user group seen this error in case?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 20 Dec 2018 at 09:38,  wrote:
>
>> Hi,
>>
>> I am trying a basic Spark job in Scala program. I compile it with SBT
>> with the following dependencies
>>
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0"
>> % "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" %
>> "1.6.1" % "provided"
>> libraryDependencies += "org.apache.phoenix" % "phoenix-spark" %
>> "4.6.0-HBase-1.0"
>> libraryDependencies += "org.apache.hbase" % "hbase" % "1.2.6"
>> libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.6"
>> libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.6"
>> libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.6"
>> libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" %
>> "2.2.0"
>> libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.8.1"
>> libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" %
>> "1.6.3"
>> libraryDependencies += "com.google.cloud.bigdataoss" %
>> "bigquery-connector" % "0.13.4-hadoop3"
>> libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" %
>> "1.9.4-hadoop3"
>> libraryDependencies += "com.google.code.gson" % "gson" % "2.8.5"
>> libraryDependencies += "com.google.guava" % "guava" % "27.0.1-jre"
>> libraryDependencies += "org.apache.httpcomponents" % "httpcore" % "4.4.8"
>>
>> It compiles fine and creates the Uber jar file. But when I run I get the
>> following error.
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
>> at
>> com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(BigQueryStrings.java:68)
>> at
>> com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration.configureBigQueryInput(BigQueryConfiguration.java:260)
>> at simple$.main(simple.scala:150)
>> at simple.main(simple.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at
>> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Sounds like there is an incompatibility in Guava versions between compile time
>> and runtime? These are the versions that are used:
>>
>>
>>- Java openjdk version "1.8.0_181"
>>- Spark version 2.3.2
>>- Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
>>
>>
>> Appreciate any feedback.
>>
>> Thanks,
>>
>> Mich
>>
>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google Cloud Dataproc Discussions" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to cloud-dataproc-discuss+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/cloud-dataproc-discuss/12ea0075-c18f-4c46-adbf-958ca24730d1%40googlegroups.com
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>


Re: Spark job on dataproc failing with Exception in thread "main" java.lang.NoSuchMethodError: com.googl

2018-12-20 Thread Mich Talebzadeh
Anyone in Spark user group seen this error in case?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 20 Dec 2018 at 09:38,  wrote:

> Hi,
>
> I am trying a basic Spark job in Scala program. I compile it with SBT with
> the following dependencies
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" %
> "1.6.1" % "provided"
> libraryDependencies += "org.apache.phoenix" % "phoenix-spark" %
> "4.6.0-HBase-1.0"
> libraryDependencies += "org.apache.hbase" % "hbase" % "1.2.6"
> libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.6"
> libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.6"
> libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.6"
> libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" %
> "2.2.0"
> libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.8.1"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" %
> "1.6.3"
> libraryDependencies += "com.google.cloud.bigdataoss" %
> "bigquery-connector" % "0.13.4-hadoop3"
> libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" %
> "1.9.4-hadoop3"
> libraryDependencies += "com.google.code.gson" % "gson" % "2.8.5"
> libraryDependencies += "com.google.guava" % "guava" % "27.0.1-jre"
> libraryDependencies += "org.apache.httpcomponents" % "httpcore" % "4.4.8"
>
> It compiles fine and creates the Uber jar file. But when I run I get the
> following error.
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
> at
> com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(BigQueryStrings.java:68)
> at
> com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration.configureBigQueryInput(BigQueryConfiguration.java:260)
> at simple$.main(simple.scala:150)
> at simple.main(simple.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Sounds like there is an incompatibility in Guava versions between compile time
> and runtime? These are the versions that are used:
>
>
>- Java openjdk version "1.8.0_181"
>- Spark version 2.3.2
>- Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
>
>
> Appreciate any feedback.
>
> Thanks,
>
> Mich
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google Cloud Dataproc Discussions" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cloud-dataproc-discuss+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/cloud-dataproc-discuss/12ea0075-c18f-4c46-adbf-958ca24730d1%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>


Re: [SPARK SQL] Difference between 'Hive on spark' and Spark SQL

2018-12-20 Thread Jörn Franke
If you already have a lot of queries, then it makes sense to look at Hive (in a
recent version) + Tez + LLAP, with all tables in ORC format, partitioned and sorted on
the filter columns. That would be the easiest way and can improve performance
significantly.

If you want to use Spark, e.g. because you want to use additional features and it
could become part of your strategy, justifying the investment:
* Hive on Spark - I don’t think it is used as much as the above combination. I
am also not sure whether it supports recent Spark versions and all Hive features. It
would also not really allow you to use Spark features beyond Hive features.
Basically you just set a different engine in Hive and execute the queries as
you do now.
* spark.sql: you would have to write all your Hive queries as Spark queries
and potentially integrate or rewrite Hive UDFs. Given that you can use a
HiveContext to execute queries, it may not require so much effort to rewrite
them (see the sketch after this list). The pushdown possibilities are available in
Spark. You have to write Spark programs to execute queries. There are some servers
that you can connect to using SQL queries, but their maturity varies.
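As a sketch of the spark.sql route (the table, columns and filter are placeholders), an existing Hive query can often be submitted unchanged through a Hive-enabled SparkSession:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-query-on-spark")
         .enableHiveSupport()   # reuse the existing Hive metastore
         .getOrCreate())

df = spark.sql(
    "SELECT region, SUM(amount) FROM sales WHERE dt = '2018-12-20' GROUP BY region")
df.show()

On this path Spark's Catalyst optimizer applies its own predicate pushdown and partition pruning, independently of Hive's hive.optimize.ppd setting.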

In the end you have to make an assessment of all your queries and investigate
whether they can be executed using either of the options.

> On 20.12.2018 at 08:17, l...@china-inv.cn wrote:
> 
> Hi, All, 
> 
> We are starting to migrate our data to the Hadoop platform, hoping to use 'Big
> Data' technologies to improve our business.
> 
> We are new to the area and want to get some help from you.
> 
> Currently all our data is put into Hive, and some complicated SQL query
> statements are run daily.
> 
> We want to improve the performance of these queries and have two options at
> hand:
> a. Turn on the 'Hive on Spark' feature and run HQL, or
> b. Run those query statements with Spark SQL.
> 
> What is the difference between these options?
> 
> Another question:
> There is a Hive setting, 'hive.optimize.ppd', that enables the 'predicate pushdown'
> query optimization.
> Is there an equivalent option in Spark SQL, or does the same setting also work for
> Spark SQL?
> 
> Thanks in advance 
> 
> Boying 
> 
> 