Re: AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Srinath C
You could use IAM roles in AWS to access the data in S3 without credentials.
See this link and this link for an example.
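
As a rough illustration (not taken from the links above), here is a minimal sketch of pointing the S3A connector at instance-profile credentials. The app name and bucket path are placeholders, and it assumes hadoop-aws plus the AWS SDK are on the classpath and that the nodes run with an IAM instance profile that can read the bucket:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-model-from-s3")
  // Let the S3A connector pick up credentials from the EC2 instance profile
  // instead of expecting keys in the URL or in the configuration.
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider")
  .getOrCreate()

// No access key or secret embedded in the path.
val model = PipelineModel.load("s3a://my-bucket/models/my-model")

Depending on the Hadoop version, the default S3A credential chain already tries the instance profile, so on a machine with the right role attached this may even work with no extra configuration at all.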

Regards,
Srinath.


On Thu, May 10, 2018 at 7:04 AM, Mina Aslani  wrote:

> Hi,
>
> I am trying to load an ML model from AWS S3 in my Spark app running in a
> Docker container; however, I need to pass the AWS credentials.
> My question is: why do I need to pass the credentials in the path?
> And what is the workaround?
>
> Best regards,
> Mina
>


AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Mina Aslani
Hi,

I am trying to load an ML model from AWS S3 in my Spark app running in a
Docker container; however, I need to pass the AWS credentials.
My question is: why do I need to pass the credentials in the path?
And what is the workaround?
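
(For reference, a sketch only, not a confirmed answer: one commonly used alternative to embedding keys in the s3a:// path is to hand them to the S3A connector through the Hadoop configuration. The app name, bucket path and environment-variable source are placeholders:)

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// Keys come from environment variables here only as an example; any secret store works.
val spark = SparkSession.builder()
  .appName("load-model")
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
  .getOrCreate()

// The path itself stays free of credentials.
val model = PipelineModel.load("s3a://my-bucket/models/my-model")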

Best regards,
Mina


Making a Spark Streaming application single-threaded

2018-05-09 Thread ravidspark
Hi All,

Is there any property that makes my Spark Streaming application
single-threaded?

I researched the property *spark.dynamicAllocation.maxExecutors=1*, but
as far as I understand it launches at most one container, not a
single thread. In local mode, we can configure the number of threads using
the local[N] master setting. But how can I do the same in cluster mode?
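
(A sketch of the closest cluster-mode equivalent I know of — one executor with a single task slot. This is an assumption, not necessarily the fix for the duplicate reads described below:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("single-slot-streaming")
  // One executor process with one core means at most one task runs at a time.
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "1")   // YARN: cap at one executor
  .config("spark.executor.cores", "1")       // one task slot per executor
  .config("spark.cores.max", "1")            // standalone/Mesos equivalent of the cap
  .getOrCreate()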

I am trying to read data from Kafka, and I see in my logs that every Kafka
message is being read 3 times. I want each message to be read only once. How
can I achieve this?


Thanks in advance, 
Ravi






[Structured-Streaming][Beginner] Out of order messages with Spark kafka readstream from a specific partition

2018-05-09 Thread karthikjay
On the producer side, I make sure data for a specific user lands on the same
partition. On the consumer side, I use a regular Spark Kafka readStream and
read the data. I also use a console write stream to print out the Spark
Kafka DataFrame. What I observe is that the data for a specific user (even
though it is in the same partition) arrives out of order in the console.
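
(A minimal sketch of the setup described above, with placeholder broker and topic names; printing the partition and offset next to the value makes the ordering easier to check:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-order-check").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "user-events")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "partition", "offset")

// Console sink, as in the description above, to eyeball the arrival order.
val query = df.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()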

I also verified the data ordering by running a simple Kafka consumer in Java,
and there the data seems to be ordered. What am I missing here?

Thanks,
JK






Problem with Spark Master shutting down when zookeeper leader is shutdown

2018-05-09 Thread agateaaa
Dear Spark community,

I just wanted to bring up this issue, which was filed for Spark 1.6.1
(https://issues.apache.org/jira/browse/SPARK-15544) but also exists in Spark
2.3.0 (https://issues.apache.org/jira/browse/SPARK-23530).

We have run into this in production, where the Spark Master shuts down if the
ZooKeeper leader on another node is shut down during our upgrade procedure.
In our opinion this is a serious issue, and it defeats the purpose of Spark
being highly available.
The rest of the software components, like Kafka, are not affected by a
ZooKeeper leader shutdown.

The problem manifests in an unusual way: it affects not the node that is
being rebooted or upgraded but some other node in the cluster, and it can
go unnoticed unless we are actively monitoring for this on the other nodes
during the upgrade.

(By the way, by "upgrade" we mean an upgrade of our application software
stack, which might include changes to base operating system packages, not a
Spark version upgrade.)

Can we increase the priority of these two JIRAs, or better still, can
someone please pick this issue up?

Thank you
Ashwin


Livy Failed error on Yarn with Spark

2018-05-09 Thread Chetan Khatri
All,

I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I run
the same Spark job using spark-submit, it succeeds and all transformations
are done.

When I submit the job through Livy, the Spark job is invoked and succeeds,
but the YARN status says FAILED, and when you take a look at the attempt
logs, the log says SUCCESS and there is no error log.

Has anyone faced this weird experience?

Thank you.


Re: Spark UI Source Code

2018-05-09 Thread Marcelo Vanzin
(-dev)

The KVStore API is private to Spark; it's not really meant to be used
by others. You're free to try, and there are a lot of javadocs on the
different interfaces, but it's not a general-purpose database, so
you'll need to figure out things like that by yourself.
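
For completeness, a minimal sketch of the programmatic status tracker mentioned further down in this thread (it assumes a running SparkContext named sc); this is a supported way to read job state without touching the private KVStore:

val tracker = sc.statusTracker
for (jobId <- tracker.getActiveJobIds()) {
  // getJobInfo returns None if the job is unknown or already cleaned up.
  tracker.getJobInfo(jobId).foreach { info =>
    println(s"job $jobId: status=${info.status()}, stages=${info.stageIds().mkString(",")}")
  }
}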

On Tue, May 8, 2018 at 9:53 PM, Anshi Shrivastava
 wrote:
> Hi Marcelo, Dev,
>
> Thanks for your response.
> I have used SparkListeners to fetch the metrics (the public REST API uses
> the same), but to monitor these metrics over time I have to persist them
> (using the KVStore library of Spark). Is there a way to fetch data from this
> KVStore (which uses LevelDB for storage) and filter it on the basis of
> timestamp?
>
> Thanks,
> Anshi
>
> On Mon, May 7, 2018 at 9:51 PM, Marcelo Vanzin [via Apache Spark User List]
>  wrote:
>>
>> On Mon, May 7, 2018 at 1:44 AM, Anshi Shrivastava
>> <[hidden email]> wrote:
>> > I've found a KVStore wrapper which stores all the metrics in a LevelDB
>> > store. This KVStore wrapper is available as a Spark dependency, but we
>> > cannot access the metrics directly from Spark since they are all private.
>>
>> I'm not sure what it is you're trying to do exactly, but there's a
>> public REST API that exposes all the data Spark keeps about
>> applications. There's also a programmatic status tracker
>> (SparkContext.statusTracker) that's easier to use from within the
>> running Spark app, but has a lot less info.
>>
>> > Can we use this store to store our own metrics?
>>
>> No.
>>
>> > Also can we retrieve these metrics based on timestamp?
>>
>> Only if the REST API has that feature; I don't remember off the top of my
>> head.
>>
>>
>> --
>> Marcelo
>>



-- 
Marcelo




Fwd: Array[Double] two times slower than DenseVector

2018-05-09 Thread David Ignjić
Hello all,
I am currently looking into one Spark application to squeeze out a little
performance, and here is the code (attached in this email).

I looked into the difference, and in
org.apache.spark.sql.catalyst.CatalystTypeConverters.ArrayConverter,
even if the element type is primitive we still use the boxing/unboxing
version, because in
org.apache.spark.sql.catalyst.util.ArrayData#toArray
we don't use the method ArrayData.toDoubleArray the way it is used in VectorUDT.

Now the question is: do I need to provide a patch, or can someone show me
how to get the same performance with an array as with a dense vector?
Or should I create a JIRA ticket?


Thanks
import org.apache.spark.ml.linalg.{DenseVector, Vectors}
import org.apache.spark.sql.functions._
import scala.util.Random
import spark.implicits._

// Dot product over two DenseVector columns.
val dotVector = udf { (x: DenseVector, y: DenseVector) =>
  var i = 0
  var dotProduct = 0.0
  val size = x.size
  val v1 = x.values
  val v2 = y.values
  while (i < size) {
    dotProduct += v1(i) * v2(i)
    i += 1
  }
  dotProduct
}

// Dot product over two array<double> columns.
val dotSeq = udf { (x: Seq[Double], y: Seq[Double]) =>
  var i = 0
  var dotProduct = 0.0
  val size = x.size
  while (i < size) {
    dotProduct += x(i) * y(i)
    i += 1
  }
  dotProduct
}

// Elapsed time of `block` in nanoseconds scaled by 1/10; only the ratio
// between the two measurements below matters.
def time(name: String, block: => Unit): Float = {
  val t0 = System.nanoTime()
  block // call-by-name
  val t1 = System.nanoTime()
  (t1 - t0) / 10f
}

val densevector = udf { (p: Seq[Float]) => Vectors.dense(p.map(_.toDouble).toArray) }

// Deterministic 300-element random vector per (line, column) pair.
val genVec = udf { (l: Int, c: Int) =>
  val r = new Random(l * c)
  (1 to 300).map(_ => r.nextDouble()).toArray
}

val dfBig = {
  Seq(1).toDF("s")
    .withColumn("line", explode(lit((1 to 1000).toArray)))
    .withColumn("column", explode(lit((1 to 200).toArray)))
    .withColumn("v1", genVec(col("line").+(lit(22)).*(lit(-1)), col("column")))
    .withColumn("v2", genVec(col("line"), col("column")))
    .withColumn("v1d", densevector(col("v1")))
    .withColumn("v2d", densevector(col("v2")))
    .repartition(1)
    .persist()
}
dfBig.count
dfBig.show(10)

// Average over 20 runs: compute the dot product, sort, take the top 10.
val arrayTime = (1 to 20).map { _ =>
  time("array", dfBig.withColumn("dot", dotSeq(col("v1"), col("v2"))).sort(desc("dot")).limit(10).collect())
}.sum / 20
val vectorTime = (1 to 20).map { _ =>
  time("vector", dfBig.withColumn("dot", dotVector(col("v1d"), col("v2d"))).sort(desc("dot")).limit(10).collect())
}.sum / 20

// DenseVector time as a percentage of the array time.
vectorTime / arrayTime * 100

Invalid Spark URL: spark://HeartbeatReceiver@hostname

2018-05-09 Thread Serkan TAS
While trying to execute a Python script with PyCharm on Windows, I am
getting this error.

Does anyone have an idea about the error?

Spark version : 2.3.0

py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@





ENERJİSA


serkan@enerjisa.com
www.enerjisa.com.tr




Malformed URL Exception when connecting to Phoenix from Spark

2018-05-09 Thread Alchemist
Code:

JavaSparkContext sc = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

Map<String, String> map = new HashMap<>();
map.put("zkUrl", args[2]);
map.put("table", args[1]);
map.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");
// long cnt = sqlContext.read().format("org.apache.phoenix.spark").options(map).load().count();

DataFrameReader reader = sqlContext.read();
DataFrameReader readerM = reader.format("org.apache.phoenix.spark");
DataFrameReader readerM2 = readerM.options(map);
Dataset<Row> ds = readerM2.load();
ds.logicalPlan();
long cnt = ds.count();
// format("org.apache.phoenix.spark").options(map).load().count();
System.out.println(" cnt " + cnt);

Exception:

18/05/09 12:31:23 INFO RecoverableZooKeeper: Process identifier=hconnection-0x5ebbde60 connecting to ZooKeeper ensemble=10.16.129.152:2181
Exception in thread "main" java.sql.SQLNonTransientConnectionException: Cannot load connection class because of underlying exception: com.mysql.cj.core.exceptions.WrongArgumentException: Malformed database URL, failed to parse the main URL sections.
  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:526)
  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513)
  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505)
  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479)
  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:489)
  at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:72)
  at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:124)
  at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:224)
  at java.sql.DriverManager.getConnection(DriverManager.java:664)
  at java.sql.DriverManager.getConnection(DriverManager.java:208)
  at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
  at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
  at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
  at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:338)
  at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:118)
  at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
  at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:77)
  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:429)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  at PhoenixToDataFrame.main(PhoenixToDataFrame.java:41)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.mysql.cj.core.exceptions.UnableToConnectException: Cannot load connection class because of underlying exception: com.mysql.cj.core.exceptions.WrongArgumentException: Malformed database URL, failed to parse the main URL sections.
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at com.mysql.cj.core.exceptions.ExceptionFactory.createException(ExceptionFactory.java:54)
  at com.mysql.cj.core.exceptions.ExceptionFactory.createException(ExceptionFactory.java:93)
  ... 23 more
Caused by: com.mysql.cj.core.exceptions.WrongArgumentException: Malformed database URL, failed to parse the main URL sections.
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at com.mysql.cj.core.exceptions.ExceptionFactory.createException(ExceptionFactory.java:54)
  at

Spark 2.3.0 --files vs. addFile()

2018-05-09 Thread Marius

Hey,

I am using Spark to distribute the execution of a binary tool and to do
some further calculation downstream. I want to distribute the binary tool
using either the --files option or addFile from Spark to make it available
on each worker node. However, although it tells me that it added the file:

2018-05-09 07:42:19 INFO  SparkContext:54 - Added file s3a://executables/blastp at s3a://executables/foo with timestamp 1525851739972
2018-05-09 07:42:20 INFO  Utils:54 - Fetching s3a://executables/foo to /tmp/spark-54931ea6-b3d6-419b-997b-a498da898b77/userFiles-5e4b66e5-de4a-4420-a641-4453b9ea2ead/fetchFileTemp3437582648265876247.tmp


However, when I want to execute the tool using pipe, it does not work. I
currently assume that the file is only downloaded to the master node, but I
am not sure whether I misunderstood the concept of adding files in Spark or
did something wrong.
I am getting the path with SparkFiles.get(); the call itself works, but the
binary is not there.
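
(For what it's worth, a minimal sketch of how I understand the intended usage — an assumption, with a hypothetical input RDD and tool arguments, and assuming a SparkContext named sc and an executable file named "tool":)

import org.apache.spark.SparkFiles
import scala.sys.process._

// Ship the binary to every node that runs tasks for this application.
sc.addFile("s3a://executables/tool")

val results = sc.parallelize(Seq("input1", "input2"), 2).mapPartitions { iter =>
  // Resolve the path on the executor, not on the driver: SparkFiles.get
  // returns the node-local copy fetched for this application.
  val toolPath = SparkFiles.get("tool")
  iter.map(arg => Seq(toolPath, arg).!!.trim)
}

results.collect().foreach(println)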


This is my call:

spark-submit \
  --class de.jlu.bioinfsys.sparkBlast.run.Run \
  --master $master \
  --jars ${awsPath},${awsJavaSDK} \
  --files s3a://database/a.a.z,s3a://database/a.a.y,s3a://database/a.a.x,s3a://executables/tool \
  --conf spark.executor.extraClassPath=${awsPath}:${awsJavaSDK} \
  --conf spark.driver.extraClassPath=${awsPath}:${awsJavaSDK} \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.computational.bio.uni-giessen.de/ \
  --conf spark.hadoop.fs.s3a.access.key=$s3Access \
  --conf spark.hadoop.fs.s3a.secret.key=$s3Secret \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  ${execJarPath}

I am using Spark v2.3.0 with Scala on a standalone cluster with three
workers.


Cheers
Marius





Spark 2.3.0 Structured Streaming Kafka Timestamp

2018-05-09 Thread Yuta Morisawa

Hi All

I'm trying to extract Kafka-timestamp from Kafka topics.

The timestamp does not contain millisecond information, but it should,
because the ConsumerRecord class of Kafka 0.10 supports millisecond
timestamps.


How can I get a millisecond timestamp from Kafka topics?


These are the websites I refer to.

https://spark.apache.org/docs/2.3.0/structured-streaming-kafka-integration.html

https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/streams/processor/TimestampExtractor.html


And this is my code.

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
  .selectExpr("CAST(timestamp AS LONG)", "CAST(value AS STRING)")
  .as[(Long, String)]
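
(For reference, a sketch of one possible workaround — an assumption, not a confirmed answer. The Kafka source exposes `timestamp` as a Spark TimestampType, and casting it straight to LONG truncates to whole seconds, so casting through DOUBLE first keeps the sub-second part:)

val dfMillis = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
  // CAST(timestamp AS DOUBLE) yields epoch seconds with a fractional part;
  // scaling by 1000 and truncating gives epoch milliseconds.
  .selectExpr("CAST(CAST(timestamp AS DOUBLE) * 1000 AS LONG) AS ts_millis",
              "CAST(value AS STRING)")
  .as[(Long, String)]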


Regards,
Yuta

