SPARK-19547

2017-06-07 Thread Rastogi, Pankaj
Hi,
 I have been trying to distribute Kafka topics among different instances of
the same consumer group. I am using the KafkaDirectStream API for creating
DStreams. After the second consumer instance comes up, Kafka does a partition
rebalance and then the Spark driver of the first consumer dies with the
following exception:

java.lang.IllegalStateException: No current assignment for partition myTopic_5-0
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:264)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:336)
at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1236)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
at org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.TransformedDStream.createRDDWithLocalProperties(TransformedDStream.scala:65)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
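
For reference, a minimal sketch of the kind of setup described above: two
application instances creating a direct stream with the same group.id so that
Kafka rebalances partitions between them. This is not the original poster's
code; the broker address, group id and batch interval are assumptions, and only
the topic name is taken from the exception.

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-direct-sketch"), Seconds(10))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker1:9092",       // assumed broker
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "myConsumerGroup",             // same id in every instance
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest"
    )

    // Subscribe (rather than Assign fixed partitions) so the broker can
    // rebalance partitions when a second instance of the group joins.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("myTopic_5"), kafkaParams)
    )

    stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}

The latestOffsets/seekToEnd frames in the trace above belong to the driver-side
consumer that this direct stream creates; the rebalance triggered by the second
instance revokes partitions from that consumer, which is presumably where the
"No current assignment" state arises.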

Spark History Server does not redirect to Yarn aggregated logs for container logs

2017-06-07 Thread ckhari4u
Hey Guys,

I am hitting the below issue when trying to access the STDOUT/STDERR logs in
the Spark History Server for the executors of a Spark application executed in
YARN mode. I have enabled YARN log aggregation.

Repro Steps:

1) Run the spark-shell in yarn-client mode, or run the Pi job in YARN mode.
2) Once the job is completed (in the case of spark-shell, exit after doing
some simple operations), try to access the STDOUT or STDERR logs of the
application from the Executors tab in the Spark History Server UI.
3) If YARN log aggregation is enabled, the logs won't be available in the
NodeManager's log location. But the History Server still tries to access the
logs from the NodeManager's local log location
({yarn.nodemanager.log-dirs}/application_${appid}), giving the below error
in the UI:



Failed redirect for container_e31_1496881617682_0003_01_02
Failed while trying to construct the redirect url to the log server. Log
Server url may not be configured
java.lang.Exception: Unknown container. Container either has not started or
has already completed or doesn't belong to this node at all.


Either the Spark History Server should be able to read from the aggregated
logs and display them in the UI, or it should give a graceful message. As of
now it redirects to the NM webpage and tries to fetch the logs from the
NodeManager's local location.
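
For context, a hedged yarn-site.xml sketch of the settings involved in that
redirect; the log-server host and aggregation directory below are assumptions,
not values from this report. The "Log Server url" the error complains about
corresponds to yarn.log.server.url, which the NodeManager needs in order to
redirect to aggregated logs:

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <!-- Where NodeManagers upload container logs once the application finishes. -->
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value>
</property>
<property>
  <!-- The log server the failed redirect is trying to build a URL for;
       typically the MapReduce JobHistoryServer's log endpoint. -->
  <name>yarn.log.server.url</name>
  <value>http://historyserver-host:19888/jobhistory/logs</value>
</property>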








Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-07 Thread vaquar khan
+1 non-binding

Regards,
vaquar khan

On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
wrote:

+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
-Phive-thriftserver -Pscala-2.11 on

   - Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
   - macOS 10.12.5 Java 8 (build 1.8.0_131)


On 5 June 2017 at 21:14, Michael Armbrust  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc4
>  (377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1241/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>
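
For the "How can I help test this release?" item in the quoted vote mail, one
low-effort way is to point an existing workload at the staging repository
listed above. A hypothetical sbt sketch (the project name and Scala patch
version are assumptions):

name := "spark-2.2.0-rc4-smoke-test"

scalaVersion := "2.11.8"

// Resolver pointing at the RC4 staging repository from the vote email.
resolvers += "Spark 2.2.0 RC4 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1241/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)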


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-07 Thread Ricardo Almeida
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
-Phive-thriftserver -Pscala-2.11 on

   - Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
   - macOS 10.12.5 Java 8 (build 1.8.0_131)


On 5 June 2017 at 21:14, Michael Armbrust  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc4
>  (377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1241/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>


Will higher order functions in spark SQL be pushed upstream?

2017-06-07 Thread Antoine HOM
Hey guys,

Databricks released higher-order functions as part of their Runtime
3.0 beta
(https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html),
which make it easier to work with arrays within SQL statements.
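
For readers who have not seen the feature, a paraphrased sketch in the spirit
of the linked post: a lambda applied to each element of an array column
directly in SQL. The table and column names are made up, and the transform
syntax below ran on Databricks Runtime 3.0, not on upstream Spark SQL at the
time of this thread.

import org.apache.spark.sql.SparkSession

object HigherOrderFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hof-sketch").getOrCreate()
    // Increment every element of the "values" array column without exploding it.
    spark.sql(
      """SELECT key,
        |       transform(values, v -> v + 1) AS values_plus_one
        |FROM nested_events""".stripMargin
    ).show()
  }
}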

* As a heavy user of complex data types I was wondering if there was
any plan to push those changes upstream?
* In addition, I was wondering whether, as part of this change, it also tries
to solve the column pruning / filter pushdown issues with complex
data types?

Thanks!




Re: Is static volatile variable different with static variable in the closure?

2017-06-07 Thread Sean Owen
static and volatile are unrelated. Being volatile doesn't change the
properties of the variable with respect to being static.

On Wed, Jun 7, 2017 at 4:01 PM Chang Chen  wrote:

> A static variable will be initialized in the worker node's JVM and will not
> be serialized from the master. But how about a static volatile variable?
>
> Recently I read the Beam Spark runner code, and I found that they use a
> static volatile Broadcast variable. See
> https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/util/GlobalWatermarkHolder.java#L46
>
> It confused me a lot: is a static volatile variable different from a static
> variable in the closure?
>


Is static volatile variable different with static variable in the closure?

2017-06-07 Thread Chang Chen
A static variable will be initialized in the worker node's JVM and will not be
serialized from the master. But how about a static volatile variable?

Recently I read the Beam Spark runner code, and I found that they use a static
volatile Broadcast variable. See
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/util/GlobalWatermarkHolder.java#L46

It confused me a lot: is a static volatile variable different from a static
variable in the closure?
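
A hedged Scala sketch of the pattern being asked about, translated from Java's
static volatile field into a Scala object with a @volatile var; the names are
made up and this is not the actual GlobalWatermarkHolder code. The holder field
is per-JVM state that is initialized wherever it is assigned, not shipped with
the closure, and volatile only affects visibility between threads inside that
same JVM.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Made-up illustration of a "static volatile Broadcast" holder: a per-JVM
// singleton field (the Scala analogue of a Java static field) caching a
// Broadcast handle. @volatile makes writes visible to other threads in the
// same JVM; it does not cause the field to be serialized into closures.
object WatermarkHolderSketch {
  @volatile private var broadcast: Broadcast[Long] = _

  // Driver-side: (re)publish a value as a broadcast.
  def publish(sc: SparkContext, watermark: Long): Unit = {
    broadcast = sc.broadcast(watermark)
  }

  // Returns the handle if it has been set in this JVM, otherwise None.
  def get: Option[Broadcast[Long]] = Option(broadcast)
}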