Re: The Spark email setting should be updated

2023-04-19 Thread Jonathan Kelly
In Gmail, if I click the Reply button at the bottom of this thread, it
defaults to sending the reply only to the individual who sent the last
message. Similarly, if I click the Reply arrow button to the right of each
message, it responds only to the person who sent that message.

In order to respond to the list, I had to click "Reply All", move the list
address to the To field, and remove everybody else.
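
As an illustrative aside, whether a client defaults the reply to the list or
to the individual sender usually comes down to whether the list server adds a
Reply-To header. A quick way to check on any message saved from this list
(the file name below is just a placeholder):

  # Save the raw message (e.g. Gmail's "Show original") and inspect its headers:
  grep -i -E '^(Reply-To|List-Post):' saved-dev-message.eml
  # A list configured to direct replies back to itself would include a line like
  #   Reply-To: dev@spark.apache.org
  # which would explain why other d...@xxx.apache.org lists default replies to the list.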

Is this the same issue you are talking about, Jia?

~ Jonathan Kelly

On Wed, Apr 19, 2023 at 3:29 PM Rui Wang  wrote:

> I am replying now and the default address is dev@spark.apache.org.
>
>
> -Rui
>
> On Mon, Apr 17, 2023 at 4:27 AM Jia Fan  wrote:
>
>> Hi, everyone.
>>
>> I've noticed that every time I reply to the dev mailing list, the default
>> reply address is the sender of the mail, not dev@spark.apache.org. Several
>> times this led me to believe that my reply had reached dev when it actually
>> hadn't. This does not seem to be a common problem: when I reply to emails
>> from other communities, the default reply address is d...@xxx.apache.org.
>> Could Spark adjust the corresponding settings to reduce the chance of
>> developers replying to the wrong address?
>>
>> Thanks
>>
>>
>> 
>>
>>
>> Jia Fan
>>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-03 Thread Jonathan Kelly
Small correction: I found a mention of it on
https://github.com/apache/spark/pull/39807 from a month ago.

On Fri, Mar 3, 2023 at 9:44 AM Jonathan Kelly 
wrote:

> So did I... :-( However, there had been no new JIRA issue or PR that
> mentioned this test case specifically until
> https://issues.apache.org/jira/browse/SPARK-42665, created just a minute
> ago.
>
> On Fri, Mar 3, 2023 at 5:13 AM Sean Owen  wrote:
>
>> Oh OK, I thought this RC was meant to fix that.
>>
>> On Fri, Mar 3, 2023 at 12:35 AM Jonathan Kelly 
>> wrote:
>>
>>> I see that one too but have not investigated it myself. In the RC1
>>> thread, it was mentioned that this occurs when running the tests via Maven
>>> but not via SBT. Does the test class path get set up differently when
>>> running via SBT vs. Maven?
>>>
>>> On Thu, Mar 2, 2023 at 5:37 PM Sean Owen  wrote:
>>>
>>>> Thanks, that's good to know. The workaround (deleting the thriftserver
>>>> target dir) works for me. Who knows?
>>>>
>>>> But I'm also still seeing:
>>>>
>>>> - simple udf *** FAILED ***
>>>>   io.grpc.StatusRuntimeException: INTERNAL:
>>>> org.apache.spark.sql.ClientE2ETestSuite
>>>>   at io.grpc.Status.asRuntimeException(Status.java:535)
>>>>   at
>>>> io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
>>>>   at org.apache.spark.sql.connect.client.SparkResult.org
>>>> $apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
>>>>   at
>>>> org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
>>>>   at
>>>> org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
>>>>   at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
>>>>   at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
>>>>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
>>>>   at
>>>> org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
>>>>   at
>>>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>>>
>>>> On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
>>>> wrote:
>>>>
>>>>> Yes, this issue has driven me quite crazy as well! I hit this issue
>>>>> for a long time when compiling the master branch and running tests.
>>>>> Strangely, it would only occur, as you say, when running the tests and not
>>>>> during an initial build that skips running the tests. (However, I have 
>>>>> seen
>>>>> instances where it does occur even in the initial build with tests 
>>>>> skipped,
>>>>> but only on AWS CodeBuild, not when building locally or on Amazon Linux.)
>>>>>
>>>>> I thought for a long time that I was alone in this bizarre issue, but
>>>>> I eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
>>>>> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but
>>>>> both are unfortunately still open.
>>>>>
>>>>> I found at one point that the issue magically disappeared once
>>>>> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
>>>>> Upgrade scala-maven-plugin to 4.8.0
>>>>> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>>>>>  was
>>>>> merged, but then it cropped back up again at some point after that, and I
>>>>> used git bisect to find that the issue appeared again when
>>>>> [SPARK-27561] <https://issues.apache.org/jira/browse/SPARK-27561>[SQL]
>>>>> Support implicit lateral column alias resolution on Project
>>>>> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>>>>>  was
>>>>> merged. This commit didn't even directly affect anything in
>>>>> hive-thriftserver, but it does make some pretty big changes to pretty core
>>>>> classes in sql/catalyst, so it's not too surprising that this could 
>>>>> trigger
>>>>> an issue that seems to have to do with "very complicated inheritance
>>>>> hierarchies involving both Java and Scala", which is a phrase mentioned on
>>>>> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-03 Thread Jonathan Kelly
So did I... :-( However, there had been no new JIRA issue or PR that
mentioned this test case specifically until
https://issues.apache.org/jira/browse/SPARK-42665, created just a minute
ago.

On Fri, Mar 3, 2023 at 5:13 AM Sean Owen  wrote:

> Oh OK, I thought this RC was meant to fix that.
>
> On Fri, Mar 3, 2023 at 12:35 AM Jonathan Kelly 
> wrote:
>
>> I see that one too but have not investigated it myself. In the RC1
>> thread, it was mentioned that this occurs when running the tests via Maven
>> but not via SBT. Does the test class path get set up differently when
>> running via SBT vs. Maven?
>>
>> On Thu, Mar 2, 2023 at 5:37 PM Sean Owen  wrote:
>>
>>> Thanks, that's good to know. The workaround (deleting the thriftserver
>>> target dir) works for me. Who knows?
>>>
>>> But I'm also still seeing:
>>>
>>> - simple udf *** FAILED ***
>>>   io.grpc.StatusRuntimeException: INTERNAL:
>>> org.apache.spark.sql.ClientE2ETestSuite
>>>   at io.grpc.Status.asRuntimeException(Status.java:535)
>>>   at
>>> io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
>>>   at org.apache.spark.sql.connect.client.SparkResult.org
>>> $apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
>>>   at
>>> org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
>>>   at
>>> org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
>>>   at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
>>>   at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
>>>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
>>>   at
>>> org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
>>>   at
>>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>>
>>> On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
>>> wrote:
>>>
>>>> Yes, this issue has driven me quite crazy as well! I hit this issue for
>>>> a long time when compiling the master branch and running tests. Strangely,
>>>> it would only occur, as you say, when running the tests and not during an
>>>> initial build that skips running the tests. (However, I have seen instances
>>>> where it does occur even in the initial build with tests skipped, but only
>>>> on AWS CodeBuild, not when building locally or on Amazon Linux.)
>>>>
>>>> I thought for a long time that I was alone in this bizarre issue, but I
>>>> eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
>>>> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but
>>>> both are unfortunately still open.
>>>>
>>>> I found at one point that the issue magically disappeared once
>>>> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
>>>> Upgrade scala-maven-plugin to 4.8.0
>>>> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>>>>  was
>>>> merged, but then it cropped back up again at some point after that, and I
>>>> used git bisect to find that the issue appeared again when
>>>> [SPARK-27561] <https://issues.apache.org/jira/browse/SPARK-27561>[SQL]
>>>> Support implicit lateral column alias resolution on Project
>>>> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>>>>  was
>>>> merged. This commit didn't even directly affect anything in
>>>> hive-thriftserver, but it does make some pretty big changes to pretty core
>>>> classes in sql/catalyst, so it's not too surprising that this could trigger
>>>> an issue that seems to have to do with "very complicated inheritance
>>>> hierarchies involving both Java and Scala", which is a phrase mentioned on
>>>> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.
>>>>
>>>> One thing that I did find to help was to
>>>> delete sql/hive-thriftserver/target between building Spark and running the
>>>> tests. This helps in my builds where the issue only occurs during the
>>>> testing phase and not during the initial build phase, but of course it
>>>> doesn't help in my builds where the issue occurs during that first build
>>>> phase.
>>>>
>>>> ~ Jonathan Kelly
>>>>
>>>> On Th

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Jonathan Kelly
I see that one too but have not investigated it myself. In the RC1 thread,
it was mentioned that this occurs when running the tests via Maven but not
via SBT. Does the test class path get set up differently when running via
SBT vs. Maven?
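
One way to compare the two directly is to run just that suite under each build
system. A sketch (the Maven module path matches the target directory seen in
the stack traces; the sbt project name connect-client-jvm is an assumption):

  # Maven (scalatest-maven-plugin), assuming the rest of Spark is already built/installed:
  ./build/mvn -pl connector/connect/client/jvm test \
    -Dtest=none -DwildcardSuites=org.apache.spark.sql.ClientE2ETestSuite
  # SBT equivalent:
  ./build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.ClientE2ETestSuite"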

On Thu, Mar 2, 2023 at 5:37 PM Sean Owen  wrote:

> Thanks, that's good to know. The workaround (deleting the thriftserver
> target dir) works for me. Who knows?
>
> But I'm also still seeing:
>
> - simple udf *** FAILED ***
>   io.grpc.StatusRuntimeException: INTERNAL:
> org.apache.spark.sql.ClientE2ETestSuite
>   at io.grpc.Status.asRuntimeException(Status.java:535)
>   at
> io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
>   at org.apache.spark.sql.connect.client.SparkResult.org
> $apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
>   at
> org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
>   at
> org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
>   at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
>   at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
>   at
> org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>
> On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
> wrote:
>
>> Yes, this issue has driven me quite crazy as well! I hit this issue for a
>> long time when compiling the master branch and running tests. Strangely, it
>> would only occur, as you say, when running the tests and not during an
>> initial build that skips running the tests. (However, I have seen instances
>> where it does occur even in the initial build with tests skipped, but only
>> on AWS CodeBuild, not when building locally or on Amazon Linux.)
>>
>> I thought for a long time that I was alone in this bizarre issue, but I
>> eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
>> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but
>> both are unfortunately still open.
>>
>> I found at one point that the issue magically disappeared once
>> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
>> Upgrade scala-maven-plugin to 4.8.0
>> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>>  was
>> merged, but then it cropped back up again at some point after that, and I
>> used git bisect to find that the issue appeared again when [SPARK-27561]
>> <https://issues.apache.org/jira/browse/SPARK-27561>[SQL] Support
>> implicit lateral column alias resolution on Project
>> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>>  was
>> merged. This commit didn't even directly affect anything in
>> hive-thriftserver, but it does make some pretty big changes to pretty core
>> classes in sql/catalyst, so it's not too surprising that this could trigger
>> an issue that seems to have to do with "very complicated inheritance
>> hierarchies involving both Java and Scala", which is a phrase mentioned on
>> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.
>>
>> One thing that I did find to help was to
>> delete sql/hive-thriftserver/target between building Spark and running the
>> tests. This helps in my builds where the issue only occurs during the
>> testing phase and not during the initial build phase, but of course it
>> doesn't help in my builds where the issue occurs during that first build
>> phase.
>>
>> ~ Jonathan Kelly
>>
>> On Thu, Mar 2, 2023 at 1:47 PM Sean Owen  wrote:
>>
>>> Has anyone seen this behavior -- I've never seen it before. The Hive
>>> thriftserver module for me just goes into an infinite loop when running
>>> tests:
>>>
>>> ...
>>> [INFO] done compiling
>>> [INFO] compiling 22 Scala sources and 24 Java sources to
>>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>>> ...
>>> [INFO] done compiling
>>> [INFO] compiling 22 Scala sources and 9 Java sources to
>>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>>> ...
>>> [WARNING] [Warn]
>>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:25:29:
>>>  [deprecation] GnuParser in org.apache.commons.cli has been deprecated
>>> [WARNING] [Warn]
>>> /

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Jonathan Kelly
Yes, this issue has driven me quite crazy as well! I hit this issue for a
long time when compiling the master branch and running tests. Strangely, it
would only occur, as you say, when running the tests and not during an
initial build that skips running the tests. (However, I have seen instances
where it does occur even in the initial build with tests skipped, but only
on AWS CodeBuild, not when building locally or on Amazon Linux.)

I thought for a long time that I was alone in this bizarre issue, but I
eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but both
are unfortunately still open.

I found at one point that the issue magically disappeared once [SPARK-41408]
<https://issues.apache.org/jira/browse/SPARK-41408> [BUILD] Upgrade
scala-maven-plugin to 4.8.0
<https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
was merged, but then it cropped back up again at some point after that, and I
used git bisect to find that the issue appeared again when [SPARK-27561]
<https://issues.apache.org/jira/browse/SPARK-27561> [SQL] Support implicit
lateral column alias resolution on Project
<https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
was merged. This commit didn't even directly affect anything in
hive-thriftserver, but it does make some pretty big changes to pretty core
classes in sql/catalyst, so it's not too surprising that this could trigger
an issue that seems to have to do with "very complicated inheritance
hierarchies involving both Java and Scala", which is a phrase mentioned on
sbt#6183 <https://github.com/sbt/sbt/issues/6183>.

One thing that I did find to help was to
delete sql/hive-thriftserver/target between building Spark and running the
tests. This helps in my builds where the issue only occurs during the
testing phase and not during the initial build phase, but of course it
doesn't help in my builds where the issue occurs during that first build
phase.
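
For concreteness, that workaround as a command sequence (a sketch; build
profiles and extra flags omitted):

  ./build/mvn -DskipTests clean package   # initial build with tests skipped
  rm -rf sql/hive-thriftserver/target     # drop the module's stale output between build and test
  ./build/mvn test                        # then run the tests as usual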

~ Jonathan Kelly

On Thu, Mar 2, 2023 at 1:47 PM Sean Owen  wrote:

> Has anyone seen this behavior -- I've never seen it before. The Hive
> thriftserver module for me just goes into an infinite loop when running
> tests:
>
> ...
> [INFO] done compiling
> [INFO] compiling 22 Scala sources and 24 Java sources to
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
> ...
> [INFO] done compiling
> [INFO] compiling 22 Scala sources and 9 Java sources to
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
> ...
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:25:29:
>  [deprecation] GnuParser in org.apache.commons.cli has been deprecated
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java:333:18:
>  [deprecation] authorize(UserGroupInformation,String,Configuration) in
> ProxyUsers has been deprecated
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:110:16:
>  [deprecation] HIVE_SERVER2_THRIFT_HTTP_COOKIE_IS_SECURE in ConfVars has
> been deprecated
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:553:53:
>  [deprecation] HttpUtils in javax.servlet.http has been deprecated
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:185:24:
>  [deprecation] OptionBuilder in org.apache.commons.cli has been deprecated
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:187:10:
>  [static] static method should be qualified by type name, OptionBuilder,
> instead of by an expression
> [WARNING] [Warn]
> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:197:26:
>  [deprecation] GnuParser in org.apache.commons.cli has been deprecated
> ...
>
> ... repeated over and over.
>
> On Thu, Mar 2, 2023 at 6:04 AM Xinrong Meng 
> wrote:
>
>> Please vote on releasing the following candidate(RC2) as Apache Spark
>> version 3.4.0.
>>
>> The vote is open until 11:59pm Pacific time *March 7th* and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Jonathan Kelly
Thanks! I was wondering about that ClientE2ETestSuite failure today, so I'm
glad to know that it's also being experienced by others.

On a similar note, I am experiencing the following error when running the
Python tests with Python 3.7:

+ ./python/run-tests --python-executables=python3
Running PySpark tests. Output is in
/home/ec2-user/spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-connect', 'pyspark-core',
'pyspark-errors', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas',
'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql',
'pyspark-streaming']
python3 python_implementation is CPython
python3 version is: Python 3.7.16
Starting test(python3): pyspark.ml.tests.test_feature (temp output:
/home/ec2-user/spark/python/target/8ca9ab1a-05cc-4845-bf89-30d9001510bc/python3__pyspark.ml.tests.test_feature__kg6sseie.log)
Starting test(python3): pyspark.ml.tests.test_base (temp output:
/home/ec2-user/spark/python/target/f2264f3b-6b26-4e61-9452-8d6ddd7eb002/python3__pyspark.ml.tests.test_base__0902zf9_.log)
Starting test(python3): pyspark.ml.tests.test_algorithms (temp output:
/home/ec2-user/spark/python/target/d1dc4e07-e58c-4c03-abe5-09d8fab22e6a/python3__pyspark.ml.tests.test_algorithms__lh3wb2u8.log)
Starting test(python3): pyspark.ml.tests.test_evaluation (temp output:
/home/ec2-user/spark/python/target/3f42dc79-c945-4cf2-a1eb-83e72b40a9ee/python3__pyspark.ml.tests.test_evaluation__89idc7fa.log)
Finished test(python3): pyspark.ml.tests.test_base (16s)
Starting test(python3): pyspark.ml.tests.test_functions (temp output:
/home/ec2-user/spark/python/target/5a3b90f0-216b-4edd-9d15-6619d3e03300/python3__pyspark.ml.tests.test_functions__g5u1290s.log)
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/ec2-user/spark/python/pyspark/ml/tests/test_functions.py",
line 21, in <module>
from pyspark.ml.functions import predict_batch_udf
  File "/home/ec2-user/spark/python/pyspark/ml/functions.py", line 38, in
<module>
from typing import Any, Callable, Iterator, List, Mapping, Protocol,
TYPE_CHECKING, Tuple, Union
ImportError: cannot import name 'Protocol' from 'typing'
(/usr/lib64/python3.7/typing.py)
Had test failures in pyspark.ml.tests.test_functions with python3; see logs.

I know we should move on to a newer version of Python, but isn't Python 3.7
still officially supported?
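
For reference, typing.Protocol only exists as of Python 3.8, which is why the
import above fails on 3.7. A quick, illustrative check of the interpreter that
run-tests picked up:

  python3 --version                          # reports 3.7.16 in the run above
  python3 -c "from typing import Protocol"   # raises the same ImportError on 3.7
  # On 3.7 the backport would be needed instead, e.g.:
  #   python3 -c "from typing_extensions import Protocol"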

Thank you,
Jonathan Kelly

On Wed, Feb 22, 2023 at 1:47 PM Herman van Hovell
 wrote:

> Hi All,
>
> Thanks for testing the 3.4.0 RC! I apologize for the maven testing
> failures for the Spark Connect Scala Client. We will try to get those
> sorted as soon as possible.
>
> This is an artifact of having multiple build systems, and only running CI
> for one (SBT). That, however, is a debate for another day :)...
>
> Cheers,
> Herman
>
> On Wed, Feb 22, 2023 at 5:32 PM Bjørn Jørgensen 
> wrote:
>
>> ./build/mvn clean package
>>
>> I'm using ubuntu rolling, python 3.11 openjdk 17
>>
>> CompatibilitySuite:
>> - compatibility MiMa tests *** FAILED ***
>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>   at scala.Predef$.assert(Predef.scala:223)
>>   at
>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
>>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>>   ...
>> - compatibility API tests: Dataset *** FAILED ***
>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>   at scala.Predef$.assert(Predef.scala:223)
>>   at
>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>   at
>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>   at
&

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-11 Thread Jonathan Kelly
Ah, OK, I didn't realize that you were waiting for something. Will the
v3.3.1-rc3 tag be moved once SPARK-40703 is out? (Is that even possible?)
Or will you just cut rc4 eventually and never vote on rc3?

On Tue, Oct 11, 2022 at 1:14 PM Dongjoon Hyun 
wrote:

> Yes, that's the current status.
>
> FYI, 3.3.1-rc3 tag was created 6 days ago but the vote was not started
> because we are waiting for
>
> https://issues.apache.org/jira/browse/SPARK-40703
>
> Chao Sun pinged the release manager 4 days ago and has been working on it.
> Now, his PR is ready for 3.3.1 release here.
>
> https://github.com/apache/spark/pull/38196/files
>
> BTW, thank you for asking the question, Jonathan.
>
> Dongjoon.
>
>
> On Tue, Oct 11, 2022 at 12:06 PM Jonathan Kelly 
> wrote:
>
>> Yep, makes sense. Thanks for the quick response!
>>
>> On Tue, Oct 11, 2022 at 12:04 PM Sean Owen  wrote:
>>
>>> Actually yeah that is how the release vote works by default at Apache:
>>> https://www.apache.org/foundation/voting.html#ReleaseVotes
>>>
>>> However I would imagine there is broad consent to just roll another RC
>>> if there's any objection or -1. We could formally re-check the votes, as I
>>> think the +1s would agree, but think we've defaulted into accepting a
>>> 'veto' if there are otherwise no objections.
>>>
>>> On Tue, Oct 11, 2022 at 2:01 PM Jonathan Kelly 
>>> wrote:
>>>
>>>> Hi, Yuming,
>>>>
>>>> In your original email, you said that the vote "passes if a majority +1
>>>> PMC votes are cast, with a minimum of 3 +1 votes". There were four +1 votes
>>>> (all from PMC members) and one -1 (also from a PMC member), so shouldn't
>>>> the vote pass because both requirements (majority +1 and minimum of 3 +1
>>>> votes) were met?
>>>>
>>>> I don't personally mind either way if the vote is considered passed or
>>>> failed (and I see you've already cut the v3.3.1-rc3 tag but haven't started
>>>> the new vote yet), but I just wanted to ask for clarification on the
>>>> requirements.
>>>>
>>>> Thank you,
>>>> Jonathan Kelly
>>>>
>>>>
>>>> On Wed, Oct 5, 2022 at 7:49 PM Yuming Wang  wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Thank you all for testing and voting!
>>>>>
>>>>> There's a -1 vote here, so I think this RC fails. I will prepare for
>>>>> RC3 soon.
>>>>>
>>>>> On Tue, Oct 4, 2022 at 6:34 AM Mridul Muralidharan 
>>>>> wrote:
>>>>>
>>>>>> +1 from me, with a few comments.
>>>>>>
>>>>>> I saw the following failures, are these known issues/flakey tests ?
>>>>>>
>>>>>> * PersistenceEngineSuite.ZooKeeperPersistenceEngine
>>>>>> Looks like a port conflict issue from a quick look into logs
>>>>>> (conflict with starting admin port at 8080) - is this expected behavior 
>>>>>> for
>>>>>> the test ?
>>>>>> I worked around it by shutting down the process which was using the
>>>>>> port - though did not investigate deeply.
>>>>>>
>>>>>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite was
>>>>>> aborted
>>>>>> It is expecting these artifacts in $HOME/.m2/repository
>>>>>>
>>>>>> 1. tomcat#jasper-compiler;5.5.23!jasper-compiler.jar
>>>>>> 2. tomcat#jasper-runtime;5.5.23!jasper-runtime.jar
>>>>>> 3. commons-el#commons-el;1.0!commons-el.jar
>>>>>> 4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar
>>>>>>
>>>>>> I worked around it by adding them locally explicitly - we should
>>>>>> probably  add them as test dependency ?
>>>>>> Not sure if this changed in this release though (I had cleaned my
>>>>>> local .m2 recently)
>>>>>>
>>>>>> Other than this, rest looks good to me.
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 28, 2022 at 2:56 PM Sean Owen  wrote:
>>>>>>
>>>>>>> +1 from me, same result as last RC.
>>>>>>>
>>>>>>> On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang 
>>>>>>

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-11 Thread Jonathan Kelly
Yep, makes sense. Thanks for the quick response!

On Tue, Oct 11, 2022 at 12:04 PM Sean Owen  wrote:

> Actually yeah that is how the release vote works by default at Apache:
> https://www.apache.org/foundation/voting.html#ReleaseVotes
>
> However I would imagine there is broad consent to just roll another RC if
> there's any objection or -1. We could formally re-check the votes, as I
> think the +1s would agree, but think we've defaulted into accepting a
> 'veto' if there are otherwise no objections.
>
> On Tue, Oct 11, 2022 at 2:01 PM Jonathan Kelly 
> wrote:
>
>> Hi, Yuming,
>>
>> In your original email, you said that the vote "passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes". There were four +1 votes
>> (all from PMC members) and one -1 (also from a PMC member), so shouldn't
>> the vote pass because both requirements (majority +1 and minimum of 3 +1
>> votes) were met?
>>
>> I don't personally mind either way if the vote is considered passed or
>> failed (and I see you've already cut the v3.3.1-rc3 tag but haven't started
>> the new vote yet), but I just wanted to ask for clarification on the
>> requirements.
>>
>> Thank you,
>> Jonathan Kelly
>>
>>
>> On Wed, Oct 5, 2022 at 7:49 PM Yuming Wang  wrote:
>>
>>> Hi All,
>>>
>>> Thank you all for testing and voting!
>>>
>>> There's a -1 vote here, so I think this RC fails. I will prepare for
>>> RC3 soon.
>>>
>>> On Tue, Oct 4, 2022 at 6:34 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>> +1 from me, with a few comments.
>>>>
>>>> I saw the following failures, are these known issues/flakey tests ?
>>>>
>>>> * PersistenceEngineSuite.ZooKeeperPersistenceEngine
>>>> Looks like a port conflict issue from a quick look into logs (conflict
>>>> with starting admin port at 8080) - is this expected behavior for the test 
>>>> ?
>>>> I worked around it by shutting down the process which was using the
>>>> port - though did not investigate deeply.
>>>>
>>>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite was aborted
>>>> It is expecting these artifacts in $HOME/.m2/repository
>>>>
>>>> 1. tomcat#jasper-compiler;5.5.23!jasper-compiler.jar
>>>> 2. tomcat#jasper-runtime;5.5.23!jasper-runtime.jar
>>>> 3. commons-el#commons-el;1.0!commons-el.jar
>>>> 4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar
>>>>
>>>> I worked around it by adding them locally explicitly - we should
>>>> probably  add them as test dependency ?
>>>> Not sure if this changed in this release though (I had cleaned my local
>>>> .m2 recently)
>>>>
>>>> Other than this, rest looks good to me.
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>> On Wed, Sep 28, 2022 at 2:56 PM Sean Owen  wrote:
>>>>
>>>>> +1 from me, same result as last RC.
>>>>>
>>>>> On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang  wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>>>> 3.3.1.
>>>>>>
>>>>>> The vote is open until 11:59pm Pacific time October 3th and passes if a 
>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.3.1
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see https://spark.apache.org
>>>>>>
>>>>>> The tag to be voted on is v3.3.1-rc2 (commit 
>>>>>> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
>>>>>> https://github.com/apache/spark/tree/v3.3.1-rc2
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachesp

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-11 Thread Jonathan Kelly
Hi, Yuming,

In your original email, you said that the vote "passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes". There were four +1 votes
(all from PMC members) and one -1 (also from a PMC member), so shouldn't
the vote pass because both requirements (majority +1 and minimum of 3 +1
votes) were met?

I don't personally mind either way if the vote is considered passed or
failed (and I see you've already cut the v3.3.1-rc3 tag but haven't started
the new vote yet), but I just wanted to ask for clarification on the
requirements.

Thank you,
Jonathan Kelly


On Wed, Oct 5, 2022 at 7:49 PM Yuming Wang  wrote:

> Hi All,
>
> Thank you all for testing and voting!
>
> There's a -1 vote here, so I think this RC fails. I will prepare for
> RC3 soon.
>
> On Tue, Oct 4, 2022 at 6:34 AM Mridul Muralidharan 
> wrote:
>
>> +1 from me, with a few comments.
>>
>> I saw the following failures, are these known issues/flakey tests ?
>>
>> * PersistenceEngineSuite.ZooKeeperPersistenceEngine
>> Looks like a port conflict issue from a quick look into logs (conflict
>> with starting admin port at 8080) - is this expected behavior for the test ?
>> I worked around it by shutting down the process which was using the port
>> - though did not investigate deeply.
>>
>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite was aborted
>> It is expecting these artifacts in $HOME/.m2/repository
>>
>> 1. tomcat#jasper-compiler;5.5.23!jasper-compiler.jar
>> 2. tomcat#jasper-runtime;5.5.23!jasper-runtime.jar
>> 3. commons-el#commons-el;1.0!commons-el.jar
>> 4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar
>>
>> I worked around it by adding them locally explicitly - we should
>> probably  add them as test dependency ?
>> Not sure if this changed in this release though (I had cleaned my local
>> .m2 recently)
>>
>> Other than this, rest looks good to me.
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Sep 28, 2022 at 2:56 PM Sean Owen  wrote:
>>
>>> +1 from me, same result as last RC.
>>>
>>> On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang  wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>> 3.3.1.
>>>>
>>>> The vote is open until 11:59pm Pacific time October 3th and passes if a 
>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.3.1
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see https://spark.apache.org
>>>>
>>>> The tag to be voted on is v3.3.1-rc2 (commit 
>>>> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
>>>> https://github.com/apache/spark/tree/v3.3.1-rc2
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1421
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>>>>
>>>> The list of bug fixes going into 3.3.1 can be found at the following URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>>>>
>>>> This release is using the release script of the tag v3.3.1-rc2.
>>>>
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>> you can add the staging repository to your projects resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with a out of date RC going forward).
>>>>
>>>> ==

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Jonathan Kelly
The docs link from Reynold's initial email is apparently no longer valid.
He posted an updated link a little later in this same thread.

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs-updated/

On Tue, Jul 19, 2016 at 3:19 PM Holden Karau  wrote:

> -1 : The docs don't seem to be fully built (e.g.
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/streaming-programming-guide.html
> is a zero byte file currently) - although if this is a transient apache
> issue no worries.
>
> On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.0-rc4
>> (e5f8c1117e0c48499f54d62b556bc693435afae0).
>>
>> This release candidate resolves ~2500 issues:
>> https://s.apache.org/spark-2.0.0-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1192/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/
>>
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.x.
>>
>> ==
>> What justifies a -1 vote for this release?
>> ==
>> Critical bugs impacting major functionalities.
>>
>> Bugs already present in 1.x, missing features, or bugs related to new
>> features will not necessarily block this release. Note that historically
>> Spark documentation has been published on the website separately from the
>> main release so we do not need to block the release due to documentation
>> errors either.
>>
>>
>> Note: There was a mistake made during "rc3" preparation, and as a result
>> there is no "rc3", but only "rc4".
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-14 Thread Jonathan Kelly
I see that all blockers targeted for 2.0.0 have either been resolved or
downgraded. Do you have an ETA for the next RC?

Thanks,
Jonathan

On Mon, Jul 11, 2016 at 4:33 AM Sean Owen  wrote:

> Yeah there were already other blockers when the RC was released. This
> one was already noted in this thread. There will another RC soon I'm
> sure. I guess it would be ideal if the remaining blockers were
> resolved one way or the other before that, to make it possible that
> RC3 could be the final release:
>
> SPARK-14808 Spark MLlib, GraphX, SparkR 2.0 QA umbrella
> SPARK-14812 ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final,
> sealed audit
> SPARK-14813 ML 2.0 QA: API: Python API coverage
> SPARK-14816 Update MLlib, GraphX, SparkR websites for 2.0
> SPARK-14817 ML, Graph, R 2.0 QA: Programming guide update and migration
> guide
> SPARK-15124 R 2.0 QA: New R APIs and API docs
> SPARK-15623 2.0 python coverage ml.feature
> SPARK-15630 2.0 python coverage ml root module
>
> These are possibly all or mostly resolved already and have been
> knocking around a while.
>
> In any event, even a DoA RC3 might be useful if it kept up the testing.
>
> Sean
>
> On Mon, Jul 11, 2016 at 11:12 AM, Sun Rui  wrote:
> > -1
> > https://issues.apache.org/jira/browse/SPARK-16379
> >
> > On Jul 6, 2016, at 19:28, Maciej Bryński  wrote:
> >
> > -1
> > https://issues.apache.org/jira/browse/SPARK-16379
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Jonathan Kelly
I'm not sure, but I think it's
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2.
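
If that guess is right, grabbing it locally to poke around should just be a
matter of (a sketch; the URL and branch are the guess above, not an official
location):

  git clone --branch release-1.2.1-spark2 https://github.com/JoshRosen/hive.git
  cd hive && git log --oneline -5   # check whether it carries the Spark-specific patches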

It would be really nice though to have this whole process better documented
and more "official" than just building from somebody's personal fork of
Hive.

Or is there some way that the Spark community could contribute back these
changes to Hive in such a way that they would accept them into trunk? Then
Spark could depend upon an official version of Hive rather than this fork.

~ Jonathan

On Thu, Jul 7, 2016 at 11:46 AM Marcelo Vanzin  wrote:

> (Actually that's "spark" and not "spark2", so yeah, that doesn't
> really answer the question.)
>
> On Thu, Jul 7, 2016 at 11:38 AM, Marcelo Vanzin 
> wrote:
> > My guess would be
> https://github.com/pwendell/hive/tree/release-1.2.1-spark
> >
> > On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang  wrote:
> >> I saw the pom file having hive version as
> >> 1.2.1.spark2. But I cannot find the branch in
> >> https://github.com/pwendell/
> >>
> >> Does anyone know where the repo is?
> >>
> >> Thanks.
> >>
> >> Zhan Zhang
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
> >> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Jonathan Kelly
+1

On Wed, Jun 22, 2016 at 10:41 AM Tim Hunter 
wrote:

> +1 This release passes all tests on the graphframes and tensorframes
> packages.
>
> On Wed, Jun 22, 2016 at 7:19 AM, Cody Koeninger 
> wrote:
>
>> If we're considering backporting changes for the 0.8 kafka
>> integration, I am sure there are people who would like to get
>>
>> https://issues.apache.org/jira/browse/SPARK-10963
>>
>> into 1.6.x as well
>>
>> On Wed, Jun 22, 2016 at 7:41 AM, Sean Owen  wrote:
>> > Good call, probably worth back-porting, I'll try to do that. I don't
>> > think it blocks a release, but would be good to get into a next RC if
>> > any.
>> >
>> > On Wed, Jun 22, 2016 at 11:38 AM, Pete Robbins 
>> wrote:
>> >> This has failed on our 1.6 stream builds regularly.
>> >> (https://issues.apache.org/jira/browse/SPARK-6005) looks fixed in 2.0?
>> >>
>> >> On Wed, 22 Jun 2016 at 11:15 Sean Owen  wrote:
>> >>>
>> >>> Oops, one more in the "does anybody else see this" department:
>> >>>
>> >>> - offset recovery *** FAILED ***
>> >>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
>> >>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
>> >>>
>> >>>
>> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
>> >>>
>> >>>
>> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
>> >>>
>> >>>
>> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]
>> >>> was false Recovered ranges are not the same as the ones generated
>> >>> (DirectKafkaStreamSuite.scala:301)
>> >>>
>> >>> This actually fails consistently for me too in the Kafka integration
>> >>> code. Not timezone related, I think.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
OK, JIRA created: https://issues.apache.org/jira/browse/SPARK-16080

Also, after looking at the code a bit I think I see the reason. If I'm
correct, it may actually be a very easy fix.
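
For anyone trying to reproduce the diagnosis below, the quickest check inside
a running container is to look at the classpath YARN actually generated. A
sketch (the container directory is copied from the quoted output and will
differ per application):

  CONTAINER_DIR=/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
  sudo grep CLASSPATH "$CONTAINER_DIR/launch_container.sh"   # should show the __spark_conf__ entry without .zip
  sudo ls -l "$CONTAINER_DIR" | grep __spark_conf__          # while only __spark_conf__.zip exists on disk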

On Mon, Jun 20, 2016 at 1:21 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> It doesn't hurt to have a bug tracking it, in case anyone else has
> time to look at it before I do.
>
> On Mon, Jun 20, 2016 at 1:20 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
> > Thanks for the confirmation! Shall I cut a JIRA issue?
> >
> > On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>
> >> I just tried this locally and can see the wrong behavior you mention.
> >> I'm running a somewhat old build of 2.0, but I'll take a look.
> >>
> >> On Mon, Jun 20, 2016 at 7:04 AM, Jonathan Kelly <jonathaka...@gmail.com
> >
> >> wrote:
> >> > Does anybody have any thoughts on this?
> >> >
> >> > On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly <
> jonathaka...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT (commit
> >> >> bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> >> >> log4j.properties is
> >> >> not getting picked up in the executor classpath (and driver classpath
> >> >> for
> >> >> yarn-cluster mode), so Hadoop's log4j.properties file is taking
> >> >> precedence
> >> >> in the YARN containers.
> >> >>
> >> >> Spark's log4j.properties file is correctly being bundled into the
> >> >> __spark_conf__.zip file and getting added to the DistributedCache,
> but
> >> >> it is
> >> >> not in the classpath of the executor, as evidenced by the following
> >> >> command,
> >> >> which I ran in spark-shell:
> >> >>
> >> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> >> getClass().getResource("/log4j.properties")).first
> >> >> res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties
> >> >>
> >> >> I then ran the following in spark-shell to verify the classpath of
> the
> >> >> executors:
> >> >>
> >> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> >> System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e
> >> >> =>
> >> >> !e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
> >> >> ...
> >> >>
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >> >>
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
> >> >> /etc/hadoop/conf
> >> >> ...
> >> >>
> >> >> So the JVM has this nonexistent __spark_conf__ directory in the
> >> >> classpath
> >> >> when it should really be __spark_conf__.zip (which is actually a
> >> >> symlink to
> >> >> a directory, despite the .zip filename).
> >> >>
> >> >> % sudo ls -l
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >> >> total 20
> >> >> -rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
> >> >> -rwx-- 1 yarn yarn  594 Jun 18 01:26
> >> >> default_container_executor_session.sh
> >> >> -rwx-- 1 yarn yarn  648 Jun 18 01:26
> default_container_executor.sh
> >> >> -rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
> >> >> lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
> >> >> /mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
> >> >> lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
> >> >> drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp
> >> >>
> >> >> Does anybody know why this is happening? Is this a bug in Spark, or
> is
> >> >> it
> >> >> the JVM doing this (possibly because the extension is .zip)?
> >> >>
> >> >> Thanks,
> >> >> Jonathan
> >>
> >>
> >>
> >> --
> >> Marcelo
>
>
>
> --
> Marcelo
>


Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Thanks for the confirmation! Shall I cut a JIRA issue?

On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin <van...@cloudera.com> wrote:

> I just tried this locally and can see the wrong behavior you mention.
> I'm running a somewhat old build of 2.0, but I'll take a look.
>
> On Mon, Jun 20, 2016 at 7:04 AM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
> > Does anybody have any thoughts on this?
> >
> > On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly <jonathaka...@gmail.com>
> > wrote:
> >>
> >> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT (commit
> >> bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> log4j.properties is
> >> not getting picked up in the executor classpath (and driver classpath
> for
> >> yarn-cluster mode), so Hadoop's log4j.properties file is taking
> precedence
> >> in the YARN containers.
> >>
> >> Spark's log4j.properties file is correctly being bundled into the
> >> __spark_conf__.zip file and getting added to the DistributedCache, but
> it is
> >> not in the classpath of the executor, as evidenced by the following
> command,
> >> which I ran in spark-shell:
> >>
> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> getClass().getResource("/log4j.properties")).first
> >> res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties
> >>
> >> I then ran the following in spark-shell to verify the classpath of the
> >> executors:
> >>
> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
> >> !e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
> >> ...
> >>
> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >>
> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
> >> /etc/hadoop/conf
> >> ...
> >>
> >> So the JVM has this nonexistent __spark_conf__ directory in the
> classpath
> >> when it should really be __spark_conf__.zip (which is actually a
> symlink to
> >> a directory, despite the .zip filename).
> >>
> >> % sudo ls -l
> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >> total 20
> >> -rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
> >> -rwx-- 1 yarn yarn  594 Jun 18 01:26
> >> default_container_executor_session.sh
> >> -rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
> >> -rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
> >> lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
> >> /mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
> >> lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
> >>
> /mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
> >> drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp
> >>
> >> Does anybody know why this is happening? Is this a bug in Spark, or is
> it
> >> the JVM doing this (possibly because the extension is .zip)?
> >>
> >> Thanks,
> >> Jonathan
>
>
>
> --
> Marcelo
>


Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Does anybody have any thoughts on this?
On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT
> (commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> log4j.properties is not getting picked up in the executor classpath (and
> driver classpath for yarn-cluster mode), so Hadoop's log4j.properties file
> is taking precedence in the YARN containers.
>
> Spark's log4j.properties file is correctly being bundled into the
> __spark_conf__.zip file and getting added to the DistributedCache, but it
> is not in the classpath of the executor, as evidenced by the following
> command, which I ran in spark-shell:
>
> scala> sc.parallelize(Seq(1)).map(_ =>
> getClass().getResource("/log4j.properties")).first
> res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties
>
> I then ran the following in spark-shell to verify the classpath of the
> executors:
>
> scala> sc.parallelize(Seq(1)).map(_ =>
> System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
> !e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
> ...
>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
> /etc/hadoop/conf
> ...
>
> So the JVM has this nonexistent __spark_conf__ directory in the classpath
> when it should really be __spark_conf__.zip (which is actually a symlink
> to a directory, despite the .zip filename).
>
> % sudo ls -l
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> total 20
> -rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
> -rwx-- 1 yarn yarn  594 Jun 18 01:26
> default_container_executor_session.sh
> -rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
> -rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
> lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
> /mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
> lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
> /mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
> drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp
>
> Does anybody know why this is happening? Is this a bug in Spark, or is it
> the JVM doing this (possibly because the extension is .zip)?
>
> Thanks,
> Jonathan
>


Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Jonathan Kelly
+1 (non-binding)

On Thu, Jun 16, 2016 at 9:49 PM Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.2!
>
> The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.2-rc1
> (4168d9c94a9564f6b3e62f5d669acde13a7c7cf6)
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1184
>
> The documentation corresponding to this release can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.1.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
>


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Jonathan Kelly
I think what Reynold probably means is that previews are releases for which
a vote *passed*.

~ Jonathan

On Wed, Jun 1, 2016 at 1:53 PM Marcelo Vanzin  wrote:

> So are RCs, aren't they?
>
> Personally I'm fine with not releasing to maven central. Any extra
> effort needed by regular users to use a preview / RC is good with me.
>
> On Wed, Jun 1, 2016 at 1:50 PM, Reynold Xin  wrote:
> > To play devil's advocate, previews are technically not RCs. They are
> > actually voted releases.
> >
> > On Wed, Jun 1, 2016 at 1:46 PM, Michael Armbrust  >
> > wrote:
> >>
> >> Yeah, we don't usually publish RCs to central, right?
> >>
> >> On Wed, Jun 1, 2016 at 1:06 PM, Reynold Xin 
> wrote:
> >>>
> >>> They are here ain't they?
> >>>
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1182/
> >>>
> >>> Did you mean publishing them to maven central? My understanding is that
> >>> publishing to maven central isn't a required step of doing theses. This
> >>> might be a good opportunity to discuss that. My thought is that it is
> since
> >>> Maven central is immutable, and the purposes of the preview releases
> are to
> >>> get people to test it early on in preparation for the actual release,
> it
> >>> might be better to not publish preview releases to maven central. Users
> >>> testing with preview releases can just use the temporary repository
> above.
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Jun 1, 2016 at 11:36 AM, Sean Owen  wrote:
> 
>  Just checked and they are still not published this week. Can these be
>  published ASAP to complete the 2.0.0-preview release?
> >>>
> >>>
> >>
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: some joins stopped working with spark 2.0.0 SNAPSHOT

2016-02-27 Thread Jonathan Kelly
If you want to find what commit caused it, try out the "git bisect" command.
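
A minimal sketch of that workflow (the good commit id is a placeholder for
whatever last worked for you):

  git bisect start
  git bisect bad HEAD                     # the current 2.0.0-SNAPSHOT that fails
  git bisect good <last-known-good-sha>   # e.g. the snapshot from ~a week ago that passed
  # rebuild and re-run the failing join at each step, then mark it:
  #   git bisect good     (or: git bisect bad)
  git bisect reset                        # when finished
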
On Sat, Feb 27, 2016 at 11:06 AM Koert Kuipers  wrote:

> https://issues.apache.org/jira/browse/SPARK-13531
>
> On Sat, Feb 27, 2016 at 3:49 AM, Reynold Xin  wrote:
>
>> Can you file a JIRA ticket?
>>
>>
>> On Friday, February 26, 2016, Koert Kuipers  wrote:
>>
>>> dataframe df1:
>>> schema:
>>> StructType(StructField(x,IntegerType,true))
>>> explain:
>>> == Physical Plan ==
>>> MapPartitions , obj#135: object, [if (input[0,
>>> object].isNullAt) null else input[0, object].get AS x#128]
>>> +- MapPartitions , createexternalrow(if (isnull(x#9)) null
>>> else x#9), [input[0, object] AS obj#135]
>>>+- WholeStageCodegen
>>>   :  +- Project [_1#8 AS x#9]
>>>   : +- Scan ExistingRDD[_1#8]
>>> show:
>>> +---+
>>> |  x|
>>> +---+
>>> |  2|
>>> |  3|
>>> +---+
>>>
>>>
>>> dataframe df2:
>>> schema:
>>> StructType(StructField(x,IntegerType,true),
>>> StructField(y,StringType,true))
>>> explain:
>>> == Physical Plan ==
>>> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null
>>> else y#3.toString), [if (input[0, object].isNullAt) null else input[0,
>>> object].get AS x#130,if (input[0, object].isNullAt) null else
>>> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType,
>>> fromString, input[0, object].get, true) AS y#131]
>>> +- WholeStageCodegen
>>>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>>>: +- Scan ExistingRDD[_1#0,_2#1]
>>> show:
>>> +---+---+
>>> |  x|  y|
>>> +---+---+
>>> |  1|  1|
>>> |  2|  2|
>>> |  3|  3|
>>> +---+---+
>>>
>>>
>>> i run:
>>> df1.join(df2, Seq("x")).show
>>>
>>> i get:
>>> java.lang.UnsupportedOperationException: No size estimation available
>>> for objects.
>>> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
>>> at
>>> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
>>> at
>>> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>>> at scala.collection.immutable.List.foreach(List.scala:381)
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>>> at scala.collection.immutable.List.map(List.scala:285)
>>> at
>>> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
>>> at
>>> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87)
>>>
>>> not sure what changed, this ran about a week ago without issues (in our
>>> internal unit tests). it is fully reproducible, however when i tried to
>>> minimize the issue i could not reproduce it by just creating data frames in
>>> the repl with the same contents, so it probably has something to do with
>>> way these are created (from Row objects and StructTypes).
>>>
>>> best, koert
>>>
>>
>


Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Interesting, I was not aware of spark.yarn.am.nodeLabelExpression.

We do use YARN labels on EMR; each node is automatically labeled with its
type (MASTER, CORE, or TASK). And we do
set yarn.app.mapreduce.am.labels=CORE in yarn-site.xml, but we do not set
spark.yarn.am.nodeLabelExpression.

Does Spark somehow not actually honor this? It seems weird that Spark would
have its own similar-sounding property (spark.yarn.am.nodeLabelExpression).
If spark.yarn.am.nodeLabelExpression is used
and yarn.app.mapreduce.am.labels ignored, I could be wrong about Spark AMs
only running on CORE instances in EMR.

I'm guessing though that spark.yarn.am.nodeLabelExpression would simply
override yarn.app.mapreduce.am.labels, so yarn.app.mapreduce.am.labels
would be treated as a default when it is set and
spark.yarn.am.nodeLabelExpression is not. Is that correct?

In short, Alex, you should not need to set any of the label-related
properties yourself if you do what I suggested regarding using small CORE
instances and large TASK instances. But if you want to do something
different, it would also be possible to add a TASK instance group with
small nodes and configured with some new label. Then you could set
spark.yarn.am.nodeLabelExpression to that label.
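
For example, something along these lines at submit time (just a sketch; the
SPARKAM label and the application details are placeholders):

  spark-submit --master yarn \
    --conf spark.yarn.am.nodeLabelExpression=SPARKAM \
    --class com.example.MyApp /path/to/my-app.jar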

Thanks, Marcelo, for pointing out spark.yarn.am.nodeLabelExpression!

~ Jonathan

On Tue, Feb 9, 2016 at 9:54 AM Marcelo Vanzin <van...@cloudera.com> wrote:

> You should be able to use spark.yarn.am.nodeLabelExpression if your
> version of YARN supports node labels (and you've added a label to the
> node where you want the AM to run).
>
> On Tue, Feb 9, 2016 at 9:51 AM, Alexander Pivovarov
> <apivova...@gmail.com> wrote:
> > Am container starts first and yarn selects random computer to run it.
> >
> > Is it possible to configure yarn so that it selects small computer for am
> > container.
> >
> > On Feb 9, 2016 12:40 AM, "Sean Owen" <so...@cloudera.com> wrote:
> >>
> >> If it's too small to run an executor, I'd think it would be chosen for
> >> the AM as the only way to satisfy the request.
> >>
> >> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov
> >> <apivova...@gmail.com> wrote:
> >> > If I add additional small box to the cluster can I configure yarn to
> >> > select
> >> > small box to run am container?
> >> >
> >> >
> >> > On Mon, Feb 8, 2016 at 10:53 PM, Sean Owen <so...@cloudera.com>
> wrote:
> >> >>
> >> >> Typically YARN is there because you're mediating resource requests
> >> >> from things besides Spark, so yeah using every bit of the cluster is
> a
> >> >> little bit of a corner case. There's not a good answer if all your
> >> >> nodes are the same size.
> >> >>
> >> >> I think you can let YARN over-commit RAM though, and allocate more
> >> >> memory than it actually has. It may be beneficial to let them all
> >> >> think they have an extra GB, and let one node running the AM
> >> >> technically be overcommitted, a state which won't hurt at all unless
> >> >> you're really really tight on memory, in which case something might
> >> >> get killed.
> >> >>
> >> >> On Tue, Feb 9, 2016 at 6:49 AM, Jonathan Kelly <
> jonathaka...@gmail.com>
> >> >> wrote:
> >> >> > Alex,
> >> >> >
> >> >> > That's a very good question that I've been trying to answer myself
> >> >> > recently
> >> >> > too. Since you've mentioned before that you're using EMR, I assume
> >> >> > you're
> >> >> > asking this because you've noticed this behavior on emr-4.3.0.
> >> >> >
> >> >> > In this release, we made some changes to the
> >> >> > maximizeResourceAllocation
> >> >> > (which you may or may not be using, but either way this issue is
> >> >> > present),
> >> >> > including the accidental inclusion of somewhat of a bug that makes
> it
> >> >> > not
> >> >> > reserve any space for the AM, which ultimately results in one of
> the
> >> >> > nodes
> >> >> > being utilized only by the AM and not an executor.
> >> >> >
> >> >> > However, as you point out, the only viable fix seems to be to
> reserve
> >> >> > enough
> >> >> > memory for the AM on *every single node*, which in some cases might
> >> >> > actually
> >> >> >

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Sean, I'm not sure if that's actually the case, since the AM would be
allocated before the executors are even requested (by the driver through
the AM), right? This must at least be the case with dynamicAllocation
enabled, but I would expect that it's true regardless.

However, Alex, yes, this would be possible on EMR if you use small CORE
instances and larger TASK instances. EMR is configured to run AMs only on
CORE instances, so if you don't need much HDFS space (HDFS is stored only
on CORE instances, not TASK instances), this might be a good option for
you. Note though that you would have to set spark.executor.memory yourself
though rather than using maximizeResourceAllocation because
maximizeResourceAllocation currently only considers the size of the CORE
instances when determining spark.{driver,executor}.memory.
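
Concretely, that just means something like the following when submitting (a
sketch only; the 40g figure is an illustration for whatever your larger TASK
instance type can actually hold, and the app details are placeholders):

  spark-submit --master yarn \
    --conf spark.executor.memory=40g \
    --class com.example.MyApp /path/to/my-app.jar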

~ Jonathan

On Tue, Feb 9, 2016 at 12:40 AM Sean Owen <so...@cloudera.com> wrote:

> If it's too small to run an executor, I'd think it would be chosen for
> the AM as the only way to satisfy the request.
>
> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov
> <apivova...@gmail.com> wrote:
> > If I add additional small box to the cluster can I configure yarn to
> select
> > small box to run am container?
> >
> >
> > On Mon, Feb 8, 2016 at 10:53 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Typically YARN is there because you're mediating resource requests
> >> from things besides Spark, so yeah using every bit of the cluster is a
> >> little bit of a corner case. There's not a good answer if all your
> >> nodes are the same size.
> >>
> >> I think you can let YARN over-commit RAM though, and allocate more
> >> memory than it actually has. It may be beneficial to let them all
> >> think they have an extra GB, and let one node running the AM
> >> technically be overcommitted, a state which won't hurt at all unless
> >> you're really really tight on memory, in which case something might
> >> get killed.
> >>
> >> On Tue, Feb 9, 2016 at 6:49 AM, Jonathan Kelly <jonathaka...@gmail.com>
> >> wrote:
> >> > Alex,
> >> >
> >> > That's a very good question that I've been trying to answer myself
> >> > recently
> >> > too. Since you've mentioned before that you're using EMR, I assume
> >> > you're
> >> > asking this because you've noticed this behavior on emr-4.3.0.
> >> >
> >> > In this release, we made some changes to the
> maximizeResourceAllocation
> >> > (which you may or may not be using, but either way this issue is
> >> > present),
> >> > including the accidental inclusion of somewhat of a bug that makes it
> >> > not
> >> > reserve any space for the AM, which ultimately results in one of the
> >> > nodes
> >> > being utilized only by the AM and not an executor.
> >> >
> >> > However, as you point out, the only viable fix seems to be to reserve
> >> > enough
> >> > memory for the AM on *every single node*, which in some cases might
> >> > actually
> >> > be worse than wasting a lot of memory on a single node.
> >> >
> >> > So yeah, I also don't like either option. Is this just the price you
> pay
> >> > for
> >> > running on YARN?
> >> >
> >> >
> >> > ~ Jonathan
> >> >
> >> > On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov
> >> > <apivova...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Lets say that yarn has 53GB memory available on each slave
> >> >>
> >> >> spark.am container needs 896MB.  (512 + 384)
> >> >>
> >> >> I see two options to configure spark:
> >> >>
> >> >> 1. configure spark executors to use 52GB and leave 1 GB on each box.
> >> >> So,
> >> >> some box will also run am container. So, 1GB memory will not be used
> on
> >> >> all
> >> >> slaves but one.
> >> >>
> >> >> 2. configure spark to use all 53GB and add additional 53GB box which
> >> >> will
> >> >> run only am container. So, 52GB on this additional box will do
> nothing
> >> >>
> >> >> I do not like both options. Is there a better way to configure
> >> >> yarn/spark?
> >> >>
> >> >>
> >> >> Alex
> >
> >
>


Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Praveen,

You mean cluster mode, right? That would still in a sense cause one box to
be "wasted", but at least it would be used a bit more to its full
potential, especially if you set spark.driver.memory to higher than its 1g
default. Also, cluster mode is not an option for some applications, such as
the spark-shell, pyspark shell, or Zeppelin.
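
For completeness, a cluster-mode sketch of that (the app and the 4g figure are
placeholders):

  spark-submit --master yarn --deploy-mode cluster \
    --driver-memory 4g \
    --class com.example.MyApp /path/to/my-app.jar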

~ Jonathan

On Tue, Feb 9, 2016 at 5:48 AM praveen S  wrote:

> How about running in client mode, so that the client from which it is run
> becomes the driver.
>
> Regards,
> Praveen
> On 9 Feb 2016 16:59, "Steve Loughran"  wrote:
>
>>
>> > On 9 Feb 2016, at 06:53, Sean Owen  wrote:
>> >
>> >
>> > I think you can let YARN over-commit RAM though, and allocate more
>> > memory than it actually has. It may be beneficial to let them all
>> > think they have an extra GB, and let one node running the AM
>> > technically be overcommitted, a state which won't hurt at all unless
>> > you're really really tight on memory, in which case something might
>> > get killed.
>>
>>
>> from my test VMs
>>
>>   <property>
>>     <description>Whether physical memory limits will be enforced for
>>       containers.
>>     </description>
>>     <name>yarn.nodemanager.pmem-check-enabled</name>
>>     <value>false</value>
>>   </property>
>>
>>   <property>
>>     <name>yarn.nodemanager.vmem-check-enabled</name>
>>     <value>false</value>
>>   </property>
>>
>>
>> it does mean that a container can swap massively, hurting the performance
>> of all containers around it as IO bandwidth gets soaked up —which is why
>> the checks are on for shared clusters. If it's dedicated, you can overcommit
>
>


Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
You can set custom per-instance-group configurations (e.g.,
["classification":"yarn-site",properties:{"yarn.nodemanager.labels":"SPARKAM"}])
using the Configurations parameter of
http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_InstanceGroupConfig.html.
Unfortunately, it's not currently possible to specify per-instance-group
configurations via the CLI though, only cluster wide configurations.
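
For reference, a sketch of roughly how that entry could look inside an
InstanceGroupConfig (the name, instance type, and count are placeholders, and
the yarn-site property is simply the one from the example above):

  {
    "Name": "sparkAM",
    "InstanceRole": "TASK",
    "InstanceType": "m3.xlarge",
    "InstanceCount": 1,
    "Configurations": [
      {
        "Classification": "yarn-site",
        "Properties": { "yarn.nodemanager.labels": "SPARKAM" }
      }
    ]
  }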

~ Jonathan

On Tue, Feb 9, 2016 at 12:36 PM Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Thanks Jonathan
>
> Actually I'd like to use maximizeResourceAllocation.
>
> Ideally for me would be to add new instance group having single small box
> labelled as AM
> I'm not sure "aws emr create-cluster" supports setting custom LABELS, the
> only settings available are:
>
> InstanceCount=1,BidPrice=0.5,Name=sparkAM,InstanceGroupType=TASK,InstanceType=m3.xlarge
>
>
> How can I specify yarn label AM for that box?
>
>
>
> On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Interesting, I was not aware of spark.yarn.am.nodeLabelExpression.
>>
>> We do use YARN labels on EMR; each node is automatically labeled with its
>> type (MASTER, CORE, or TASK). And we do
>> set yarn.app.mapreduce.am.labels=CORE in yarn-site.xml, but we do not set
>> spark.yarn.am.nodeLabelExpression.
>>
>> Does Spark somehow not actually honor this? It seems weird that Spark
>> would have its own similar-sounding property
>> (spark.yarn.am.nodeLabelExpression). If spark.yarn.am.nodeLabelExpression
>> is used and yarn.app.mapreduce.am.labels ignored, I could be wrong about
>> Spark AMs only running on CORE instances in EMR.
>>
>> I'm guessing though that spark.yarn.am.nodeLabelExpression would simply
>> override yarn.app.mapreduce.am.labels, so yarn.app.mapreduce.am.labels
>> would be treated as a default when it is set and
>> spark.yarn.am.nodeLabelExpression is not. Is that correct?
>>
>> In short, Alex, you should not need to set any of the label-related
>> properties yourself if you do what I suggested regarding using small CORE
>> instances and large TASK instances. But if you want to do something
>> different, it would also be possible to add a TASK instance group with
>> small nodes and configured with some new label. Then you could set
>> spark.yarn.am.nodeLabelExpression to that label.
>>
>> Thanks, Marcelo, for pointing out spark.yarn.am.nodeLabelExpression!
>>
>> ~ Jonathan
>>
>> On Tue, Feb 9, 2016 at 9:54 AM Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>
>>> You should be able to use spark.yarn.am.nodeLabelExpression if your
>>> version of YARN supports node labels (and you've added a label to the
>>> node where you want the AM to run).
>>>
>>> On Tue, Feb 9, 2016 at 9:51 AM, Alexander Pivovarov
>>> <apivova...@gmail.com> wrote:
>>> > Am container starts first and yarn selects random computer to run it.
>>> >
>>> > Is it possible to configure yarn so that it selects small computer for
>>> am
>>> > container.
>>> >
>>> > On Feb 9, 2016 12:40 AM, "Sean Owen" <so...@cloudera.com> wrote:
>>> >>
>>> >> If it's too small to run an executor, I'd think it would be chosen for
>>> >> the AM as the only way to satisfy the request.
>>> >>
>>> >> On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov
>>> >> <apivova...@gmail.com> wrote:
>>> >> > If I add additional small box to the cluster can I configure yarn to
>>> >> > select
>>> >> > small box to run am container?
>>> >> >
>>> >> >
>>> >> > On Mon, Feb 8, 2016 at 10:53 PM, Sean Owen <so...@cloudera.com>
>>> wrote:
>>> >> >>
>>> >> >> Typically YARN is there because you're mediating resource requests
>>> >> >> from things besides Spark, so yeah using every bit of the cluster
>>> is a
>>> >> >> little bit of a corner case. There's not a good answer if all your
>>> >> >> nodes are the same size.
>>> >> >>
>>> >> >> I think you can let YARN over-commit RAM though, and allocate more
>>> >> >> memory than it actually has. It may be beneficial to let them all
>>> >> >> think they have an extra GB, and let one node running the AM
>>> >> >> technically be overcommi

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Oh, sheesh, how silly of me. I copied and pasted that setting name without
even noticing the "mapreduce" in it. Yes, I guess that would mean that
Spark AMs are probably running even on TASK instances currently, which is
OK but not consistent with what we do for MapReduce. I'll make sure we
set spark.yarn.am.nodeLabelExpression appropriately in the next EMR release.
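
Concretely, that would presumably amount to a spark-defaults.conf entry along
these lines (a sketch, assuming we keep labeling nodes with their instance
group type as described above):

  spark.yarn.am.nodeLabelExpression  CORE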

~ Jonathan

On Tue, Feb 9, 2016 at 1:30 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
> > And we do set yarn.app.mapreduce.am.labels=CORE
>
> That sounds very mapreduce-specific, so I doubt Spark (or anything
> non-MR) would honor it.
>
> --
> Marcelo
>


Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Jonathan Kelly
Alex,

That's a very good question that I've been trying to answer myself recently
too. Since you've mentioned before that you're using EMR, I assume you're
asking this because you've noticed this behavior on emr-4.3.0.

In this release, we made some changes to the maximizeResourceAllocation
(which you may or may not be using, but either way this issue is present),
including the accidental inclusion of somewhat of a bug that makes it not
reserve any space for the AM, which ultimately results in one of the nodes
being utilized only by the AM and not an executor.

However, as you point out, the only viable fix seems to be to reserve
enough memory for the AM on *every single node*, which in some cases might
actually be worse than wasting a lot of memory on a single node.

So yeah, I also don't like either option. Is this just the price you pay
for running on YARN?


~ Jonathan
On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov 
wrote:

> Lets say that yarn has 53GB memory available on each slave
>
> spark.am container needs 896MB.  (512 + 384)
>
> I see two options to configure spark:
>
> 1. configure spark executors to use 52GB and leave 1 GB on each box. So,
> some box will also run am container. So, 1GB memory will not be used on all
> slaves but one.
>
> 2. configure spark to use all 53GB and add additional 53GB box which will
> run only am container. So, 52GB on this additional box will do nothing
>
> I do not like both options. Is there a better way to configure yarn/spark?
>
>
> Alex
>


Re: A proposal for Spark 2.0

2015-11-11 Thread Jonathan Kelly
If Scala 2.12 will require Java 8 and we want to enable cross-compiling
Spark against Scala 2.11 and 2.12, couldn't we just make Java 8 a
requirement if you want to use Scala 2.12?

On Wed, Nov 11, 2015 at 9:29 AM, Koert Kuipers  wrote:

> i would drop scala 2.10, but definitely keep java 7
>
> cross build for scala 2.12 is great, but i dont know how that works with
> java 8 requirement. dont want to make java 8 mandatory.
>
> and probably stating the obvious, but a lot of apis got polluted due to
> binary compatibility requirement. cleaning that up assuming only source
> compatibility would be a good idea, right?
>
> On Tue, Nov 10, 2015 at 6:10 PM, Reynold Xin  wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature request in this thread. Not
>> that we shouldn’t be adding features, but we can always add features in
>> 1.7, 2.1, 2.2, ...
>>
>> First - I want to propose a premise for how to think about Spark 2.0 and
>> major releases in Spark, based on discussion with several members of the
>> community: a major release should be low overhead and minimally disruptive
>> to the Spark community. A major release should not be very different from a
>> minor release and should not be gated based on new features. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>> For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>
>> If the community likes the above model, then to me it seems reasonable to
>> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
>> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
>> major releases every 2 years seems doable within the above model.
>>
>> Under this model, here is a list of example things I would propose doing
>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>
>>
>> APIs
>>
>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 1.x.
>>
>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>> about user applications being unable to use Akka due to Spark’s dependency
>> on Akka.
>>
>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>
>> 4. Better class package structure for low level developer API’s. In
>> particular, we have some DeveloperApi (mostly various listener-related
>> classes) added over the years. Some packages include only one or two public
>> classes but a lot of private classes. A better structure is to have public
>> classes isolated to a few public packages, and these public packages should
>> have minimal private classes for low level developer APIs.
>>
>> 5. Consolidate task metric and accumulator API. Although having some
>> subtle differences, these two are very similar but have completely
>> different code path.
>>
>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>> moving them to other package(s). They are already used beyond SQL, e.g. in
>> ML pipelines, and will be used by streaming also.
>>
>>
>> Operation/Deployment
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but it has been end-of-life.
>>
>> 2. Remove Hadoop 1 support.
>>
>> 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>>
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Jonathan Kelly
I just clicked the
http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22 link
provided above by Ryan, and I see 1.5.0. Was this just fixed within the
past hour, or is some caching causing some people not to see it?
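
One quick way to check resolution directly against Central itself rather than
the search index (a sketch):

  mvn dependency:get -Dartifact=org.apache.spark:spark-core_2.10:1.5.0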

On Fri, Sep 11, 2015 at 10:24 AM, Reynold Xin  wrote:

> It is already there, but the search is not updated. Not sure what's going
> on with maven central search.
>
>
> http://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.10/1.5.0/
>
>
>
> On Fri, Sep 11, 2015 at 10:21 AM, Ryan Williams <
> ryan.blake.willi...@gmail.com> wrote:
>
>> Any idea why 1.5.0 is not in Maven central yet?
>> Is that a separate release process?
>>
>>
>> On Wed, Sep 9, 2015 at 12:40 PM andy petrella 
>> wrote:
>>
>>> You can try it out really quickly by "building" a Spark Notebook from
>>> http://spark-notebook.io/.
>>>
>>> Just choose the master branch and 1.5.0, a correct hadoop version
>>> (default to 2.2.0 though) and there you go :-)
>>>
>>>
>>> On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:
>>>
 Jerry:
 I just tried building hbase-spark module with 1.5.0 and I see:

 ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
 total 21712
 -rw-r--r--  1 tyu  staff   196 Sep  9 09:37 _maven.repositories
 -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
 -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
 -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
 -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1

 FYI

 On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I'm eager to try it out! However, I got problems in resolving
> dependencies:
> [warn] [NOT FOUND  ] org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
> [warn]  jcenter: tried
>
> When the package will be available?
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
> look...@gmail.com> wrote:
>
>> Yeii!
>>
>> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
>> yuu.ishikawa+sp...@gmail.com> wrote:
>>
>>> Great work, everyone!
>>>
>>>
>>>
>>> -
>>> -- Yu Ishikawa
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Spark-1-5-0-tp14013p14015.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>
 --
>>> andy
>>>
>>
>