I'm testing with Spark 2.3 and see a difference in SQL coalesce behavior talking
to Hive vs. Spark 2.2. It seems Spark 2.3 ignores the coalesce.
Query:
spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >=
'20170301' AND dt <= '20170331' AND something IS NOT
NULL").coalesce(160000).show()
In Spark 2.2 the coalesce works here, but in Spark 2.3 it doesn't. Does anyone
know about this issue, or are there some relevant config changes? Otherwise I'll
file a JIRA.
Note I also see a performance difference when reading cached data: for a small
query on 19 GB of cached data, Spark 2.3 is about 30% worse (13 seconds on Spark
2.2 vs. 17 seconds on Spark 2.3). Reading straight from Hive (ORC) seems better,
though.
Tom
On Thursday, February 1, 2018, 11:23:45 AM CST, Michael Heuer
<[email protected]> wrote:
We found two classes new to Spark 2.3.0 that must be registered with Kryo for
our tests to pass on RC2:
org.apache.spark.sql.execution.datasources.BasicWriteTaskStats
org.apache.spark.sql.execution.datasources.ExecutedWriteSummary
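For anyone hitting the same thing: as a sketch, one way to register these
without code changes is via configuration rather than a custom KryoRegistrator
(assuming spark.kryo.registrationRequired=true is what makes the missing
registrations fail). An illustrative spark-defaults.conf fragment:

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     org.apache.spark.sql.execution.datasources.BasicWriteTaskStats,org.apache.spark.sql.execution.datasources.ExecutedWriteSummary
```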
https://github.com/bigdatagenomics/adam/pull/1897
Perhaps a mention in release notes?
michael
On Thu, Feb 1, 2018 at 3:29 AM, Nick Pentreath <[email protected]> wrote:
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that
should be everything outstanding.
On Thu, 1 Feb 2018 at 06:21 Yin Huai <[email protected]> wrote:
It seems we are not running the pandas-related tests in the PySpark test suite
(see my email "python tests related to pandas are skipped in jenkins"). I think
we should fix this test issue and make sure all tests are good before cutting
RC3.
On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal <[email protected]> wrote:
Just a quick status update on RC3 -- SPARK-23274 was resolved yesterday and
tests have been quite healthy throughout this week and the last. I'll cut the
new RC as soon as the remaining blocker (SPARK-23202) is resolved.
On 30 January 2018 at 10:12, Andrew Ash <[email protected]> wrote:
I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release
as well, due to being a regression from 2.2.0. The ticket has a simple repro
included, showing a query that works in prior releases but now fails with an
exception in the catalyst optimizer.
On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal <[email protected]> wrote:
This vote has failed due to a number of aforementioned blockers. I'll follow up
with RC3 as soon as the 2 remaining (non-QA) blockers are resolved:
https://s.apache.org/oXKi
On 25 January 2018 at 12:59, Sameer Agarwal <[email protected]> wrote:
Most tests pass on RC2, except I'm still seeing the timeout caused by
https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never finish. I
followed the thread a bit further and wasn't clear whether it was subsequently
re-fixed for 2.3.0 or not. It says it's resolved along with
https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0, though I am still
seeing these tests fail or hang:
- subscribing topic by name from earliest offsets (failOnDataLoss: false)
- subscribing topic by name from earliest offsets (failOnDataLoss: true)
Sean, while some of these tests were timing out on RC1, we're not aware of any
known issues in RC2. Both maven
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
and sbt
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
healthy. If you're still seeing timeouts in RC2, can you create a JIRA with any
applicable build/env info?
On Tue, Jan 23, 2018 at 9:01 AM Sean Owen <[email protected]> wrote:
I'm not seeing that same problem on OS X and /usr/bin/tar. I tried unpacking it
with 'xvzf' and also unzipping it first, and it untarred without warnings in
either case.
I am encountering errors while running the tests, different ones each time, so
am still figuring out whether there is a real problem or just flaky tests.
These issues look like blockers, as they are inherently meant to be completed
before the 2.3 release, and they are mostly not done. I suppose I'd -1 on behalf
of those who say this needs to be done first, though we can keep testing.
SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
SPARK-23114 Spark R 2.3 QA umbrella
Here are the remaining items targeted for 2.3:
SPARK-15689 Data source API v2
SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
SPARK-21646 Add new type coercion rules to compatible with Hive
SPARK-22386 Data Source V2 improvements
SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
SPARK-22735 Add VectorSizeHint to ML features documentation
SPARK-22739 Additional Expression Support for Objects
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-22820 Spark 2.3 SQL API audit
On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin <[email protected]> wrote:
+0
Signatures check out. Code compiles, although I see the errors in [1]
when untarring the source archive; perhaps we should add "use GNU tar"
to the RM checklist?
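For the checklist, a minimal round-trip check along these lines might help spot
the incompatibility early (a sketch, assuming GNU tar on Linux; on macOS the
system /usr/bin/tar is bsdtar, and a GNU tar installed via Homebrew is typically
named gtar; all paths here are illustrative):

```shell
# Create a small tarball, then extract it while capturing any tar warnings
# on stderr. An empty warnings file means a clean round trip; with a
# cross-flavor tarball (e.g. created by bsdtar, extracted by GNU tar) the
# "Ignoring unknown extended header keyword" warnings would show up here.
mkdir -p /tmp/rc-tar-check/src
echo "placeholder" > /tmp/rc-tar-check/src/README
tar -czf /tmp/rc-tar-check/archive.tgz -C /tmp/rc-tar-check src
tar -xzf /tmp/rc-tar-check/archive.tgz -C /tmp/rc-tar-check \
  2> /tmp/rc-tar-check/warnings.txt
```

For the real check you'd point the extraction step at the RC source tarball
instead of the placeholder archive.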
Also ran our internal tests and they seem happy.
My concern is the list of open bugs targeted at 2.3.0 (ignoring the
documentation ones). It is not long, but it seems some of those need
to be looked at. It would be nice for the committers who are involved
in those bugs to take a look.
[1] https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt
On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal <[email protected]> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am UTC and
> passes if a majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc2:
> https://github.com/apache/spark/tree/v2.3.0-rc2
> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1262/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
>
>
> FAQ
>
> =======================================
> What are the unresolved issues targeted for 2.3.0?
> =======================================
>
> Please see https://s.apache.org/oXKi. At the time of writing, there are
> currently no known release blockers.
>
> =========================
> How can I help test this release?
> =========================
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks. In Java/Scala, you can
> add the staging repository to your project's resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out-of-date RC going forward).
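A sketch of the PySpark route (the install command is commented out because the
exact tarball name is whatever appears in the v2.3.0-rc2-bin directory above;
all paths here are illustrative):

```shell
# Create an isolated virtualenv so the RC doesn't touch an existing
# pyspark install, then install the RC into it using the venv's own pip.
python3 -m venv /tmp/spark-rc-venv
# Substitute the actual pyspark source tarball from the RC bin directory:
# /tmp/spark-rc-venv/bin/pip install <pyspark tarball from v2.3.0-rc2-bin>
# Then run your workload with /tmp/spark-rc-venv/bin/python and report
# any regressions. Deleting /tmp/spark-rc-venv cleans everything up.
```

For the Java/Scala route, the staging repository URL above would go into the
build's resolvers (e.g. an sbt `resolvers +=` entry or a Maven `<repository>`
block).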
>
> ===========================================
> What should happen to JIRA tickets still targeting 2.3.0?
> ===========================================
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else, please retarget to 2.3.1 or 2.4.0 as
> appropriate.
>
> ===================
> Why is my bug not fixed?
> ===================
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something which is a regression from 2.2.0 and has not been
> correctly targeted please ping me or a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>
>
> Regards,
> Sameer
--
Marcelo
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag