I filed a JIRA for the coalesce issue: SPARK-23304 "Spark SQL coalesce() against hive not working".

Tom
    On Thursday, February 1, 2018, 12:36:02 PM CST, Sameer Agarwal 
<samee...@apache.org> wrote:  
 
 [+ Xiao]
SPARK-23290 does sound like a blocker. On the SQL side, I can confirm that
there were non-trivial changes around repartitioning/coalesce and cache
performance in 2.3 -- we're currently investigating these.
On 1 February 2018 at 10:02, Andrew Ash <and...@andrewash.com> wrote:

I'd like to nominate SPARK-23290 as a potential blocker for the 2.3.0 release.
It's a regression from 2.2.0: user PySpark code that works in 2.2.0 now fails
in the 2.3.0 RCs because the return type of date columns changed from object
to datetime64[ns]. My understanding of the Spark Versioning Policy is that
user code should continue to run in future versions of Spark with the same
major version number.
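
For anyone who wants to check locally, here is a minimal sketch of the kind
of check involved (a hypothetical repro; it assumes the change surfaces
through toPandas(), as SPARK-23290 describes):

    import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(datetime.date(2018, 2, 1),)], ["d"])
    # On 2.2.0 this prints `object`; on the 2.3.0 RCs it prints
    # `datetime64[ns]`, which breaks user code keyed to the old dtype.
    print(df.toPandas()["d"].dtype)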
Thanks!
On Thu, Feb 1, 2018 at 9:50 AM, Tom Graves <tgraves...@yahoo.com.invalid> wrote:

 
Testing with Spark 2.3, I see a difference in SQL coalesce() behavior when
talking to Hive vs. Spark 2.2. It seems Spark 2.3 ignores the coalesce.
Query:spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
'20170301' AND dt <= '20170331' AND something IS NOT 
NULL").coalesce(160000).show()

In Spark 2.2 the coalesce works here, but in Spark 2.3 it doesn't. Does anyone
know about this issue, or were there some config changes? Otherwise I'll file
a JIRA.
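
(For context, a quick way to see whether the coalesce takes effect is to
compare partition counts; a sketch, reusing the query above:)

    # Sketch: compare partition counts before and after coalesce().
    df = spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable "
                   "WHERE dt >= '20170301' AND dt <= '20170331' "
                   "AND something IS NOT NULL")
    before = df.rdd.getNumPartitions()
    after = df.coalesce(160000).rdd.getNumPartitions()
    # coalesce() can only reduce the partition count, so `after` should be
    # min(before, 160000); if 2.3 ignores the coalesce entirely, `after`
    # will simply equal `before`.
    print(before, after)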
Note I also see a performance difference when reading cached data in Spark
2.3: on a small query over 19 GB of cached data, Spark 2.3 is about 30% worse
-- 13 seconds on Spark 2.2 vs. 17 seconds on Spark 2.3. Reading straight from
Hive (ORC) seems better, though.
Tom


    On Thursday, February 1, 2018, 11:23:45 AM CST, Michael Heuer 
<heue...@gmail.com> wrote:  
 
We found two classes new to Spark 2.3.0 that must be registered in Kryo for
our tests to pass on RC2:

org.apache.spark.sql.execution.datasources.BasicWriteTaskStats
org.apache.spark.sql.execution.datasources.ExecutedWriteSummary

https://github.com/bigdatagenomics/adam/pull/1897
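
(In case it helps others, a sketch of one way to do the registration via
Spark conf -- PySpark shown here as an assumption on our part; the linked PR
does the registration in Scala instead:)

    from pyspark.sql import SparkSession

    # Sketch: register the two classes that are new in Spark 2.3.0 so that
    # Kryo with registrationRequired=true doesn't fail on write tasks.
    spark = (SparkSession.builder
             .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
             .config("spark.kryo.registrationRequired", "true")
             .config("spark.kryo.classesToRegister",
                     "org.apache.spark.sql.execution.datasources.BasicWriteTaskStats,"
                     "org.apache.spark.sql.execution.datasources.ExecutedWriteSummary")
             .getOrCreate())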

Perhaps a mention in release notes?

   michael


On Thu, Feb 1, 2018 at 3:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that 
should be everything outstanding.

On Thu, 1 Feb 2018 at 06:21 Yin Huai <yh...@databricks.com> wrote:

Seems we are not running the pandas-related tests in the PySpark test suite
(see my email "python tests related to pandas are skipped in jenkins"). I
think we should fix this test issue and make sure all tests are good before
cutting RC3.
On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal <samee...@apache.org> wrote:

Just a quick status update on RC3 -- SPARK-23274 was resolved yesterday and 
tests have been quite healthy throughout this week and the last. I'll cut the 
new RC as soon as the remaining blocker (SPARK-23202) is resolved.

On 30 January 2018 at 10:12, Andrew Ash <and...@andrewash.com> wrote:

I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release 
as well, due to being a regression from 2.2.0.  The ticket has a simple repro 
included, showing a query that works in prior releases but now fails with an 
exception in the catalyst optimizer.
On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal <sameer.a...@gmail.com> wrote:

This vote has failed due to a number of aforementioned blockers. I'll follow up
with RC3 as soon as the 2 remaining (non-QA) blockers are resolved:
https://s.apache.org/oXKi


On 25 January 2018 at 12:59, Sameer Agarwal <sameer.a...@gmail.com> wrote:



Most tests pass on RC2, except I'm still seeing the timeout caused by
https://issues.apache.org/jira/browse/SPARK-23055; the tests never finish. I
followed the thread a bit further and wasn't clear whether it was subsequently
re-fixed for 2.3.0 or not. It says it's resolved along with
https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0, though I am still
seeing these tests fail or hang:

- subscribing topic by name from earliest offsets (failOnDataLoss: false)
- subscribing topic by name from earliest offsets (failOnDataLoss: true)

Sean, while some of these tests were timing out on RC1, we're not aware of any
known issues in RC2. Both maven
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
and sbt
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
healthy. If you're still seeing timeouts in RC2, can you create a JIRA with
any applicable build/env info?
 
On Tue, Jan 23, 2018 at 9:01 AM Sean Owen <so...@cloudera.com> wrote:

I'm not seeing that same problem on OS X and /usr/bin/tar. I tried unpacking it 
with 'xvzf' and also unzipping it first, and it untarred without warnings in 
either case.
I am encountering errors while running the tests, different ones each time, so 
am still figuring out whether there is a real problem or just flaky tests.
These issues look like blockers, as they are inherently meant to be completed
before the 2.3 release, and they are mostly not done. I suppose I'd -1 on
behalf of those who say this needs to be done first; we can keep testing,
though.

SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
SPARK-23114 Spark R 2.3 QA umbrella
Here are the remaining items targeted for 2.3:

SPARK-15689 Data source API v2
SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
SPARK-21646 Add new type coercion rules to compatible with Hive
SPARK-22386 Data Source V2 improvements
SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
SPARK-22735 Add VectorSizeHint to ML features documentation
SPARK-22739 Additional Expression Support for Objects
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-22820 Spark 2.3 SQL API audit

On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin <van...@cloudera.com> wrote:

+0

Signatures check out. Code compiles, although I see the errors in [1]
when untarring the source archive; perhaps we should add "use GNU tar"
to the RM checklist?

Also ran our internal tests and they seem happy.

My concern is the list of open bugs targeted at 2.3.0 (ignoring the
documentation ones). It is not long, but it seems some of those need
to be looked at. It would be nice for the committers who are involved
in those bugs to take a look.

[1] https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt


On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal <samee...@apache.org> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am UTC and
> passes if a majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc2:
> https://github.com/apache/spark/tree/v2.3.0-rc2
> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1262/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
>
>
> FAQ
>
> =======================================
> What are the unresolved issues targeted for 2.3.0?
> =======================================
>
> Please see https://s.apache.org/oXKi. At the time of writing, there are
> currently no known release blockers.
>
> =========================
> How can I help test this release?
> =========================
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks; in Java/Scala you can
> add the staging repository to your project's resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out-of-date RC going forward).
>
> ===========================================
> What should happen to JIRA tickets still targeting 2.3.0?
> ===========================================
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
> appropriate.
>
> ===================
> Why is my bug not fixed?
> ===================
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something which is a regression from 2.2.0 and has not been
> correctly targeted please ping me or a committer to help target the issue
> (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>
>
> Regards,
> Sameer



--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org










--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag









  



  
