Yes, this is not a blocker. "spark.sql.optimizer.nestedSchemaPruning.enabled" is intentionally off by default. As DB Tsai said, column pruning of nested schemas for Parquet tables is experimental. In this release we encourage the whole community to try this new feature, but it may still have bugs, like the one reported in SPARK-25879.
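For anyone who wants to try it out, here is a minimal sketch (in spark-shell; the Parquet path and schema below are made up for illustration):

    // Enable the experimental nested schema pruning (off by default in 2.4).
    spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

    val contacts = spark.read.parquet("/tmp/contacts")  // hypothetical dataset
    // With pruning enabled, only the name.first leaf column should be read
    // from the Parquet files, not the entire name struct.
    contacts.select("name.first").show()
    // Selecting a nested field together with a top-level field is the query
    // shape that SPARK-25879 reports as failing.
    contacts.select("name.first", "id").show()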
We can still fix the issue in a Spark 2.4.x maintenance release, as long as the risk is not high.

Thanks,
Xiao

On Mon, Oct 29, 2018 at 11:49 PM DB Tsai <dbt...@dbtsai.com.invalid> wrote:
> +0
>
> I understand that schema pruning is an experimental feature in Spark
> 2.4, and it can help read performance a lot, as people are trying
> to keep their hierarchical data in nested format.
>
> We just found a serious bug: it can fail the Parquet reader when a nested
> field and a top-level field are selected simultaneously.
> https://issues.apache.org/jira/browse/SPARK-25879
>
> If we decide not to fix it in 2.4, we should at least document it in
> the release notes to let users know.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0x5CED8B896A6BDFA0
>
> On Mon, Oct 29, 2018 at 8:42 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > +1
> >
> > On Tue, Oct 30, 2018 at 11:03 AM Gengliang Wang <ltn...@gmail.com> wrote:
> >>
> >> +1
> >>
> >> > On Oct 30, 2018, at 10:41 AM, Sean Owen <sro...@gmail.com> wrote:
> >> >
> >> > +1
> >> >
> >> > Same result as with RC4 from me, and the issues I know of that were
> >> > raised with RC4 are resolved. I tested against Scala 2.12 and 2.11.
> >> >
> >> > These items are still targeted to 2.4.0; Xiangrui, I assume these
> >> > should just be untargeted now, or resolved?
> >> > SPARK-25584 Document libsvm data source in doc site
> >> > SPARK-25346 Document Spark builtin data sources
> >> > SPARK-24464 Unit tests for MLlib's Instrumentation
> >> >
> >> > On Mon, Oct 29, 2018 at 5:22 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >> >>
> >> >> Please vote on releasing the following candidate as Apache Spark version 2.4.0.
> >> >>
> >> >> The vote is open until November 1 PST and passes if a majority of +1 PMC
> >> >> votes are cast, with a minimum of 3 +1 votes.
> >> >>
> >> >> [ ] +1 Release this package as Apache Spark 2.4.0
> >> >> [ ] -1 Do not release this package because ...
> >> >>
> >> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >> >>
> >> >> The tag to be voted on is v2.4.0-rc5 (commit 0a4c03f7d084f1d2aa48673b99f3b9496893ce8d):
> >> >> https://github.com/apache/spark/tree/v2.4.0-rc5
> >> >>
> >> >> The release files, including signatures, digests, etc. can be found at:
> >> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-bin/
> >> >>
> >> >> Signatures used for Spark RCs can be found in this file:
> >> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >>
> >> >> The staging repository for this release can be found at:
> >> >> https://repository.apache.org/content/repositories/orgapachespark-1291
> >> >>
> >> >> The documentation corresponding to this release can be found at:
> >> >> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-docs/
> >> >>
> >> >> The list of bug fixes going into 2.4.0 can be found at the following URL:
> >> >> https://issues.apache.org/jira/projects/SPARK/versions/12342385
> >> >>
> >> >> FAQ
> >> >>
> >> >> =========================
> >> >> How can I help test this release?
> >> >> =========================
> >> >>
> >> >> If you are a Spark user, you can help us test this release by taking
> >> >> an existing Spark workload, running it on this release candidate, and
> >> >> then reporting any regressions.
> >> >>
> >> >> If you're working in PySpark, you can set up a virtual env, install
> >> >> the current RC, and see if anything important breaks. In Java/Scala,
> >> >> you can add the staging repository to your project's resolvers and test
> >> >> with the RC (make sure to clean up the artifact cache before/after so
> >> >> you don't end up building with an out-of-date RC going forward).
> >> >>
> >> >> ===========================================
> >> >> What should happen to JIRA tickets still targeting 2.4.0?
> >> >> ===========================================
> >> >>
> >> >> The current list of open tickets targeted at 2.4.0 can be found at:
> >> >> https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 2.4.0
> >> >>
> >> >> Committers should look at those and triage. Extremely important bug
> >> >> fixes, documentation, and API tweaks that impact compatibility should
> >> >> be worked on immediately. Everything else, please retarget to an
> >> >> appropriate release.
> >> >>
> >> >> ==================
> >> >> But my bug isn't fixed?
> >> >> ==================
> >> >>
> >> >> In order to make timely releases, we will typically not hold the
> >> >> release unless the bug in question is a regression from the previous
> >> >> release. That being said, if there is a regression that has not been
> >> >> correctly targeted, please ping me or a committer to help target the issue.
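As an illustration of the Java/Scala testing instructions above, pointing an sbt build at the staging repository might look like this sketch (the resolver name is arbitrary, and the cache path assumes sbt's default Ivy layout):

    // build.sbt -- test against the RC5 staging repository
    resolvers += "apache-spark-2.4.0-rc5-staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1291"

    // The RC is published under the final version number.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

Afterwards, remove the cached 2.4.0 artifacts (e.g. under ~/.ivy2/cache/org.apache.spark) so you don't keep building against a stale RC.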