GitHub user yhuai opened a pull request:
https://github.com/apache/spark/pull/11697
[SPARK-13207][SQL][BRANCH-1.6] Make partitioning discovery ignore _SUCCESS
files.
If a `_SUCCESS` file appears in an inner partition directory, partition
discovery treats it as a data file. Discovery then fails because the directory
structure no longer looks valid. We should ignore those `_SUCCESS` files.
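The failure mode described above can be sketched in a few lines. This is an illustration only, not Spark's implementation: the function names, the `ignore_success` flag, and the example paths are all hypothetical, and the sketch assumes discovery fails whenever leaf files disagree on their partition column names.

```python
from typing import Iterable, Tuple


def partition_columns(file_path: str, base_path: str) -> Tuple[str, ...]:
    """Return the partition column names encoded in the `key=value`
    directories between base_path and the file (hypothetical helper)."""
    relative = file_path[len(base_path):].strip("/")
    directories = relative.split("/")[:-1]  # drop the file name itself
    return tuple(d.split("=", 1)[0] for d in directories if "=" in d)


def discover_partition_columns(
    file_paths: Iterable[str], base_path: str, ignore_success: bool = True
) -> Tuple[str, ...]:
    """Infer one consistent list of partition columns, or raise ValueError.

    With ignore_success=False, a `_SUCCESS` marker is treated like a data
    file, which is the behavior the PR description reports as a bug.
    """
    paths = list(file_paths)
    if ignore_success:
        paths = [p for p in paths if p.rsplit("/", 1)[-1] != "_SUCCESS"]
    layouts = {partition_columns(p, base_path) for p in paths}
    if len(layouts) > 1:
        raise ValueError(f"Conflicting partition column names: {sorted(layouts)}")
    return next(iter(layouts), ())


# Hypothetical layout: a `_SUCCESS` marker sits in the inner dir `a=1`.
files = [
    "/table/a=1/_SUCCESS",
    "/table/a=1/b=2/part-0.parquet",
    "/table/a=1/b=3/part-0.parquet",
]
columns = discover_partition_columns(files, "/table")
```

With the marker filtered out, every remaining file agrees on the columns `a` and `b`. Passing `ignore_success=False` makes the marker contribute the shorter layout `(a,)`, so the sketch raises `ValueError`.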
In the future, it would be better to ignore all files/dirs starting with `_`
or `.`. This PR does not make that change. I want to keep this change simple
so that we can consider getting it into branch 1.6.
To ignore all files/dirs starting with `_` or `.`, the main change is to let
ParquetRelation have another way to get metadata files. Right now, it relies on
FileStatusCache's cachedLeafStatuses, which returns the file statuses of both
metadata files (e.g. metadata files used by Parquet) and data files. Separating
the two requires more changes.
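To show why the broader rule needs a separate path for metadata files, here is a small sketch. It is not Spark code: `is_hidden`, `split_leaf_files`, and the example paths are hypothetical. It assumes the Parquet summary files are named `_metadata` and `_common_metadata`, which a blanket `_`/`.` filter would hide along with `_SUCCESS`.

```python
from typing import Iterable, List, Tuple

# Assumption: names of the Parquet summary (metadata) files.
PARQUET_SUMMARY_FILES = {"_metadata", "_common_metadata"}


def is_hidden(name: str) -> bool:
    """The broader rule from the description: names starting with `_` or `.`."""
    return name.startswith("_") or name.startswith(".")


def split_leaf_files(file_paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split leaf files into (data_files, metadata_files).

    Metadata files are picked out by name before the hidden-file rule is
    applied; otherwise the filter would drop them with everything else.
    """
    data_files: List[str] = []
    metadata_files: List[str] = []
    for path in file_paths:
        name = path.rsplit("/", 1)[-1]
        if name in PARQUET_SUMMARY_FILES:
            metadata_files.append(path)
        elif not is_hidden(name):
            data_files.append(path)
    return data_files, metadata_files


# Hypothetical leaf files under one table directory.
leaf_files = [
    "/table/_metadata",
    "/table/_common_metadata",
    "/table/a=1/_SUCCESS",
    "/table/a=1/.part-0.parquet.crc",
    "/table/a=1/part-0.parquet",
]
data_files, metadata_files = split_leaf_files(leaf_files)
```

In this sketch only `part-0.parquet` is kept as data, the two summary files are returned separately, and `_SUCCESS` and the `.crc` file are dropped.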
https://issues.apache.org/jira/browse/SPARK-13207
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yhuai/spark SPARK13207_branch16
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11697.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11697
----
commit 02908f45af6618da627df77a67dc474cd3e6496d
Author: Yin Huai <[email protected]>
Date: 2016-03-14T16:03:13Z
[SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files.
If a `_SUCCESS` file appears in an inner partition directory, partition
discovery treats it as a data file. Discovery then fails because the directory
structure no longer looks valid. We should ignore those `_SUCCESS` files.
In the future, it would be better to ignore all files/dirs starting with `_`
or `.`. This PR does not make that change. I want to keep this change simple
so that we can consider getting it into branch 1.6.
To ignore all files/dirs starting with `_` or `.`, the main change is to let
ParquetRelation have another way to get metadata files. Right now, it relies on
FileStatusCache's cachedLeafStatuses, which returns the file statuses of both
metadata files (e.g. metadata files used by Parquet) and data files. Separating
the two requires more changes.
https://issues.apache.org/jira/browse/SPARK-13207
Author: Yin Huai <[email protected]>
Closes #11088 from yhuai/SPARK-13207.
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
----