Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/12774#issuecomment-216411689
  
    @sbcd90 I don't get your example. Your example actually shows that only the file 
`/test_spark/join1.json` is considered in Spark 1.6.1. In Spark master, this is 
broken, as both files will be considered. The reason for this bug is that in 
Spark 1.6.1 there were two code paths - 
[one](https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L67)
 when partitioning is detected, 
[another](https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L121)
 when it is not. As a result, the non-partitioned case did not consider directories 
recursively, which is the intended behavior. 
    
    In current master, after the refactoring, there is only one code path, which 
uses FileCatalog and HDFSFileCatalog and always returns all the files 
recursively, even when there is no partitioning scheme in the directory 
structure. 
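    
    To make the behavior difference concrete, here is a minimal sketch; the 
directory layout and file names are assumed for illustration, based on the 
example discussed in this thread: 
    
    ```scala
    // Assumed layout (illustrative only):
    //   /test_spark/join1.json            <- top-level file
    //   /test_spark/sub_dir/join2.json    <- nested file, no partition columns in the path
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    
    val sc = new SparkContext(new SparkConf().setAppName("listing-check"))
    val sqlContext = new SQLContext(sc)
    
    // Spark 1.6.1 (non-partitioned code path): only join1.json is read.
    // Current master (single FileCatalog/HDFSFileCatalog path): both files are read,
    // even though no partitioning scheme is present in the directory structure.
    val df = sqlContext.read.json("/test_spark")
    println(df.count())
    ```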
    
    


