GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/8035

    [SPARK-9743] [SQL] Fixes JSONRelation refreshing

    PR #7969 added two `HadoopFsRelation.refresh()` calls ([this] [1] and 
[this] [2]) in `DataSourceStrategy` to make the test case `InsertSuite.save 
directly to the path of a JSON table` pass. However, this forces every 
`HadoopFsRelation` table scan to do a refresh, which can be very expensive for 
tables with a large number of partitions.
    
    The original test case fails without the `refresh()` calls because the 
old JSON relation builds the base RDD from the input paths, while 
`HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON 
relation, we can create a temporary table based on a path, write data to 
that path, and then read the newly written data without refreshing the table. 
This is no longer true for `HadoopFsRelation`.
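
    The difference can be illustrated with a toy model. This is only a sketch in Python, not Spark code: `CachedListingRelation` and `PathBasedRelation` are hypothetical stand-ins for a relation that holds a snapshot of leaf files and one that re-reads its input path on every scan.

```python
import os
import tempfile


class CachedListingRelation:
    """Hypothetical stand-in: snapshots the leaf files once and reuses them."""

    def __init__(self, path):
        self.path = path
        self.refresh()

    def refresh(self):
        # Re-list the directory and replace the cached snapshot.
        self._files = sorted(os.listdir(self.path))

    def scan(self):
        # Scans only see files that were present at the last refresh.
        return list(self._files)


class PathBasedRelation:
    """Hypothetical stand-in: resolves the path again on every scan."""

    def __init__(self, path):
        self.path = path

    def scan(self):
        return sorted(os.listdir(self.path))


def demo():
    with tempfile.TemporaryDirectory() as table_dir:
        cached = CachedListingRelation(table_dir)
        path_based = PathBasedRelation(table_dir)

        # Write a new file after both relations were created.
        with open(os.path.join(table_dir, "part-00000.json"), "w") as f:
            f.write('{"a": 1}\n')

        stale = cached.scan()        # snapshot predates the write
        fresh = path_based.scan()    # re-listing picks up the new file
        cached.refresh()
        refreshed = cached.scan()    # visible only after an explicit refresh
        return stale, fresh, refreshed
```

    In this model the cached relation misses the new file until `refresh()` is called, which mirrors why the test needed a refresh somewhere on the read path.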
    
    This PR removes those two expensive refresh calls and moves the refresh 
into `JSONRelation` to fix this issue. We might want to update the 
`HadoopFsRelation` interface to provide better support for this use case.
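
    The cost trade-off can be sketched with another toy model in Python. None of the names below come from Spark: `Relation`, `JsonLikeRelation`, and the two planning functions are hypothetical, and where exactly `JSONRelation` triggers its refresh is an assumption made for illustration.

```python
class Relation:
    """Hypothetical relation that counts how often its listing is rebuilt."""

    def __init__(self, name):
        self.name = name
        self.refresh_count = 0

    def refresh(self):
        # Stands in for an expensive re-listing of all partitions.
        self.refresh_count += 1

    def build_scan(self):
        return f"scan({self.name})"


class JsonLikeRelation(Relation):
    """Hypothetical relation that refreshes itself when a scan is built."""

    def build_scan(self):
        # Assumption: the refresh lives inside the relation's scan path.
        self.refresh()
        return super().build_scan()


def plan_with_strategy_refresh(relations):
    # Before: the planner refreshes every relation it scans.
    plans = []
    for relation in relations:
        relation.refresh()
        plans.append(relation.build_scan())
    return plans


def plan_without_strategy_refresh(relations):
    # After: the planner does not refresh; relations decide for themselves.
    return [relation.build_scan() for relation in relations]
```

    With planner-level refreshing, every relation pays the listing cost on each scan; with the refresh moved into the JSON-like relation, other relation types are scanned without any refresh.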
    
    [1]: 
https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
    [2]: 
https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-9743/fix-json-relation-refreshing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8035
    
----
commit ec1957ded822c458a9d4ed732b955eadc6a2568f
Author: Cheng Lian <[email protected]>
Date:   2015-08-07T16:23:43Z

    Fixes JSONRelation refreshing

----

