Github user zjffdu commented on the pull request:

    https://github.com/apache/spark/pull/9490#issuecomment-155713852
  
    Comments on the change:
    The case of empty or non-existent inputs is a little tricky. Here are the cases I've identified:
    * The inputs are only parsed at the execution stage, e.g. TextRelation.
    * The inputs need to be parsed at the analysis stage, e.g. JsonRelation, ParquetRelation & OrcRelation.
    * The inputs don't need to be parsed if the schema is provided (when creating a table), e.g. ParquetRelation & OrcRelation.
    * Empty inputs are also valid, e.g. JsonRelation can accept an RDD[String] rather than reading from HDFS.
    * Empty inputs are also valid when creating a table.
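    To illustrate the RDD[String] case above, here is a sketch against the Spark 1.x API (the SparkConf setup and sample records are illustrative, not from the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonFromRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("json-from-rdd").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // No HDFS input is involved here: the JSON lines live in an
    // in-memory RDD[String], so "empty" file inputs cannot be
    // treated as an error unconditionally.
    val jsonRdd = sc.parallelize(Seq(
      """{"name":"a","age":1}""",
      """{"name":"b","age":2}"""))

    // JsonRelation infers the schema at analysis time from the RDD.
    val df = sqlContext.read.json(jsonRdd)
    df.printSchema()
    sc.stop()
  }
}
```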
    
    To handle these cases, I made the following changes:
    * Added 2 APIs to HadoopFsRelation that subclasses can override. Currently only JsonRelation overrides them.
    ** def inputExists: Boolean = fileStatusCache.inputExists
    ** def readFromHDFS: Boolean = true
    * If the inputs are only empty directories, they are valid; just return an EmptyRDD.
    * If the inputs are non-existent directories/files, they are invalid; just throw an exception.
    * If a relation needs to parse data at the analysis stage, it is the subclass's responsibility to check whether the inputs are empty. The parent class (HadoopFsRelation) only checks the inputs at the execution stage.
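    The checks above can be sketched roughly as follows. This is a hypothetical outline, not the actual patch: only inputExists and readFromHDFS come from the PR; FileStatusCache's shape and the checkInputs helper are illustrative.

```scala
// Illustrative stand-in for the relation's file-status cache.
trait FileStatusCache {
  def inputExists: Boolean
}

abstract class HadoopFsRelationSketch {
  protected def fileStatusCache: FileStatusCache

  // The two overridable hooks from the PR. Subclasses
  // (currently only JsonRelation) may override them, e.g. to
  // report inputExists = true when reading from an RDD[String].
  def inputExists: Boolean = fileStatusCache.inputExists
  def readFromHDFS: Boolean = true

  // Execution-stage check performed by the parent class
  // (hypothetical helper name).
  protected def checkInputs(): Unit = {
    if (readFromHDFS && !inputExists) {
      // Non-existent directories/files are invalid.
      throw new java.io.FileNotFoundException("Input path does not exist")
    }
    // Existing-but-empty directories are valid: the scan
    // simply returns an EmptyRDD.
  }
}
```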

