Github user zjffdu commented on the pull request:
https://github.com/apache/spark/pull/9490#issuecomment-155713852
Comments for the change:
The case of empty or non-existent inputs is a little tricky. Here are the
cases I have summarized:
* Only parse the inputs at the execution stage, e.g. TextRelation
* Need to parse the inputs at the analysis stage, e.g. JsonRelation,
ParquetRelation & OrcRelation
* Don't need to parse the inputs if the schema is provided (when creating a
table), e.g. ParquetRelation & OrcRelation
* Empty input is also valid, e.g. JsonRelation can accept an RDD[String]
rather than reading from HDFS
* Empty inputs are valid when creating a table.
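As an illustration of the JsonRelation case, JSON can be read from an in-memory RDD[String] with the Spark 1.x DataFrameReader API, so an empty HDFS input is not necessarily an error for it (a minimal sketch; `sc` and `sqlContext` are assumed to be in scope, as in the shell):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// The JSON source here is an in-memory RDD[String], not HDFS paths,
// so there may be no file input at all and the relation is still valid.
val jsonLines: RDD[String] = sc.parallelize(Seq("""{"a": 1}""", """{"b": 2}"""))
val df: DataFrame = sqlContext.read.json(jsonLines)
```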
So for these cases, I made the following changes:
* Add 2 APIs to HadoopFsRelation that subclasses can override. Currently
only JsonRelation overrides them.
** def inputExists: Boolean = fileStatusCache.inputExists
** def readFromHDFS: Boolean = true
* If the inputs are only empty directories, they are valid; just return an
EmptyRDD.
* If the inputs are non-existent directories/files, they are invalid; just
throw an exception.
* If a relation needs to parse data at the analysis stage, it is the
subclass's responsibility to check whether the inputs are empty. The parent
class (HadoopFsRelation) only checks the inputs at the execution stage.
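Putting the pieces together, the execution-stage decision logic could be sketched roughly like this (a simplified, hypothetical rendering of the behavior described above, not the actual HadoopFsRelation code; `leafFiles`, `paths`, and `doBuildScan` are made-up names for illustration):

```scala
abstract class HadoopFsRelation {
  // Overridable hooks; the defaults suit relations that read files from HDFS.
  def inputExists: Boolean = fileStatusCache.inputExists
  def readFromHDFS: Boolean = true

  def buildScan(): RDD[Row] = {
    if (readFromHDFS && !inputExists) {
      // Non-existent directories/files are invalid at execution time.
      throw new java.io.FileNotFoundException(s"Input paths do not exist: $paths")
    } else if (leafFiles.isEmpty) {
      // Existing but empty directories are valid: just return an empty RDD.
      sqlContext.sparkContext.emptyRDD[Row]
    } else {
      doBuildScan()  // normal scan over the discovered files
    }
  }
}
```

A subclass like JsonRelation that can take data from an RDD[String] would override `readFromHDFS` to return false, skipping the existence check entirely.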