Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/11270#issuecomment-186457302
  
    @rxin Actually, as you know, `spark.sql.sources.default` can be set to a 
different datasource, so I think we would either have to add logic that validates 
every supported datasource against the files in this way, or add nothing to avoid 
breaking changes.
    
    If we go for the validation, there are several concerns. 
    
    1. For Parquet we might be able to use the "magic number" you mentioned, but as 
far as I remember there is no such thing for ORC; an ORC file just starts with index 
data. (For CSV and JSON we might be able to do something similar by reading a few 
bytes from the beginning of each file.)
    
    2. Reading a few bytes can be done simply by reading them directly (see the 
rough sketch after this list), but if we need to read anything more (for example, 
the footer of an ORC file to validate it), this will add complexity just like [this in 
Parquet](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L757-L775).
    
    3. Driver-side overhead would increase quite a bit, because we would basically 
need to touch every file to check that it actually matches the expected datasource.
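    
    As a rough sketch of what the simple byte check in (1) and (2) could look like 
(the helper name `looksLikeParquet` is made up here for illustration, and this only 
peeks at the leading Parquet magic bytes `PAR1`):
    
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: returns true if the file starts with the Parquet
// magic bytes "PAR1". Only a sketch; a real check would probably also
// verify the trailing footer magic and handle corrupt files more carefully.
def looksLikeParquet(path: Path, conf: Configuration): Boolean = {
  val fs = FileSystem.get(path.toUri, conf)
  if (fs.getFileStatus(path).getLen < 4) {
    false
  } else {
    val in = fs.open(path)
    try {
      val magic = new Array[Byte](4)
      in.readFully(0, magic)
      magic.sameElements("PAR1".getBytes("US-ASCII"))
    } finally {
      in.close()
    }
  }
}
```
    
    Even a check this small would have to run once per file on the driver, which is 
where the overhead in (3) comes from.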
    
    Could we maybe handle this issue in a different PR?

