[GitHub] spark pull request #22528: [SPARK-25513][SQL] Read zipped CSV and JSON

HyukjinKwon Tue, 25 Sep 2018 18:58:00 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22528#discussion_r220406916
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala ---
    @@ -41,7 +42,12 @@ object CodecStreams {
     
         getDecompressionCodec(config, file)
           .map(codec => codec.createInputStream(inputStream))
    -      .getOrElse(inputStream)
    +      .orElse {
    +        if (file.getName.toLowerCase.endsWith(".zip")) {
    +          val zip = new ZipArchiveInputStream(inputStream)
    +          if (zip.getNextEntry != null) Some(zip) else None
    +        } else None
    +      }.getOrElse(inputStream)
    --- End diff --
    
    It might be feasible but still difficult. This is a simple PR, but I don't 
think we should allow just this single case. 
    
    > Multiple files per zip archive is a significantly rarer case than one zipped 
JSON/CSV in the wild.
    
    Yea, I doubt that's rare because, at least in my (humble and maybe biased) 
experience at my previous company, it wasn't. 
    
    How about we just propose the changes you are currently thinking of? It looks 
like you have a clear idea on this.
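
To make the multi-entry concern concrete, here is a minimal sketch, not code from this PR. It assumes the JDK's `java.util.zip.ZipInputStream` in place of the Commons Compress `ZipArchiveInputStream` used in the diff, and the names `ZipEntriesSketch`, `makeZip`, `readFirstEntry`, and `readAllEntries` are hypothetical. A single `getNextEntry` call, as in the diff, exposes only the first entry; any further entries in the archive are never read unless the caller keeps advancing.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Hypothetical helper object for illustration only; not part of Spark.
object ZipEntriesSketch {

  // Build an in-memory zip archive from (entry name, text content) pairs.
  def makeZip(files: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ZipOutputStream(bytes)
    files.foreach { case (name, content) =>
      out.putNextEntry(new ZipEntry(name))
      out.write(content.getBytes(UTF_8))
      out.closeEntry()
    }
    out.close()
    bytes.toByteArray
  }

  // Read the entry the stream is currently positioned at, up to its end.
  private def readCurrentEntry(zip: ZipInputStream): String = {
    val buf = new Array[Byte](4096)
    val acc = new ByteArrayOutputStream()
    var n = zip.read(buf)
    while (n != -1) {
      acc.write(buf, 0, n)
      n = zip.read(buf)
    }
    new String(acc.toByteArray, UTF_8)
  }

  // Same shape as the diff: position at the first entry and stop there.
  def readFirstEntry(in: InputStream): Option[String] = {
    val zip = new ZipInputStream(in)
    try Option(zip.getNextEntry).map(_ => readCurrentEntry(zip))
    finally zip.close()
  }

  // Keep calling getNextEntry until it returns null, skipping directories.
  def readAllEntries(in: InputStream): List[(String, String)] = {
    val zip = new ZipInputStream(in)
    try {
      Iterator
        .continually(zip.getNextEntry)
        .takeWhile(_ != null)
        .filterNot(_.isDirectory)
        .map(entry => entry.getName -> readCurrentEntry(zip))
        .toList // force evaluation before the stream is closed
    } finally zip.close()
  }
}
```

With an archive holding `a.csv` and `b.csv`, `readFirstEntry` returns only the contents of `a.csv`, while `readAllEntries` returns both. How the entries of one archive should map to splits or rows in a data source is the open design question and is not addressed by this sketch.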



---
