Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22528#discussion_r219691537
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
---
```diff
@@ -41,7 +42,12 @@ object CodecStreams {
     getDecompressionCodec(config, file)
       .map(codec => codec.createInputStream(inputStream))
-      .getOrElse(inputStream)
+      .orElse {
+        if (file.getName.toLowerCase.endsWith(".zip")) {
+          val zip = new ZipArchiveInputStream(inputStream)
+          if (zip.getNextEntry != null) Some(zip) else None
+        } else None
+      }.getOrElse(inputStream)
```
--- End diff --
> ... but isn't this difficult to extend this support to non multiline
modes ...
In per-line mode, we would need to teach Hadoop's line reader how to read zip
archives. The feasible way seems to be via the `Decompressor` interface. The
problem is that the interface is designed to read compressed data in a
streaming fashion, which contradicts the structure of the ZIP format, where the
central directory is located at the end of the file.
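To illustrate the contradiction, here is a minimal JDK-only sketch (the class and entry names are made up for illustration). A streaming zip reader can only discover entries one by one from their local headers as they arrive; the authoritative entry list (the central directory) sits at the very end of the archive, which a forward-only `Decompressor`-style stream never gets to consult up front:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipStreamDemo {
  // Build a small zip archive in memory with two independently compressed entries.
  static byte[] makeZip() throws Exception {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(buf)) {
      zos.putNextEntry(new ZipEntry("a.json"));
      zos.write("{\"id\": 1}\n".getBytes("UTF-8"));
      zos.closeEntry();
      zos.putNextEntry(new ZipEntry("b.json"));
      zos.write("{\"id\": 2}\n".getBytes("UTF-8"));
      zos.closeEntry();
    }
    return buf.toByteArray();
  }

  // A streaming reader sees entries one by one via their local headers; it
  // never reads the central directory at the end of the archive. A push-style
  // Decompressor sees only this forward byte stream, so it cannot know the
  // full entry list in advance.
  static List<String> entryNames(byte[] zipBytes) throws Exception {
    List<String> names = new ArrayList<>();
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      ZipEntry e;
      while ((e = zis.getNextEntry()) != null) {
        names.add(e.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(entryNames(makeZip()));  // prints [a.json, b.json]
  }
}
```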
> and multiple files) as well
In general, it is possible to support per-line mode and multiple files if
`LineRecordReader` is extended in the same way as I do in this PR (not via the
`Decompressor` API), but then `LineRecordReader`, `SplitLineReader`, and maybe
other classes would have to be moved out of Hadoop into Spark. From my point of
view it could be done, but not in this simple PR.
> Basically deflate is the same codec and I wonder if we really should
allow this zip one specifically in multiline mode for CSV / JSON specifically
with a clear restriction (single file).
The significant difference between `zip` and `deflate` is that the former is a
kind of container of independently compressed entries. In Spark's datasources
there is currently a restriction that one file maps to one `InputStream`:
- in per-line mode:
  ```java
  in = new SplitLineReader(codec.createInputStream(fileIn, decompressor),
    job, this.recordDelimiterBytes);
  filePosition = fileIn;
  ```
  If you return one `InputStream` for all files in a zip archive, it can be
difficult to differentiate the last line of one file from the first line of the
next.
- in multi-line mode, the `readFile` methods would have to be modified to
handle multiple `InputStream`s per `file: PartitionedFile` (to eliminate the
restriction of one file per zip archive).
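The container-vs-stream difference can be shown with a minimal JDK-only sketch (the class and method names here are made up for illustration): a deflate payload is one continuous compressed stream, so a single wrapping `InputStream` suffices, while a zip archive holds several independently compressed entries, each a logical "file" of its own:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipVsDeflate {
  // deflate: the whole payload is one continuous compressed stream, so one
  // wrapping InputStream (as CodecStreams does for codecs) is enough.
  static byte[] deflate(byte[] data) throws Exception {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(buf)) {
      dos.write(data);
    }
    return buf.toByteArray();
  }

  static byte[] inflate(byte[] compressed) throws Exception {
    try (InflaterInputStream iis =
        new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      return iis.readAllBytes();
    }
  }

  // zip: a container of independently compressed entries. Mapping one
  // PartitionedFile to one InputStream only fits single-entry archives.
  static byte[] makeZip(int entries) throws Exception {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(buf)) {
      for (int i = 0; i < entries; i++) {
        zos.putNextEntry(new ZipEntry("part-" + i + ".csv"));
        zos.write(("id\n" + i + "\n").getBytes("UTF-8"));
        zos.closeEntry();
      }
    }
    return buf.toByteArray();
  }

  static int countZipEntries(byte[] zipBytes) throws Exception {
    int n = 0;
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      while (zis.getNextEntry() != null) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) throws Exception {
    byte[] data = "a,b\n1,2\n".getBytes("UTF-8");
    // One stream in, one logical file out.
    System.out.println(new String(inflate(deflate(data)), "UTF-8"));
    // One stream in, several logical files out.
    System.out.println(countZipEntries(makeZip(3)));  // prints 3
  }
}
```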
For sure, all of those issues can be solved, but:
- Multiple files per zip archive is a significantly rarer case in the wild
than a single zipped JSON/CSV file. Maybe you have other statistics.
- Supporting per-line mode demands fundamental changes: extending Hadoop's
per-line reader and moving it into Spark.
---