Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22528#discussion_r219691537
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
---
```diff
@@ -41,7 +42,12 @@ object CodecStreams {
     getDecompressionCodec(config, file)
       .map(codec => codec.createInputStream(inputStream))
-      .getOrElse(inputStream)
+      .orElse {
+        if (file.getName.toLowerCase.endsWith(".zip")) {
+          val zip = new ZipArchiveInputStream(inputStream)
+          if (zip.getNextEntry != null) Some(zip) else None
+        } else None
+      }.getOrElse(inputStream)
```
--- End diff --
> ... but isn't this difficult to extend this support to non multiline
modes ...
In per-line mode, we would need to teach Hadoop's line reader how to read zip
archives. The feasible way seems to be via the `Decompressor` interface. The
problem is that the interface is designed to read compressed data in a
streaming fashion, which contradicts the structure of the ZIP format, where the
central directory is located at the end of the file.
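To illustrate the contradiction, here is a minimal JDK-only sketch (the class and entry names are made up for illustration). A streaming zip reader can only discover entries one by one from their local headers as they arrive; the authoritative entry list (the central directory) sits at the very end of the archive, which a forward-only `Decompressor`-style stream never gets to consult up front:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipStreamDemo {
  // Build a small zip archive in memory with two independently compressed entries.
  static byte[] makeZip() throws Exception {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(buf)) {
      zos.putNextEntry(new ZipEntry("a.json"));
      zos.write("{\"id\": 1}\n".getBytes("UTF-8"));
      zos.closeEntry();
      zos.putNextEntry(new ZipEntry("b.json"));
      zos.write("{\"id\": 2}\n".getBytes("UTF-8"));
      zos.closeEntry();
    }
    return buf.toByteArray();
  }

  // A streaming reader sees entries one by one via their local headers; it
  // never reads the central directory at the end of the archive. A push-style
  // Decompressor sees only this forward byte stream, so it cannot know the
  // full entry list in advance.
  static List<String> entryNames(byte[] zipBytes) throws Exception {
    List<String> names = new ArrayList<>();
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      ZipEntry e;
      while ((e = zis.getNextEntry()) != null) {
        names.add(e.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(entryNames(makeZip()));  // prints [a.json, b.json]
  }
}
```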
> and multiple files) as well
In general, it is possible to support per-line mode and multiple files if
`LineRecordReader` is extended in the same way as I do in this PR (not via the
`Decompressor` API), but then `LineRecordReader`, `SplitLineReader`, and maybe
other classes would have to be moved out of Hadoop into Spark. From my point of
view it could be done, but not in this simple PR.
> Basically deflate is the same codec and I wonder if we really should
allow this zip one specifically in multiline mode for CSV / JSON specifically
with a clear restriction (single file).
The significant difference between `zip` and `deflate` is that the former is a
kind of container of independently compressed entries. In Spark's datasources
there is currently a restriction that one file maps to one `InputStream`:
- in per-line mode:
  ```java
  in = new SplitLineReader(codec.createInputStream(fileIn, decompressor),
    job, this.recordDelimiterBytes);
  filePosition = fileIn;
  ```
  If you return one `InputStream` for all files in a zip archive, it can be
difficult to differentiate the last line of one file from the first line of the
next.
- in multi-line mode, the `readFile` methods would have to be modified to
handle multiple `InputStream`s per `file: PartitionedFile` (to eliminate the
restriction of one file per zip archive).
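The container-vs-stream difference can be shown with a minimal JDK-only sketch (the class and method names here are made up for illustration): a deflate payload is one continuous compressed stream, so a single wrapping `InputStream` suffices, while a zip archive holds several independently compressed entries, each a logical "file" of its own:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipVsDeflate {
  // deflate: the whole payload is one continuous compressed stream, so one
  // wrapping InputStream (as CodecStreams does for codecs) is enough.
  static byte[] deflate(byte[] data) throws Exception {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(buf)) {
      dos.write(data);
    }
    return buf.toByteArray();
  }

  static byte[] inflate(byte[] compressed) throws Exception {
    try (InflaterInputStream iis =
        new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      return iis.readAllBytes();
    }
  }

  // zip: a container of independently compressed entries. Mapping one
  // PartitionedFile to one InputStream only fits single-entry archives.
  static byte[] makeZip(int entries) throws Exception {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(buf)) {
      for (int i = 0; i < entries; i++) {
        zos.putNextEntry(new ZipEntry("part-" + i + ".csv"));
        zos.write(("id\n" + i + "\n").getBytes("UTF-8"));
        zos.closeEntry();
      }
    }
    return buf.toByteArray();
  }

  static int countZipEntries(byte[] zipBytes) throws Exception {
    int n = 0;
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      while (zis.getNextEntry() != null) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) throws Exception {
    byte[] data = "a,b\n1,2\n".getBytes("UTF-8");
    // One stream in, one logical file out.
    System.out.println(new String(inflate(deflate(data)), "UTF-8"));
    // One stream in, several logical files out.
    System.out.println(countZipEntries(makeZip(3)));  // prints 3
  }
}
```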
For sure, all of those issues can be solved, but:
- Multiple files per zip archive is a significantly rarer case in the wild
than a single zipped JSON/CSV file. Maybe you have other statistics.
- Supporting per-line mode demands fundamental changes: extending Hadoop's
per-line reader and moving it into Spark.
---