Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/15304 )
Change subject: IMPALA-9389: [DOCS] Support reading zstd text files
......................................................................


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/15304/3/docs/topics/impala_txtfile.xml
File docs/topics/impala_txtfile.xml:

http://gerrit.cloudera.org:8080/#/c/15304/3/docs/topics/impala_txtfile.xml@650
PS3, Line 650: capability. Impala can read compressed text files written by Hive or compressed by the
> I don't think it is necessary to document the details such as streaming/blo

I did some experimentation, and while Impala can read text files compressed by the standard gzip, bzip2, and zstd libraries, there are some exceptions. Hadoop uses a custom block format for snappy which is not compatible with regular snappy. So, I am hesitant to claim that Impala can read files compressed by the standard library implementations. Moreover, I don't think we test that scenario for all codecs. We do have interop tests in the repository which test interop between Hive and Impala text codecs. So, I think we should instead just claim:
"Impala can read compressed text files written by Hive"


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml
File docs/topics/impala_txtfile.xml:

http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@124
PS4, Line 124: In Impala 2.0 and later, you can also use text data compressed in the gzip, bzip2, or Snappy formats.

We can probably also include 'zstd' here.


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@141
PS4, Line 141: </p>

We can add a new para here:
"Impala supports zstd files created by the zstd command line tool"


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@653
PS4, Line 653: or zstd-compressed text file is processed, the node doing the
             : work reads the entire file into memory and then decompresses it. Therefore, the node must
             : have enough memory to hold both the compressed and uncompressed data from the text file
> For text zstd decompression, we're using streaming, which doesn't load all

Good point. zstd, bzip2, and gzip all use streaming, so they don't have to read the entire file into memory. We should update the documentation.

This particular section is text-specific documentation, so I don't think we should add any info about Parquet. I think Parquet doesn't have that problem since the unit of compression is a page and not a file.


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@659
PS4, Line 659: gzip-compressed text files. The gzipped data is decompressed as it is read, rather than

This applies to gzip, bzip2, and zstd. The data is decompressed as it is read.


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@691
PS4, Line 691: ...make equivalent .gz, .bz2, and .snappy files and load them into same table directory...

I think we can also include `.zst` files here. If we have different compressed files in the table directory, Impala can read them. This applies to zstd too.

Similarly, the `select *` example below should also be updated to return data read from `.zst` files. The `hdfs dfs -ls` example should also be updated to show the `.zst` file along with the others.


-- 
To view, visit http://gerrit.cloudera.org:8080/15304
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic83137bd2c3a49398fb60cf1901f8b74ed111fce
Gerrit-Change-Number: 15304
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xiaomeng Zhang <[email protected]>
Gerrit-Comment-Date: Fri, 28 Feb 2020 15:21:43 +0000
Gerrit-HasComments: Yes
