Abhishek Rawat has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15304 )

Change subject: IMPALA-9389: [DOCS] Support reading zstd text files
......................................................................


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/15304/3/docs/topics/impala_txtfile.xml
File docs/topics/impala_txtfile.xml:

http://gerrit.cloudera.org:8080/#/c/15304/3/docs/topics/impala_txtfile.xml@650
PS3, Line 650:         capability. Impala can read compressed text files 
written by Hive or compressed by the
> I don't think it is necessary to document the details such as streaming/blo
I did some experimentation: Impala can read text files compressed by the 
standard gzip, bzip2, and zstd libraries, but there are exceptions. Hadoop uses 
a custom block format for snappy that is not compatible with regular snappy. So 
I am hesitant to claim that Impala can read files compressed by the standard 
library implementation of every codec.

Moreover, I don't think we test that scenario for all codecs. We do have 
interop tests in the repository that test interop between Hive and Impala text 
codecs. So I think we should instead just claim:
"Impala can read compressed text files written by Hive"


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml
File docs/topics/impala_txtfile.xml:

http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@124
PS4, Line 124:         In Impala 2.0 and later, you can also use text data 
compressed in the gzip, bzip2, or Snappy formats.
We can probably also include 'zstd' here.


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@141
PS4, Line 141:         </p>
We can add a new para here:
"Impala supports zstd files created by the zstd command line tool"


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@653
PS4, Line 653: or zstd-compressed text file is processed, the node doing the
             :         work reads the entire file into memory and then 
decompresses it. Therefore, the node must
             :         have enough memory to hold both the compressed and 
uncompressed data from the text file
> For text zstd decompression, we're using streaming, which doesn't load all
Good point. zstd, bzip2, and gzip all use streaming, so they don't have to 
read the entire file into memory. We should update the documentation.

This particular section is text-specific documentation, so I don't think we 
should add any info about Parquet. I think Parquet doesn't have that problem 
since the unit of compression is a page and not a file.
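
To illustrate the difference between streaming and whole-file decompression, 
here is a minimal Python sketch. It is not Impala's implementation; the helper 
name `stream_decompress` and the chunk size are assumptions made for the 
example, and it covers only gzip and bzip2 because those have standard library 
decompressor objects. Only one compressed chunk is held in memory at a time.

```python
import bz2
import gzip
import io
import zlib


def stream_decompress(fileobj, decompressor, chunk_size=64 * 1024):
    """Yield decompressed data while reading one compressed chunk at a time.

    Hypothetical helper for illustration; `decompressor` is any object with
    a `decompress(bytes) -> bytes` method.
    """
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data


# Sample text data, compressed in memory to stand in for files on disk.
text = b"id,name\n" + b"1,alice\n" * 10000
gz_stream = io.BytesIO(gzip.compress(text))
bz_stream = io.BytesIO(bz2.compress(text))

# MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
gz_out = b"".join(
    stream_decompress(gz_stream, zlib.decompressobj(zlib.MAX_WBITS | 16), 1024)
)
bz_out = b"".join(stream_decompress(bz_stream, bz2.BZ2Decompressor(), 1024))
```

A real reader would process each yielded chunk as it arrives; the `join` calls 
here exist only so the result can be compared with the input.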


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@659
PS4, Line 659:           gzip-compressed text files. The gzipped data is 
decompressed as it is read, rather than
This applies to gzip, bzip2 and zstd. The data is decompressed as it is read.


http://gerrit.cloudera.org:8080/#/c/15304/4/docs/topics/impala_txtfile.xml@691
PS4, Line 691: ...make equivalent .gz, .bz2, and .snappy files and load them 
into same table directory...
I think we can also include `.zst` files here. If we have different compressed 
files in the table directory, Impala can read them. This applies to zstd too. 
Similarly, the `select *` example below should also be updated to return data 
read from `.zst` files. The `hdfs dfs -ls` output should also be updated to 
show the `.zst` file along with the others.
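
As a rough analogy for reading one directory that mixes codecs, the Python 
sketch below picks a decompressor per file from its extension. This is not how 
Impala is implemented; `read_table_dir`, `OPENERS`, and the file names are 
hypothetical, and `.zst` is left out only to keep the sketch on the standard 
`gzip` and `bz2` modules.

```python
import bz2
import gzip
import tempfile
from pathlib import Path

# Map file extension to an opener; anything else is treated as plain text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def read_table_dir(table_dir):
    """Return all rows from every file in the directory, in file-name order."""
    rows = []
    for path in sorted(Path(table_dir).iterdir()):
        opener = OPENERS.get(path.suffix, open)
        with opener(path, "rt") as f:
            rows.extend(line.rstrip("\n") for line in f)
    return rows


with tempfile.TemporaryDirectory() as d:
    # One uncompressed file and two compressed ones in the same directory.
    (Path(d) / "part1.csv").write_text("1,plain\n")
    with gzip.open(Path(d) / "part2.csv.gz", "wt") as f:
        f.write("2,gzip\n")
    with bz2.open(Path(d) / "part3.csv.bz2", "wt") as f:
        f.write("3,bzip2\n")
    rows = read_table_dir(d)
```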



--
To view, visit http://gerrit.cloudera.org:8080/15304
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic83137bd2c3a49398fb60cf1901f8b74ed111fce
Gerrit-Change-Number: 15304
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xiaomeng Zhang <[email protected]>
Gerrit-Comment-Date: Fri, 28 Feb 2020 15:21:43 +0000
Gerrit-HasComments: Yes
