Repository: incubator-impala
Updated Branches:
refs/heads/master 15e6cf8fd -> b2dbcbc2d
IMPALA-5636: Change the metadata in parquet
When writing Parquet files, Impala does not use repetition levels.
However, the repetition level encoding in the page header is set to
BIT_PACKED, which is deprecated and may cause problems when the files are
read by other software.
Setting it to RLE solves this issue.
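A minimal sketch of why this is a metadata-only change, assuming the columns
Impala writes have a maximum repetition level of 0 (no repeated fields). The
helper names below are hypothetical and are not part of Impala. With a maximum
level of 0, the levels need zero bits, so no repetition level data is stored
and the encoding named in the header only describes an empty stream.

```cpp
#include <cassert>

// Hypothetical helper (not Impala code): number of bits needed to store
// any level in the range [0, max_level].
int LevelBitWidth(int max_level) {
  assert(max_level >= 0);
  int bits = 0;
  // Grow the width until 2^bits exceeds the largest level.
  while ((1 << bits) <= max_level) ++bits;
  return bits;
}

// Hypothetical helper: true when a column with this maximum repetition
// level needs no repetition level data in its pages.
bool RepetitionLevelsAreEmpty(int max_repetition_level) {
  return LevelBitWidth(max_repetition_level) == 0;
}
```

For a flat column, RepetitionLevelsAreEmpty(0) is true, which is why changing
the declared encoding does not change the bytes of the levels themselves.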
Testing: This change is only manually tested.
To test with default testdata loaded:
> create table default.test like tpch_parquet.orders stored as parquet;
> insert into default.test values (0,0,"",0,"","","",0,"");
Then fetch "hdfs://localhost:20500/test-warehouse/test/*.parq" and use
$ java -jar parquet-tools-1.6.0.jar dump /home/tianyi/Downloads/*.parq | grep RLE:
to inspect the file. Before the change you would see output like
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... VC:1
and after the change it should be
page 0: DLE:RLE RLE:RLE VLE:PLA [more]... VC:1
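The manual check above could be scripted along these lines. This is only a
sketch: FieldValue is a hypothetical helper, and it assumes the page summary
line is a space-separated list of KEY:VALUE fields as in the sample output,
with RLE: naming the repetition level encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the value of the field "key" in a page
// summary line, or "" if the field is absent. The key must follow a space
// so that "RLE" does not match the value inside "DLE:RLE".
std::string FieldValue(const std::string& line, const std::string& key) {
  const std::string needle = " " + key + ":";
  const std::size_t pos = line.find(needle);
  if (pos == std::string::npos) return "";
  const std::size_t start = pos + needle.size();
  const std::size_t end = line.find(' ', start);
  if (end == std::string::npos) return line.substr(start);
  return line.substr(start, end - start);
}
```

A file written before the change would yield "BIT_PACKED" for the RLE field,
and a file written after it would yield "RLE".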
Change-Id: I4112ec88e8f4050d28661d27f9227450288a6756
Reviewed-on: http://gerrit.cloudera.org:8080/7514
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/b2dbcbc2
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/b2dbcbc2
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/b2dbcbc2
Branch: refs/heads/master
Commit: b2dbcbc2d1bb7d57c5f50989ad25eec1783e52b2
Parents: 15e6cf8
Author: Tianyi Wang <[email protected]>
Authored: Wed Jul 26 16:31:03 2017 -0700
Committer: Tim Armstrong <[email protected]>
Committed: Mon Jul 31 17:03:01 2017 +0000
----------------------------------------------------------------------
be/src/exec/hdfs-parquet-table-writer.cc | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/b2dbcbc2/be/src/exec/hdfs-parquet-table-writer.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/hdfs-parquet-table-writer.cc b/be/src/exec/hdfs-parquet-table-writer.cc
index 04a81f1..237dd83 100644
--- a/be/src/exec/hdfs-parquet-table-writer.cc
+++ b/be/src/exec/hdfs-parquet-table-writer.cc
@@ -169,7 +169,6 @@ class HdfsParquetTableWriter::BaseColumnWriter {
data_encoding_stats_.clear();
// Repetition/definition level encodings are constant. Incorporate them here.
column_encodings_.insert(Encoding::RLE);
- column_encodings_.insert(Encoding::BIT_PACKED);
}
// Close this writer. This is only called after Flush() and no more rows will
@@ -738,7 +737,7 @@ void HdfsParquetTableWriter::BaseColumnWriter::NewPage() {
// relies on these specific values for the definition/repetition level
// encodings.
header.definition_level_encoding = Encoding::RLE;
- header.repetition_level_encoding = Encoding::BIT_PACKED;
+ header.repetition_level_encoding = Encoding::RLE;
current_page_->header.__set_data_page_header(header);
}
current_encoding_ = next_page_encoding_;