Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
Is this the reason the old sequence file writer created corrupted files?


PS4, Line 139: reinterpret_cast<const uint8_t*>(KEY_CLASS_NAME));
nit: indent 4 spaces. Same below.


Line 191:   record.WriteBytes(output_length, output);
It seems a bit unfortunate that we don't free the temp buffer (i.e. output)
after copying the data to record.

Do you know if there are other allocations from mem_pool_ that need to persist
across calls to WriteCompressedBlock()? Otherwise, we should consider calling
mem_pool_->FreeAll() at the caller of this function to avoid accumulating too
much memory when writing a lot of data.


PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key for each row
             :   // (this is what Hive does).
Does not writing the key-lengths block and keys block prevent the file from
being read back correctly in Hive?
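For context, my reading of that "\0\0\0\0" per-row key (an assumption based on
the quoted comment, not the patch's code): each key looks like a zero-length
BytesWritable, i.e. just a 4-byte big-endian length prefix of 0. A minimal
sketch of building such a keys block:

```python
import struct

def keys_block(num_rows):
    """Concatenate one zero-length BytesWritable key per row.

    Each key serializes as only its 4-byte big-endian length prefix (0),
    which is the "\0\0\0\0" byte sequence mentioned above.
    """
    return b''.join(struct.pack('>i', 0) for _ in range(num_rows))
```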


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
It would be good to add details of the SequenceFile layout as a top-level
comment here. The following link seems to include quite a bit of useful
information:

https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/SequenceFile.html
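For reference, a rough sketch of the header fields such a comment could
describe (my reading of the linked docs for the uncompressed, version-6 case;
this is illustration, not the patch's code):

```python
import struct

def vint(i):
    """Hadoop VInt: small non-negative values fit in a single byte."""
    assert 0 <= i <= 127
    return bytes([i])

def text(s):
    """Hadoop Text: VInt byte length followed by UTF-8 bytes."""
    b = s.encode('utf-8')
    return vint(len(b)) + b

def seq_header(key_cls, val_cls, sync):
    """Build a version-6 SequenceFile header (uncompressed case)."""
    assert len(sync) == 16
    return (b'SEQ' + bytes([6])        # magic + version byte
            + text(key_cls)            # key class name
            + text(val_cls)            # value class name
            + b'\x00'                  # compressed? false
            + b'\x00'                  # block-compressed? false
            + struct.pack('>i', 0)     # metadata: zero key/value pairs
            + sync)                    # 16-byte sync marker
```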


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/WritableUtils.html
How about adding:

DCHECK_GE(num_bytes, 2);
DCHECK_LE(num_bytes, 9);
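To show why those bounds hold, here is a sketch of Hadoop's zero-compressed
VLong encoding in Python (my reading of the WritableUtils docs, not the
patch's code): the multi-byte form is always 1 header byte plus 1 to 8
payload bytes, so num_bytes is between 2 and 9.

```python
def write_vlong(i):
    """Encode a signed 64-bit value in Hadoop's zero-compressed VLong format."""
    if -112 <= i <= 127:
        return bytes([i & 0xFF])           # single-byte fast path
    length = -112
    if i < 0:
        i = ~i                             # store the complement for negatives
        length = -120
    tmp = i
    while tmp != 0:                        # count payload bytes
        tmp >>= 8
        length -= 1
    num = -(length + 120) if length < -120 else -(length + 112)
    out = [length & 0xFF]                  # header byte encodes sign + size
    for shift in range(8 * (num - 1), -1, -8):
        out.append((i >> shift) & 0xFF)    # big-endian payload
    return bytes(out)

def read_vlong(buf):
    """Decode a VLong; returns (value, bytes_consumed)."""
    first = buf[0] - 256 if buf[0] > 127 else buf[0]
    if first >= -112:
        return first, 1
    negative = first < -120
    num = -(first + 120) if negative else -(first + 112)
    val = 0
    for b in buf[1:1 + num]:
        val = (val << 8) | b
    return (~val if negative else val), 1 + num
```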


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248:     outp += size;
How about adding:

DCHECK_LE(outp - out_buffer_, length);


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: SET COMPRESSION_CODEC=SNAPPY_BLOCKED;
It may be helpful to also add a test for writing an empty file:

select * from tpcds_parquet.store_sales where ss_item_sk=-1;


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170:     # Write block-compressed sequence file and then read it back in Hive
             :     table_name = '%s.seq_tbl_block' % unique_database
             :     self.client.execute("set COMPRESSION_CODEC=DEFAULT")
             :     self.client.execute("set SEQ_COMPRESSION_MODE=BLOCK")
             :     self.client.execute("create table %s like functional.alltypes" % table_name)
             :     self.client.execute("insert into %s partition(year, month) "
             :         "select * from functional.alltypes" % table_name)
             :     output = self.run_stmt_in_hive("select count(*) from %s" % table_name)
             :     assert "7300" == output.split('\n')[1]
Doesn't this duplicate the second test? It may help to also test with an empty
file, a file smaller than 4K, and a file larger than 4K.
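To make the size suggestion concrete, a small sketch of picking row counts
that straddle the 4K boundary (the bytes-per-row figure is a placeholder
assumption for illustration, not a measured value):

```python
APPROX_ROW_BYTES = 100       # placeholder assumption, not measured
BOUNDARY = 4 * 1024          # the 4K boundary mentioned above

def pick_limits():
    """Row counts meant to yield an empty file, one under 4K, one over."""
    just_over = BOUNDARY // APPROX_ROW_BYTES + 1
    return [0, 1, just_over]
```

Each limit could then drive an "insert ... select * from functional.alltypes
limit N" before the Hive read-back.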


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-HasComments: Yes
