Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................
Patch Set 4: (9 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179:
Is this the reason the old sequence file writer created corrupted files?

PS4, Line 139: reinterpret_cast<const uint8_t*>(KEY_CLASS_NAME));
nit: indent 4. Same below.

Line 191: record.WriteBytes(output_length, output);
It seems a bit unfortunate that we don't free the temp buffer (i.e. output) after copying the data to record. Do you know if there may be other allocations from mem_pool_ which need to persist across calls to WriteCompressedBlock()? Otherwise, we should consider calling mem_pool_->FreeAll() at the caller of this function to avoid accumulating too much memory when writing a lot of data.

PS4, Line 193: // Output compressed keys block-size & compressed keys block.
            : // The keys block contains "\0\0\0\0" byte sequence as a key for each row
            : // (this is what Hive does).
Does not writing a key-lengths block and a keys block prevent the file from being read back correctly in Hive?

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29:
It would be good to add the details of the SequenceFile layout as a top-level comment here.
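For reference, such a top-level comment could describe a header along these lines. This is a rough Python sketch of my reading of the version-6 block-compressed SequenceFile header from the Hadoop docs, not the patch's actual code; the field order and encodings are assumptions to be checked against SequenceFile.Writer:

```python
import struct

def _text(s: str) -> bytes:
    """Hadoop Text-style string: VInt length, then UTF-8 bytes.
    For short strings (< 128 bytes) the VInt is a single length byte,
    which is all this sketch supports."""
    b = s.encode("utf-8")
    assert len(b) < 128
    return bytes([len(b)]) + b

def sequence_file_header(key_class: str, value_class: str,
                         codec_class: str, sync: bytes) -> bytes:
    """Sketch of a block-compressed SequenceFile (version 6) header."""
    assert len(sync) == 16
    out = b"SEQ" + bytes([6])    # 3-byte magic + 1-byte version
    out += _text(key_class)      # key class name
    out += _text(value_class)    # value class name
    out += bytes([1])            # boolean: values compressed?
    out += bytes([1])            # boolean: block compression?
    out += _text(codec_class)    # compression codec class name
    out += struct.pack(">i", 0)  # metadata: zero key/value pairs
    out += sync                  # 16-byte sync marker
    return out
```

After the header, each record block would then carry the record count plus the compressed key-lengths/keys/value-lengths/values sub-blocks, each preceded by its VInt size, with a sync marker between blocks.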
The following link seems to include quite a bit of useful information:
https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/SequenceFile.html

http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/WritableUtils.html
DCHECK_GE(num_bytes, 2);
DCHECK_LE(num_bytes, 9);

http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248: outp += size;
DCHECK_LE(outp - out_buffer_, length);

http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: SET COMPRESSION_CODEC=SNAPPY_BLOCKED;
It may be helpful to also add a test for writing an empty file:
select * from tpcds_parquet.store_sales where ss_item_sk=-1;

http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170: # Write block-compressed sequence file and then read it back in Hive
            : table_name = '%s.seq_tbl_block' % unique_database
            : self.client.execute("set COMPRESSION_CODEC=DEFAULT")
            : self.client.execute("set SEQ_COMPRESSION_MODE=BLOCK")
            : self.client.execute("create table %s like functional.alltypes" % table_name)
            : self.client.execute("insert into %s partition(year, month) "
            :                     "select * from functional.alltypes" % table_name)
            : output = self.run_stmt_in_hive("select count(*) from %s" % table_name)
            : assert "7300" == output.split('\n')[1]
Doesn't this duplicate the second test? It may help to test with an empty file, a file smaller than 4KB, and a file larger than 4KB.
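On the read-write-util.h comment above: the suggested DCHECK bounds (2..9 bytes) follow from Hadoop's WritableUtils VInt/VLong scheme, where a multi-byte value is a length-indicator byte plus 1..8 value bytes. A rough illustrative sketch of that encoding (in Python for brevity; this is not Impala's ReadWriteUtil code):

```python
def write_vlong(i: int) -> bytes:
    """Sketch of WritableUtils.writeVLong: one byte for -112..127,
    otherwise a length byte plus 1..8 big-endian value bytes
    (2..9 bytes total, matching the suggested DCHECK bounds)."""
    if -112 <= i <= 127:
        return bytes([i & 0xFF])
    length = -112
    if i < 0:
        i ^= -1          # same as ~i: store the one's complement
        length = -120
    tmp = i
    while tmp != 0:      # count how many value bytes are needed
        tmp >>= 8
        length -= 1
    out = [length & 0xFF]
    num_bytes = -(length + 120) if length < -120 else -(length + 112)
    for idx in range(num_bytes, 0, -1):
        out.append((i >> ((idx - 1) * 8)) & 0xFF)
    return bytes(out)

def read_vlong(buf: bytes) -> int:
    """Inverse sketch: decode the first byte as signed, then read
    the indicated number of big-endian value bytes."""
    first = buf[0] - 256 if buf[0] > 127 else buf[0]
    if first >= -112:
        return first
    negative = first < -120
    num_bytes = -(first + 120) if negative else -(first + 112)
    value = 0
    for b in buf[1:1 + num_bytes]:
        value = (value << 8) | b
    return ~value if negative else value
```

Encoding a small value like 300 yields three bytes (length byte plus two value bytes), so asserting `2 <= num_bytes <= 9` for the multi-byte path looks right.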
--
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-HasComments: Yes
