Marton Greber has posted comments on this change. ( http://gerrit.cloudera.org:8080/22868 )
Change subject: WIP KUDU-1261 writing/reading of array data blocks ...................................................................... Patch Set 15: (15 comments) First pass. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h File src/kudu/cfile/block_encodings.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h@110 PS15, Line 110: // the nit: incomplete sentence. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h File src/kudu/cfile/cfile-test-base.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h@575 PS15, Line 575: flatbuffers::FlatBufferBuilder builder; This part of the code manually implements FlatBuffers serialization instead of using the functions from array_type_serdes.h. Is this intentional? If not, is it possible to use functions from array_type_serdes.h to avoid code duplication? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc File src/kudu/cfile/cfile-test.cc: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc@1343 PS15, Line 1343: // TODO(aserbin): address DICT_ENCODING issues What specific issues prevent dictionary encoding from working with arrays? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc File src/kudu/cfile/cfile_reader.cc: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc@1318 PS15, Line 1318: unique_ptr<uint8_t[]> cblock_data( : new uint8_t[array_elems_to_fetch * elem_type_info->size()]); : unique_ptr<uint8_t[]> cblock_not_null_bitmap( : new uint8_t[BitmapSize(array_elems_to_fetch)]); If I read it correctly this part can be considered hot code path. Is the above comment "allocate it once for the whole block" the planned solution that is more performant here? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc File src/kudu/cfile/cfile_writer.cc: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@93 PS15, Line 93: static const size_t kMaxElementsInArray = 1024; I'm also +1 on getting this from a flag. Also, this limit doesn't seem to be enforced anywhere in the code - is this intentional? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@238 PS15, Line 238: size_t nrows = are we concerned about overflows here? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@497 PS15, Line 497: RETURN_NOT_OK(view.Init()); What happens if `RETURN_NOT_OK` is triggered, masking stays enabled? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@514 PS15, Line 514: // Mask the 'block is full' while writing a single array. : data_block_->SetBlockFullMasked(true); IIUC very large arrays can cause blocks to exceed size limits dramatically. How are you planning to solve this? Consider adding todo for that if not addressed in this patch. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@627 PS15, Line 627: // TODO(aserbin): clarify on first element ordinals in array data blocks Are there any open questions regarding ordinals? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h File src/kudu/common/array_cell_view.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h@57 PS15, Line 57: opt.max_depth = 3; : opt.max_tables = 65536 + 1; Hard-coded magic numbers should be named constants. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h@189 PS15, Line 189: // TODO(aserbin): switch arrray1d to bitfield instead of bool vector for validity? +1 http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h File src/kudu/common/array_type_serdes.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@37 PS15, Line 37: namespace kudu { This file has no test coverage. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@97 PS15, Line 97: val.resize(nrows); : if (column_data != nullptr) { : DCHECK_NE(0, nrows); : if constexpr (std::is_same<Slice, ElementType>::value) { : const Slice* ptr = reinterpret_cast<const Slice*>(column_data); : for (size_t idx = 0; idx < nrows; ++idx, ++ptr) { : const Slice elem = *ptr; : val.emplace_back(elem.data(), elem.size()); is the combination of resize & emplace_back really the intention here? http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@239 PS15, Line 239: serialzing nit: serializing http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@246 PS15, Line 246: inline Status Serialize( Missing from Serialize() but present in SerializeIntoArena(): - FLOAT, DOUBLE - UINT64, INT64 - DATE, UNIXTIME_MICROS - DECIMAL32, DECIMAL64 - VARCHAR - BOOL is this intentional? -- To view, visit http://gerrit.cloudera.org:8080/22868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5825c939cb40d350a23a78609115f26e62cca270 Gerrit-Change-Number: 22868 Gerrit-PatchSet: 15 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Xuebin Su <[email protected]> Gerrit-Comment-Date: Tue, 23 Sep 2025 13:28:30 +0000 Gerrit-HasComments: Yes
