Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/22868 )
Change subject: WIP KUDU-1261 writing/reading of array data blocks ...................................................................... Patch Set 15: (12 comments) Thanks a lot for the review! This is still a WIP patch, so expect several more revisions to come. I'm also planning to separate some parts of this functionality into their own patches. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h File src/kudu/cfile/block_encodings.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h@110 PS15, Line 110: // the > nit: incomplete sentence. Done http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h File src/kudu/cfile/cfile-test-base.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h@575 PS15, Line 575: flatbuffers::FlatBufferBuilder builder; > This part of the code manually implements FlatBuffers serialization instead This code appeared before the idea of array_type_serdes.h and other wrappers. Yes, this needs to use Serialize() from array_type_serdes.h I'm planning to update this in future revisions of this patch (as well as reduce code duplication in array_type_serdes.h itself). http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc File src/kudu/cfile/cfile-test.cc: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc@1343 PS15, Line 1343: // TODO(aserbin): address DICT_ENCODING issues > What specific issues prevent dictionary encoding from working with arrays? The dictionary encoder needs updating to take into account masking of the 'block is full' bit. I don't think it makes sense to mention it here since this is a TODO to addressing before removing the WIP tag: it cannot left unaddressed. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc File src/kudu/cfile/cfile_reader.cc: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc@1318 PS15, Line 1318: unique_ptr<uint8_t[]> cblock_data( : new uint8_t[array_elems_to_fetch * elem_type_info->size()]); : unique_ptr<uint8_t[]> cblock_not_null_bitmap( : new uint8_t[BitmapSize(array_elems_to_fetch)]); > If I read it correctly this part can be considered hot code path. Right: the idea is to allocate the memory for this just once (maybe, with some extra margin), and then re-use it here instead of allocating and de-allocating it again and again. Alternatively, it might be allocated in the Arena where where results are being serialized into. That's what those TODOs above are about. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc File src/kudu/cfile/cfile_writer.cc: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@93 PS15, Line 93: static const size_t kMaxElementsInArray = 1024; > I'm also +1 on getting this from a flag. It's not intentional -- it's just not here yet: this is still a WIP patch. In addition to this limit, there should be also a limit on the size of array element size for string/binary types. Thanks for pointing at this. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@238 PS15, Line 238: size_t nrows = > are we concerned about overflows here? In this current code of PS15: yes. But this needs updating. The idea is to not overshoot the configured size of the block more than 2x (could be done with look-ahead approach when writing array elements cell-by-cell), and allocate it here with such a margin. Alternatively, these might be allocated in a queue-style manner with initial estimation as-is, but adding more chunks later on. I had ChunkedBitmapIterator for that, but then I got rid of that approach, trying to keep this simple. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@627 PS15, Line 627: // TODO(aserbin): clarify on first element ordinals in array data blocks > Are there any open questions regarding ordinals? No more at this point: it's already resolved, IIRC. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h File src/kudu/common/array_cell_view.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h@57 PS15, Line 57: opt.max_depth = 3; : opt.max_tables = 65536 + 1; > Hard-coded magic numbers should be named constants. You'd notice they weren't magic numbers if you read the comments above. Why do you think these need to be named constants? I'd think that constants make sense only if they are used elsewhere; these are used only here and they aren't exposed outside of this local scope by any means. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h File src/kudu/common/array_type_serdes.h: http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@37 PS15, Line 37: namespace kudu { > This file has no test coverage. Yes, that's right: this is a WIP patch even as of PS15. Almost nothing has test coverage in this changelist :) The idea was to add it gradually, separating functionality by smaller chunks into changelists of their own. The first priority was to build a stack of patches for 1) unblocking people who are working on integrating this into other systems (e.g., working on supporting this from the Impala side) 2) giving people an ability to provide feedback 3) better visibility of the current progress I'm planning to revv this WIP patch several more times, but please don't be discouraged to provide your feedback just because of that. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@97 PS15, Line 97: val.resize(nrows); : if (column_data != nullptr) { : DCHECK_NE(0, nrows); : if constexpr (std::is_same<Slice, ElementType>::value) { : const Slice* ptr = reinterpret_cast<const Slice*>(column_data); : for (size_t idx = 0; idx < nrows; ++idx, ++ptr) { : const Slice elem = *ptr; : val.emplace_back(elem.data(), elem.size()); > is the combination of resize & emplace_back really the intention here? Nope, it wasn't: there was 'reserve' first. Actually, updates in this file is split between this changelist and this one: https://gerrit.cloudera.org/#/c/23220/9/src/kudu/common/array_type_serdes.h I'm planning to have them here. http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@239 PS15, Line 239: serialzing > nit: serializing Done http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@246 PS15, Line 246: inline Status Serialize( > Missing from Serialize() but present in SerializeIntoArena(): That's taken care in another changelist: https://gerrit.cloudera.org/#/c/23220/9/src/kudu/common/array_type_serdes.h -- To view, visit http://gerrit.cloudera.org:8080/22868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I5825c939cb40d350a23a78609115f26e62cca270 Gerrit-Change-Number: 22868 Gerrit-PatchSet: 15 Gerrit-Owner: Alexey Serbin <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Xuebin Su <[email protected]> Gerrit-Comment-Date: Wed, 24 Sep 2025 00:28:05 +0000 Gerrit-HasComments: Yes
