Marton Greber has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22868 )

Change subject: WIP KUDU-1261 writing/reading of array data blocks
......................................................................


Patch Set 15:

(15 comments)

First pass.

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h
File src/kudu/cfile/block_encodings.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h@110
PS15, Line 110:   // the
nit: incomplete sentence.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h
File src/kudu/cfile/cfile-test-base.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h@575
PS15, Line 575:         flatbuffers::FlatBufferBuilder builder;
This part of the code manually implements FlatBuffers serialization instead of 
using the functions from array_type_serdes.h. Is this intentional?
If not, is it possible to use functions from array_type_serdes.h to avoid code 
duplication?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc
File src/kudu/cfile/cfile-test.cc:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc@1343
PS15, Line 1343: // TODO(aserbin): address DICT_ENCODING issues
What specific issues prevent dictionary encoding from working with arrays?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc
File src/kudu/cfile/cfile_reader.cc:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc@1318
PS15, Line 1318:             unique_ptr<uint8_t[]> cblock_data(
               :                 new uint8_t[array_elems_to_fetch * 
elem_type_info->size()]);
               :             unique_ptr<uint8_t[]> cblock_not_null_bitmap(
               :                 new uint8_t[BitmapSize(array_elems_to_fetch)]);
If I read it correctly this part can be considered hot code path.
Is the above comment "allocate it once for the whole block" the planned 
solution that is more performant here?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc
File src/kudu/cfile/cfile_writer.cc:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@93
PS15, Line 93: static const size_t kMaxElementsInArray = 1024;
I'm also +1 on getting this from a flag.

Also, this limit doesn't seem to be enforced anywhere in the code - is this 
intentional?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@238
PS15, Line 238:     size_t nrows =
are we concerned about overflows here?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@497
PS15, Line 497:         RETURN_NOT_OK(view.Init());
What happens if `RETURN_NOT_OK` is triggered, masking stays enabled?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@514
PS15, Line 514:         // Mask the 'block is full' while writing a single 
array.
              :         data_block_->SetBlockFullMasked(true);
IIUC very large arrays can cause blocks to exceed size limits dramatically.
How are you planning to solve this?
Consider adding todo for that if not addressed in this patch.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@627
PS15, Line 627:   // TODO(aserbin): clarify on first element ordinals in array 
data blocks
Are there any open questions regarding ordinals?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h
File src/kudu/common/array_cell_view.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h@57
PS15, Line 57:     opt.max_depth = 3;
             :     opt.max_tables = 65536 + 1;
Hard-coded magic numbers should be named constants.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h@189
PS15, Line 189:   // TODO(aserbin): switch arrray1d to bitfield instead of bool 
vector for validity?
+1


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h
File src/kudu/common/array_type_serdes.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@37
PS15, Line 37: namespace kudu {
This file has no test coverage.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@97
PS15, Line 97: val.resize(nrows);
             :   if (column_data != nullptr) {
             :     DCHECK_NE(0, nrows);
             :     if constexpr (std::is_same<Slice, ElementType>::value) {
             :       const Slice* ptr = reinterpret_cast<const 
Slice*>(column_data);
             :       for (size_t idx = 0; idx < nrows; ++idx, ++ptr) {
             :         const Slice elem = *ptr;
             :         val.emplace_back(elem.data(), elem.size());
is the combination of resize & emplace_back really the intention here?


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@239
PS15, Line 239: serialzing
nit: serializing


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@246
PS15, Line 246: inline Status Serialize(
Missing from Serialize() but present in SerializeIntoArena():
- FLOAT, DOUBLE
- UINT64, INT64
- DATE, UNIXTIME_MICROS
- DECIMAL32, DECIMAL64
- VARCHAR
- BOOL

is this intentional?



--
To view, visit http://gerrit.cloudera.org:8080/22868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5825c939cb40d350a23a78609115f26e62cca270
Gerrit-Change-Number: 22868
Gerrit-PatchSet: 15
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
Gerrit-Comment-Date: Tue, 23 Sep 2025 13:28:30 +0000
Gerrit-HasComments: Yes

Reply via email to