Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22868 )

Change subject: WIP KUDU-1261 writing/reading of array data blocks
......................................................................


Patch Set 15:

(12 comments)

Thanks a lot for the review!

This is still a WIP patch, so expect several more revisions to come.  I'm also 
planning to separate some parts of this functionality into their own patches.

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h
File src/kudu/cfile/block_encodings.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/block_encodings.h@110
PS15, Line 110:   // the
> nit: incomplete sentence.
Done


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h
File src/kudu/cfile/cfile-test-base.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test-base.h@575
PS15, Line 575:         flatbuffers::FlatBufferBuilder builder;
> This part of the code manually implements FlatBuffers serialization instead
This code appeared before the idea of array_type_serdes.h and other wrappers.  
Yes, this needs to use Serialize() from array_type_serdes.h

I'm planning to update this in future revisions of this patch (as well as 
reduce code duplication in array_type_serdes.h itself).


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc
File src/kudu/cfile/cfile-test.cc:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile-test.cc@1343
PS15, Line 1343: // TODO(aserbin): address DICT_ENCODING issues
> What specific issues prevent dictionary encoding from working with arrays?
The dictionary encoder needs updating to take into account masking of the 
'block is full' bit.

I don't think it makes sense to mention it here since this is a TODO to 
addressing before removing the WIP tag: it cannot left unaddressed.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc
File src/kudu/cfile/cfile_reader.cc:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_reader.cc@1318
PS15, Line 1318:             unique_ptr<uint8_t[]> cblock_data(
               :                 new uint8_t[array_elems_to_fetch * 
elem_type_info->size()]);
               :             unique_ptr<uint8_t[]> cblock_not_null_bitmap(
               :                 new uint8_t[BitmapSize(array_elems_to_fetch)]);
> If I read it correctly this part can be considered hot code path.
Right: the idea is to allocate the memory for this just once (maybe, with some 
extra margin), and then re-use it here instead of allocating and de-allocating 
it again and again.  Alternatively, it might be allocated in the Arena where 
where results are being serialized into.  That's what those TODOs above are 
about.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc
File src/kudu/cfile/cfile_writer.cc:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@93
PS15, Line 93: static const size_t kMaxElementsInArray = 1024;
> I'm also +1 on getting this from a flag.
It's not intentional -- it's just not here yet: this is still a WIP patch.

In addition to this limit, there should be also a limit on the size of array 
element size for string/binary types.

Thanks for pointing at this.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@238
PS15, Line 238:     size_t nrows =
> are we concerned about overflows here?
In this current code of PS15: yes.  But this needs updating.

The idea is to not overshoot the configured size of the block more than 2x 
(could be done with look-ahead approach when writing array elements 
cell-by-cell), and allocate it here with such a margin.

Alternatively, these might be allocated in a queue-style manner with initial 
estimation as-is, but adding more chunks later on.  I had ChunkedBitmapIterator 
for that, but then I got rid of that approach, trying to keep this simple.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/cfile/cfile_writer.cc@627
PS15, Line 627:   // TODO(aserbin): clarify on first element ordinals in array 
data blocks
> Are there any open questions regarding ordinals?
No more at this point: it's already resolved, IIRC.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h
File src/kudu/common/array_cell_view.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_cell_view.h@57
PS15, Line 57:     opt.max_depth = 3;
             :     opt.max_tables = 65536 + 1;
> Hard-coded magic numbers should be named constants.
You'd notice they weren't magic numbers if you read the comments above.

Why do you think these need to be named constants?  I'd think that constants 
make sense only if they are used elsewhere; these are used only here and they 
aren't exposed outside of this local scope by any means.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h
File src/kudu/common/array_type_serdes.h:

http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@37
PS15, Line 37: namespace kudu {
> This file has no test coverage.
Yes, that's right: this is a WIP patch even as of PS15.

Almost nothing has test coverage in this changelist :)  The idea was to add it 
gradually, separating functionality by smaller chunks into changelists of their 
own.

The first priority was to build a stack of patches for
  1) unblocking people who are working on integrating this into other systems 
(e.g., working on supporting this from the Impala side)
  2) giving people an ability to provide feedback
  3) better visibility of the current progress

I'm planning to revv this WIP patch several more times, but please don't be 
discouraged to provide your feedback just because of that.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@97
PS15, Line 97: val.resize(nrows);
             :   if (column_data != nullptr) {
             :     DCHECK_NE(0, nrows);
             :     if constexpr (std::is_same<Slice, ElementType>::value) {
             :       const Slice* ptr = reinterpret_cast<const 
Slice*>(column_data);
             :       for (size_t idx = 0; idx < nrows; ++idx, ++ptr) {
             :         const Slice elem = *ptr;
             :         val.emplace_back(elem.data(), elem.size());
> is the combination of resize & emplace_back really the intention here?
Nope, it wasn't: there was 'reserve' first.  Actually, updates in this file is 
split between this changelist and this one: 
https://gerrit.cloudera.org/#/c/23220/9/src/kudu/common/array_type_serdes.h

I'm planning to have them here.


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@239
PS15, Line 239: serialzing
> nit: serializing
Done


http://gerrit.cloudera.org:8080/#/c/22868/15/src/kudu/common/array_type_serdes.h@246
PS15, Line 246: inline Status Serialize(
> Missing from Serialize() but present in SerializeIntoArena():
That's taken care in another changelist: 
https://gerrit.cloudera.org/#/c/23220/9/src/kudu/common/array_type_serdes.h



--
To view, visit http://gerrit.cloudera.org:8080/22868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5825c939cb40d350a23a78609115f26e62cca270
Gerrit-Change-Number: 22868
Gerrit-PatchSet: 15
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
Gerrit-Comment-Date: Wed, 24 Sep 2025 00:28:05 +0000
Gerrit-HasComments: Yes

Reply via email to