Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/23613 )
Change subject: IMPALA-14092 Part2: Support querying of paimon data table via JNI ...................................................................... Patch Set 8: (16 comments) Thanks for your great work, again! Started looking at the backend stuff, will continue next week. http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.h File be/src/exec/paimon/paimon-jni-row-reader.h: http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.h@36 PS8, Line 36: an nit: 'an' is redundant http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.h@43 PS8, Line 43: nit: we usually don't put space here http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.h@48 PS8, Line 48: : nit: redundant empty lines http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.h@53 PS8, Line 53: * nit: we put the '*' next to the type http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.h@97 PS8, Line 97: CharVarChar nit: For WriteStringOrBinarySlot String is mentioned first, then Binary, and the template bool param refers to the second. In WriteCharVarCharSlot Char is mentioned first, then VarChar, and the template bool param refers to the first. It'd be better align the two methods. http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc File be/src/exec/paimon/paimon-jni-row-reader.cc: http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@64 PS8, Line 64: t nit: needs +4 indent http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@123 PS8, Line 123: c nit: missing space http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@148 PS8, Line 148: dynamic_cast We could add DCHECK_NOTNULL around the dynamic_cast. Current code might dereference a null pointer. Or, if we are sure about the actual arrow Array subtype, then we can just use static_cast which is much more efficient than dynamic_cast. Especially that we are doing a dynamic cast for each value in the array. It would be interesting to measure performance with static_casts only. http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@149 PS8, Line 149: bool result = bool_array->Value(row_idx); : *reinterpret_cast<bool*>(slot) = result; : return Status::OK(); This pattern repeats multiple times, could be moved to a template function like this: template<typename T, typename AT> Status WriteSlot(const AT* arrow_array, int row_idx, void* slot) { T value = arrow_array->Value(row_idx); *reinterpret_cast<T*>(slot) = value; return Status::OK(); } In most cases the following method could be used : template<typename T, typename AT> Status DowncastAndWriteSlot(const arrow::Array* arrow_array, const int row_idx, void* slot) { auto derived_array = row_idx == 0 ? DCHECK_NOTNULL(dynamic_cast<const AT*>(arrow_array)) : static_cast<const AT*>(arrow_array); return WriteSlot<T>(derived_array, row_idx, slot); } http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@171 PS8, Line 171: reinterpret_cast<int8_t>( Is it needed? 'v' is already int16_t. Same goes for L179, L187, L195. http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-row-reader.cc@270 PS8, Line 270: type.timezone(); Do we have tests with TIMESTAMP WITH LOCAL TIME ZONE and TIMESTAMP (WITHOUT TIME ZONE) as well? http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.h File be/src/exec/paimon/paimon-jni-scan-node.h: http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.h@57 PS8, Line 57: Comsume nit: Consume http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.cc File be/src/exec/paimon/paimon-jni-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.cc@148 PS8, Line 148: : nit: too many empty lines http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.cc@162 PS8, Line 162: nit: redundant empty line http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.cc@209 PS8, Line 209: row nit: row batch http://gerrit.cloudera.org:8080/#/c/23613/8/be/src/exec/paimon/paimon-jni-scan-node.cc@218 PS8, Line 218: RETURN_IF_ERROR(paimon_row_reader_->MaterializeTuple( : *paimon_arrow_record_batch_holder_, arrow_record_batch_row_index_, tuple_desc_, : tuple, row_batch->tuple_data_pool(), state)); : /// Evaluate conjuncts on this tuple row : if (ExecNode::EvalConjuncts(conjunct_evals().data(), : conjunct_evals().size(), tuple_row)) { : row_batch->CommitLastRow(); : tuple = reinterpret_cast<Tuple*>( : reinterpret_cast<uint8_t*>(tuple) + tuple_desc_->byte_size()); : IncrementNumRowsReturned(1); Since we are getting column-oriented data, in the future (definitely not in this CR) we could make the flow similar to the columnar scanners, where we fill a "scratch batch" by going through the columns one by one, then applying the conjuncts to the scratch batch. -- To view, visit http://gerrit.cloudera.org:8080/23613 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384 Gerrit-Change-Number: 23613 Gerrit-PatchSet: 8 Gerrit-Owner: ji chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: ji chen <[email protected]> Gerrit-Comment-Date: Fri, 21 Nov 2025 17:27:31 +0000 Gerrit-HasComments: Yes
