Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20010 )

Change subject: IMPALA-11996: Scanner change for Iceberg metadtata querying
......................................................................


Patch Set 5:

(64 comments)

Thanks for the patch, Tamas! Does really seem like a lot of work.

I took a first look and I think I found some mem leaks around 
IcebergMetadataScanNode.

I still have to digest the code in IcebergMetadataTableScanner, though. 
Honestly, for me it seems pretty ugly to have JNI call within c++ for literally 
everything. I naively thought that we could somehow let the Java part do the 
Java stuff and the C++ part only meant to ask for the next set of results in 
some format, like thrift.
Even if that's not possible, I think we can give some subtask to the Java part, 
like "please create me the object for the metadata table" and then we can hide 
the majority of the java class/variable/method/type references in the c++ code.
Can't we somehow keep the Java references minimal and let's say maintain the 
iterator that traverses the results, but then ask the Java part to get us the 
actual results giving it the iterator? Could results be passed in thrift or 
some buffer format between the 2 words? Once we got them, we could move the 
values into the row_batch.

I'm curious what others think about this, though.

http://gerrit.cloudera.org:8080/#/c/20010/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20010/5//COMMIT_MSG@7
PS5, Line 7: metadtata
typo


http://gerrit.cloudera.org:8080/#/c/20010/5//COMMIT_MSG@17
PS5, Line 17: se
typo


http://gerrit.cloudera.org:8080/#/c/20010/5//COMMIT_MSG@17
PS5, Line 17: struct column types
it's not just struct but nested types in general


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h
File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@20
PS5, Line 20: #include "exec/iceberg-metadata/iceberg-metadata-table-scanner.h"
Would it help to remove this include if we had a forward declaration of 
IcebergMetadataTableScanner in this header file?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@22
PS5, Line 22: #include "runtime/runtime-state.h"
            : #include "util/jni-util.h"
same as above


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@41
PS5, Line 41: /// ScanNode ancestor -> ExecNode
I don't think this comment is neccessary


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@42
PS5, Line 42: class IcebergMetadataScanNode : public ScanNode {
Don't you need a virtual destructor for this class?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@49
PS5, Line 49: Iceberg TableScan
What is an Iceberg 'TableScan'? I haven't found any reference in the cc file.


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@52
PS5, Line 52:   /// Get next rowbatch from the table scanner
this comment doesn't add much


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@55
PS5, Line 55:   /// Close the Iceberg TableScan
This comment doesn't add much


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@58
PS5, Line 58:   Status GetCatalogTable(JNIEnv* env, jobject* jtable);
private?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@60
PS5, Line 60:  protected:
Are there any derived classes from this one? I haven't found any. What's the 
reason having protected members?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@61
PS5, Line 61: tuple_desc_
nit: I think we use ' char around variable names in comments. like 'tuple_desc_'


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@67
PS5, Line 67: metadtata
typo


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@69
PS5, Line 69:   const string* metadata_table_name_;
does this have to be a pointer? isn't regular string enough?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.h@73
PS5, Line 73: scoped_ptr
unique_ptr ?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc
File be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@36
PS5, Line 36:     table_name_(new 
TTableName(pnode.tnode_->iceberg_scan_metadata_node.table_name)) {
Who deallocates this if you allocate this on the heap?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@37
PS5, Line 37: new string
Who deallocates this if you allocate this on the heap?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@43
PS5, Line 43: NULL
nullptr instead


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@44
PS5, Line 44:     return Status("Failed to get/create JVM");
merge with line above?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@49
PS5, Line 49: NULL
nullptr instead


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@50
PS5, Line 50:     return Status("Failed to get tuple descriptor.");
Merge with line above?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@55
PS5, Line 55:   jobject* jtable = new jobject();
Who is responsible of releasing the memory allocated here?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@55
PS5, Line 55: jobject* jtable = new jobject();
What I'm wondering here is if the only purpose of 'jtable' here is that we pass 
it to IcebergMetadataScanner than why don't we let the Scanner object create it 
itself. Then it won't be a question which class is responsible of deallocating 
the memory and such.


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@56
PS5, Line 56: // Move to Prepare or Init?
drop comment


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@63
PS5, Line 63:   JNIEnv* env = JniUtil::GetJNIEnv();
I'm not familiar with this JNIEnv but is this something that changes by time? 
If not, can't this become a class member and populate in Prepare or such so 
that we don't need to query and check for nullness multiple times?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@64
PS5, Line 64:   if (env == NULL) {
nullptr and merge with row below


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@67
PS5, Line 67: env
Do we have to keep pushing this JNIEnv into IcebergMetadataScanner through all 
the function call or can it save it at Init() and re-use it later? If yes, then 
forget my suggestion of making it a class member here and rather make it a 
member in IcebergMetadataScanner.


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-scan-node.cc@79
PS5, Line 79:     if (row_batch->AtCapacity() || *eos) {
Instead of 'while(true)' can't you make this the entry condition of the while 
loop? (Actually the negated condition of this)


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h
File be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h:

http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@24
PS5, Line 24: #include <memory>
I think this is not used


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@39
PS5, Line 39:
nit: extra space


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@72
PS5, Line 72:   inline static jclass fe_iceberg_table_cl_ = nullptr;
I'm wondering if it would make sense to place these java references to some 
common place. E.g. java_int_class and pals seem pretty common, there might be a 
need somewhere else. Not sure HBaseScanner uses any of these.


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@92
PS5, Line 92: references references
nit: typo


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@113
PS5, Line 113:   /// Getting enum objects are not straigthforward calls: enum 
-> field -> object
Frankly, I don't see how this comment is related to the line below :)


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@134
PS5, Line 134:   Status ReadTimeStampValue(JNIEnv* env, SlotDescriptor* 
slot_desc,
There are TimeStamp with TZ columns in metadata views that Impala doesn't 
support. Do you convert it to UTC and store it in a regular TS type?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@142
PS5, Line 142:  protected:
Aren't the below are 'private' too instead of protected?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@143
PS5, Line 143:   /// TupleDescriptor received from the ScanNode.
In general, comments like this don't add much extra value.


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@147
PS5, Line 147: veified
typo


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@151
PS5, Line 151:   jobject* jmetadata_table_ = new jobject();
Who deallocates this?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@153
PS5, Line 153: slot
'slots' ?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.h@156
PS5, Line 156: thorugh
nit: typo


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.cc
File be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.cc@133
PS5, Line 133: Status 
IcebergMetadataTableScanner::CreateJIcebergMetadataTable(JNIEnv* env,
Can't we replace this sequence of call with one JNI call to java that gets some 
details about the metadata table, like name and the name of the underlying 
Iceberg table and gives us the metadata table object?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.cc@158
PS5, Line 158: ScanMetadataTable
Another thing that makes me wondering is that do we actually perform the 
reading of the table here? We do planFiles() that gives us the list of files 
that we have to read and then we get an iterator for these files but I'm not 
sure we actually do read data here as the name of the function suggests.


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.cc@158
PS5, Line 158: Status IcebergMetadataTableScanner::ScanMetadataTable(JNIEnv* 
env) {
I wonder if we can hide this call from the interface somehow. E.g. include this 
in GetNext() in a way that first we check if we have executed the scan already 
and call this if we haven't. This would simplify the sequence of function call 
needed on the callsite. E.g. What if GetNext() is called without calling 
ScanMetadataTable() before?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/exec/iceberg-metadata/iceberg-metadata-table-scanner.cc@177
PS5, Line 177: Status IcebergMetadataTableScanner::CreateJAccessors(JNIEnv* 
env) {
Again, might be only my preference but here I'd prefer to like a single JNI 
call that gives us whatever we want to put into jaccessors_. I simply find 
these sequence of CallObjectMethod() calls unreadable.

for (... slot_desc...) {
  jaccessors_[pos] = env->GetAccessorForFieldOrSomething(...);
}

And I feel the same for the rest of the code in this class, just don't see the 
point of commenting the same many times :)

However, I'm a bit unsure here, and would like to hear other opinions too.
What do you think?


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/service/frontend.cc
File be/src/service/frontend.cc:

http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/service/frontend.cc@255
PS5, Line 255:     return Status("Failed to get/create JVM");
could fit into one line with the condition


http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/util/jni-util.h
File be/src/util/jni-util.h:

http://gerrit.cloudera.org:8080/#/c/20010/5/be/src/util/jni-util.h@284
PS5, Line 284:   static Status GetMethodID(JNIEnv* env, jclass class_ref, const 
char* method_str,
comment pls


http://gerrit.cloudera.org:8080/#/c/20010/5/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/20010/5/common/thrift/PlanNodes.thrift@683
PS5, Line 683: IcebergMetadtatScanNode
typo


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/analysis/IcebergMetadataTableRef.java
File fe/src/main/java/org/apache/impala/analysis/IcebergMetadataTableRef.java:

http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/analysis/IcebergMetadataTableRef.java@59
PS5, Line 59:     analyzeTimeTravel(analyzer);
Is it possible to provide time travel spec with metadata table queries with 
this patch? I haven't seen anything related in the tests.


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/analysis/IcebergMetadataTableRef.java@61
PS5, Line 61:     Analyzer.checkTableCapability(getTable(), 
Analyzer.OperationType.READ);
a ranger test would be nice to check if users with insufficient permissions on 
the table can't query metadata tables


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/analysis/IcebergMetadataTableRef.java@63
PS5, Line 63:     analyzeHints(analyzer);
What hints can you provide with these queries? I guess join hints could be 
possible. Could you add test coverage?


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergMetadataTable.java
File 
fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergMetadataTable.java:

http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/catalog/iceberg/IcebergMetadataTable.java@49
PS5, Line 49:   private final static Logger LOG = 
LoggerFactory.getLogger(IcebergMetadataTable.class);
I don't think this is used.


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java@39
PS5, Line 39:   // Metadata table name
this comment doesn't add much value on top of what the variable name says


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java@40
PS5, Line 40:   protected final String metadataTableName_;
I see this is only used once (in toThrift()) so I wouldn't introduce a class 
member to store the meta table name. It's already stored in 
IcebergMetadataTableRef too.


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java@43
PS5, Line 43: TableRef
this can even be an IcebergMetadataTableRef, right?


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/IcebergMetadataScanNode.java@53
PS5, Line 53:     assignConjuncts(analyzer);
assignConjuncts() and computeStats() are already done by calling super.init()


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1900
PS5, Line 1900:     } else if (table instanceof IcebergMetadataTable) {
I'm trying to understand the difference between this branch and the one where 
we have an FeFsTable that us an IcebergMetadataTable. In what use cases do we 
end up in this branch and in circumstances in L1879?

Q2: Shouldn't we rely on IcebergScanPlanner.createIcebergScanPlan() to create 
the IcebergMetadataScanNode for us as it does with IcebergScanNode?


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/service/JniFrontend.java
File fe/src/main/java/org/apache/impala/service/JniFrontend.java:

http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/service/JniFrontend.java@35
PS5, Line 35: import org.apache.iceberg.Accessor;
I see this many new imports, however, the new code added in this file doesn't 
justify why they are needed. What happened?


http://gerrit.cloudera.org:8080/#/c/20010/5/fe/src/main/java/org/apache/impala/service/JniFrontend.java@61
PS5, Line 61: // import org.apache.iceberg.io.CloseableIterable;
drop comment


http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/datasets/functional/functional_schema_template.sql@3702
PS5, Line 3702: ---- CREATE_HIVE
Do you have to create this table through Hive?


http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/datasets/functional/functional_schema_template.sql@3715
PS5, Line 3715:
nit: extra empty line?


http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test:

http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@1
PS5, Line 1: ####
Can you also have some test coverage that some SQL that are valid for regular 
tables aren't valid for metadata tables. E.g. SHOW CREATE TABLE, DESCRIBE, etc.


http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@2
PS5, Line 2: # Test 0 : Query all the metadatata tables once
All of the results in this test fits into one rowBatch. Could you have coverage 
for setting the RowBatch size lower to have the results in multiple batches?


http://gerrit.cloudera.org:8080/#/c/20010/5/testdata/workloads/functional-query/queries/QueryTest/iceberg-metadata-tables.test@311
PS5, Line 311: STRING
as we apparently are returning these complex types as strings would it require 
too much effort to return the stringified version of the complex values in the 
same way as we do now when they are present in the select list? I believe 
converting them into a JSON string is a simple function call, but I might am 
wrong



--
To view, visit http://gerrit.cloudera.org:8080/20010
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0e943cecd77f5ef7af7cd07e2b596f2c5b4331e7
Gerrit-Change-Number: 20010
Gerrit-PatchSet: 5
Gerrit-Owner: Tamas Mate <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Gergely Fürnstáhl <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 24 Aug 2023 15:12:45 +0000
Gerrit-HasComments: Yes

Reply via email to