Alex Behm has posted comments on this change.

Change subject: IMPALA-4029: Reduce memory requirements for storing file 
metadata
......................................................................


Patch Set 3:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/6406/3/common/fbs/CatalogObjects.fbs
File common/fbs/CatalogObjects.fbs:

Line 36:   offset: long = 0 (id: 0);
Why a default value? Seems potentially dangerous.

Not for this change, but I'm thinking we don't even need to store the offset. 
If we know the block size, than we can derive the offset (assuming the list of 
file blocks is ordered by offset). Might be worth adding a TODO or recording 
that idea somewhere.


Line 40:   length: long = -1 (id: 1);
Seems redundant, why keep it?


Line 65:   compression: FbCompression (id: 3);
Will FlatBuffers add padding to align members? Ideally, we'd optimize for space 
and not access speed.


http://gerrit.cloudera.org:8080/#/c/6406/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

Line 218:   1: required binary file_desc_data
Nice


Line 296:   10: optional list<string> file_name_prefixes
Is this required for the move to flat buffers? I think we should consider an 
HdfsPathSet abstraction that assigns path ids and internally compresses the 
underlying strings. There's a lot of manual lookups and stitching in the 
current code. I don't feel too strongly about whether we should do that now, or 
clean up the code later.


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
File fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java:

Line 78:   public byte toFbCompression() {
toFlatBuffer()?


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 126:         locations = fileSystem.getFileBlockLocations(fileStatus, 0, 
fileStatus.getLen());
I think it would be better for now to keep all the code that fetches 
information from external systems in one place (HdfsTable). Splitting up the 
loading and delegating to several classes may make sense, but that probably 
requires significant surgery, and the current fetching code is very much 
centralized (we iterate over all files of in a table).

The loading code in this patch is more confusing to me than before. The meaning 
of some verbs like load/create is less clear.

If you agree with that direction, we may not need a FileDescriptor class at 
all, and can only rely on the FB to hold the data. It may still make sense to 
have a FileDescBuilder which you can use to construct a FbFileDesc.


Line 227:             loc.getNames().length);
Another case where we might be calling out to the NN.


Line 321:     private static int REPLICA_HOST_CACHE_MASK = 0x8000;
Less code and more readable to have one var with (1 << 15). The places that 
need to & the mask can bit-wise invert, i.e. (x & ~MASK).


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 189:   // List of file name prefixes.
Sorted list of file name prefixes?


Line 309:         Pair<Integer, String> compressedFileName =
How predictable is the compression behavior? Do we iterate over the files in 
lexicographical order for both HDFS and S3?
I'm worried about the case where an "invalidate metadata" suddenly leads to a 
higher memory requirement even if no files/partitions have changed.


Line 340:    * the suffix is equal to 'fileName'.
Should mention that this function expects the fileNames to be sorted.


Line 346:     String commonPrefix = Strings.commonPrefix(prevFileName, 
fileName);
I'm wondering if it's better to first check the last common prefix instead of 
the prefix between the current and prev file name. In the current version it 
seems like the list of prefixes could contain strings which are a common prefix 
of another one.


-- 
To view, visit http://gerrit.cloudera.org:8080/6406
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I483d3cadc9d459f71a310c35a130d073597b0983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to