Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16318 )
Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1 ...................................................................... Patch Set 13: (5 comments) http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/hdfs-fs-cache.cc File be/src/runtime/hdfs-fs-cache.cc: http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/hdfs-fs-cache.cc@103 PS13, Line 103: if (options != nullptr && !options->empty()) { : for (auto option : *options) { : hdfsBuilderConfSetStr( : hdfs_builder, option.first.c_str(), option.second.c_str()); : } : } does this actually work? were you able to confirm that non-default configs set this way are actually picked up by the returned hdfsFS object? according to the comment above, hdfsBuilderSetForceNewInstance has to be called in order for hdfsBuilderConfSetStr to have an affect. http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/disk-io-mgr.h File be/src/runtime/io/disk-io-mgr.h: http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/disk-io-mgr.h@370 PS13, Line 370: REMOTE_DFS_DISK_FILE_OPER_OFFSET, : REMOTE_S3_DISK_FILE_OPER_OFFSET, you cover this a bit in the commit message, but is the reason they were added to de-couple HDFS/S3 scans / writes from HDFS/S3 tmp-file-mgr operations? I guess that benefit is isolate between operations that spill vs. operations that read / write data on behalf of queries? it would be good to add some documentation here explaining why these two queues are necessary and when to, for example, use REMOTE_S3_DISK_FILE_OPER_OFFSET vs. REMOTE_S3_DISK_OFFSET http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/disk-io-mgr.cc@30 PS13, Line 30: #include "runtime/tmp-file-mgr-internal.h" Why are we including the tmp-file-mgr here? I see a lot of references to the tmp-file-mgr, but ideally I would think the disk-io-mgr and tmp-file-mgr would remain separate systems. I see a lot of code referring to tmp_files in this class as well, which I concerning. I would think that the disk-io-mgr should just provide an interface to the tmp-file-mgr. http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/disk-io-mgr.cc@241 PS13, Line 241: goto end; I think we generally try to avoid 'goto' statements because they make the control flow harder to follow. in this case the 'goto end' is only used once so I don't think it is necessary. http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/request-ranges.h File be/src/runtime/io/request-ranges.h: http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/io/request-ranges.h@113 PS13, Line 113: READ, : WRITE, : FETCH, : UPLOAD, not sure I understand the difference between all of these. fetch seems like it would be the same thing as read, and upload seems like the same thing as write. -- To view, visit http://gerrit.cloudera.org:8080/16318 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89 Gerrit-Change-Number: 16318 Gerrit-PatchSet: 13 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Wed, 30 Sep 2020 21:55:55 +0000 Gerrit-HasComments: Yes
