Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16318 )
Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1 ...................................................................... Patch Set 17: (13 comments) http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/hdfs-fs-cache.h File be/src/runtime/hdfs-fs-cache.h: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/hdfs-fs-cache.h@59 PS17, Line 59: NULL nit: use nullptr instead of NULL http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/disk-file.h@96 PS17, Line 96: /// Return the lock of the file. : boost::shared_mutex* GetFileLock() { return &lock_; } : : /// Return the status lock of the file. : SpinLock* GetStatusLock() { return &status_lock_; } I think we typically try to avoid returning internal locks publicly. is there a reason these locks need to be exposed to other classes? can the code be re-factored so that any code that needs to acquire these locks lives inside this class? http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/disk-file.h@123 PS17, Line 123: volatile why volatile? I've generally found the semantics of volatile in C++ to be quite complex and hard to get right (e.g. see https://sites.google.com/site/kjellhedstrom2/stay-away-from-volatile-in-threaded-code) http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/disk-io-mgr.cc@253 PS17, Line 253: // The file handle is closed, set the file to persisted status. : lock_guard<SpinLock> l(disk_file_->status_lock_); : disk_file_->SetPersisted(); bit of an anti-pattern to acquire a external classes internal locks. its probably cleaner to just have the SetPersisted method acquire the lock and then sate the state to PERSISTED. http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/file-writer.h File be/src/runtime/io/file-writer.h: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/file-writer.h@52 PS17, Line 52: virtual Status WriteOne(WriteRange*) = 0; not sure I understand the difference between WriteOne and Write http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/file-writer.h@68 PS17, Line 68: std::mutex lock_; would be nice to explain the thread-safety semantics of this class http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/file-writer.h@67 PS17, Line 67: DiskIoMgr* io_mgr_ = nullptr; : std::mutex lock_; : int64_t written_bytes_ = 0; : const char* file_path_; : const int64_t file_size_; : const int64_t max_page_size_; nit: would be nice to have some docs on each of these http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/hdfs-file-reader.cc File be/src/runtime/io/hdfs-file-reader.cc: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/io/hdfs-file-reader.cc@98 PS17, Line 98: Status status = Status::OK(); nit: is this change necessary? http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/tmp-file-mgr-internal.h File be/src/runtime/tmp-file-mgr-internal.h: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/tmp-file-mgr-internal.h@46 PS17, Line 46: /// Runtime Error during creation of the TmpFile. : class ResourceError : public std::runtime_error { : public: : explicit ResourceError(const Status& status) : : runtime_error(status.msg().msg()), status_(status) {} : virtual ~ResourceError() {} : Status& GetStatus() { return status_; } : : private: : Status status_; : }; see other comment about C++ errors, probably should be removed. should be able to use 'Status' objects instead. http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/13/be/src/runtime/tmp-file-mgr.cc@235 PS13, Line 235: // Initialize the S3 options for later getting S3 connection. : s3a_options_ = {make_pair("fs.s3a.fast.upload", "tru probably don't want to hardcode these http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/tmp-file-mgr.cc@236 PS17, Line 236: s3a_options_ = {make_pair("fs.s3a.fast.upload", "true"), : make_pair("fs.s3a.fast.upload.buffer", "disk")}; is the implementation of this feature dependent on setting these values? I don't think we want to set these values in the code itself, these types of configs are usually configured externally in an xml file (e.g. hadoop-site.xml or hdfs-site.xml) http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/runtime/tmp-file-mgr.cc@728 PS17, Line 728: throw TmpFile::ResourceError( : Status(GetHdfsErrorMsg("Failed to connect to FS: ", hdfs_url))); we try to avoid C++ errors. Impala uses 'Status' objects instead. http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/util/hdfs-util.h File be/src/util/hdfs-util.h: http://gerrit.cloudera.org:8080/#/c/16318/17/be/src/util/hdfs-util.h@31 PS17, Line 31: ABFS_1 nit: ABFSS or ABFS_SECURE would probably be a better name -- To view, visit http://gerrit.cloudera.org:8080/16318 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89 Gerrit-Change-Number: 16318 Gerrit-PatchSet: 17 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Fri, 23 Oct 2020 16:59:02 +0000 Gerrit-HasComments: Yes
