Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17878 )

Change subject: IMPALA-10934: Enable table definition over a single file
......................................................................


Patch Set 2:

(1 comment)

Is this expected to work on my dev setup? When I run:

  create table sfstest like functional.alltypes location "sfs+hdfs://localhost:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt/#SINGLEFILE#";

I get:

  ERROR: AnalysisException: No FileSystem for scheme "sfs+hdfs"
  CAUSED BY: UnsupportedFileSystemException: No FileSystem for scheme "sfs+hdfs"

Maybe it needs a newer version of something.

A few questions:
1. If a user creates a table based on a single file and then they drop the table, what happens? Does the external vs managed distinction continue to apply? (i.e. it could delete the file)
2. I'm assuming there are very limited operations that we can do for this table. No inserts, no loads, etc. Is that right? What errors do these throw?
3. If the create statement specifies a schema that has partitioning (which would be subdirectories), do we throw an error?

For Impala, I'm wondering if it is possible to keep the SFS code limited to a small piece of the frontend. Basically, detect that a table is SFS (and thus some statements are not allowed) and then convert to the underlying filename and go from there with the rest of the code not needing knowledge of SFS.

http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc@142
PS2, Line 142: // The maximum number of SFS I/O threads.
             : DEFINE_int32(num_sfs_io_threads, 16, "Number of SFS I/O threads");
SFS maps down to some other storage type, and it should be using the logic for that underlying storage type. It's important for SFS+S3 to map down to S3 and SFS+Ozone to map down to Ozone, because we treat S3 differently from Ozone.
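
To make the mapping concrete, here is a minimal sketch (not code from this patch) of resolving an SFS URI to the URI of the underlying file. It assumes that the "sfs+" scheme prefix and the trailing "/#SINGLEFILE#" marker seen in the create statement above are the only SFS-specific parts of the URI. ToUnderlyingUri is a hypothetical helper, and it is written in C++ purely for illustration; the real conversion would more likely live in the frontend.

```cpp
#include <string>

// Hypothetical helper, not an existing Impala function.
// Assumption: an SFS URI is the underlying URI with "sfs+" prepended to the
// scheme and "/#SINGLEFILE#" appended to the path.
// URIs without the "sfs+" prefix are returned unchanged.
std::string ToUnderlyingUri(const std::string& uri) {
  static const std::string kSfsPrefix = "sfs+";
  static const std::string kSingleFileSuffix = "/#SINGLEFILE#";

  // Not an SFS URI: nothing to convert.
  if (uri.compare(0, kSfsPrefix.size(), kSfsPrefix) != 0) return uri;

  // Drop the "sfs+" prefix so that the underlying scheme (e.g. "hdfs") remains.
  std::string result = uri.substr(kSfsPrefix.size());

  // Drop the trailing single-file marker, if present.
  if (result.size() >= kSingleFileSuffix.size() &&
      result.compare(result.size() - kSingleFileSuffix.size(),
                     kSingleFileSuffix.size(), kSingleFileSuffix) == 0) {
    result.erase(result.size() - kSingleFileSuffix.size());
  }
  return result;
}
```

With a conversion along these lines, the location from the create statement above would resolve to an ordinary "hdfs://" file path, so the existing per-storage-type logic would apply without a separate SFS device.
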
The file handle cache, for example, is only enabled for storage types that are known to work (and it can lead to stability/performance issues if we don't do the check correctly). If SFS is its own device name with its own set of threads, then it will lose those distinctions and there will be bugs.

If I'm understanding SFS correctly, then the Impala backend might not need to know about SFS at all. If the frontend knows that it is reading a single file table, it can convert the SFS filename to the actual underlying file before sending it to the backend. The backend code then doesn't need any special changes for the read path.


-- 
To view, visit http://gerrit.cloudera.org:8080/17878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I32be936243aa4c8320f5d06d2b7fbf98822f82e7
Gerrit-Change-Number: 17878
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Comment-Date: Wed, 27 Oct 2021 23:32:07 +0000
Gerrit-HasComments: Yes
