Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17878 )

Change subject: IMPALA-10934: Enable table definition over a single file
......................................................................


Patch Set 2:

(1 comment)

Is this expected to work on my dev setup? When I run:
create table sfstest like functional.alltypes location 
"sfs+hdfs://localhost:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt/#SINGLEFILE#";

I get:
ERROR: AnalysisException: No FileSystem for scheme "sfs+hdfs"
CAUSED BY: UnsupportedFileSystemException: No FileSystem for scheme "sfs+hdfs"
Maybe it needs a newer version of something.

A few questions:
1. If a user creates a table based on a single file and then drops the 
table, what happens? Does the external vs managed distinction continue to 
apply? (i.e. could dropping a managed table delete the file?)
2. I'm assuming there are very limited operations that we can do for this 
table. No inserts, no loads, etc. Is that right? What errors do these throw?
3. If the create statement specifies a schema that has partitioning (which 
would be subdirectories), do we throw an error?

For Impala, I'm wondering if it is possible to keep the SFS code limited to a 
small piece of the frontend. Basically, detect that a table is SFS (and thus 
some statements are not allowed) and then convert to the underlying filename 
and go from there with the rest of the code not needing knowledge of SFS.

http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc@142
PS2, Line 142: // The maximum number of SFS I/O threads.
             : DEFINE_int32(num_sfs_io_threads, 16, "Number of SFS I/O threads");
SFS maps down to some other storage type, and it should be using the logic for 
that underlying storage type. It's important for SFS+S3 to map down to S3 and 
SFS+Ozone to map down to Ozone, because we treat S3 differently from Ozone. For 
example, the file handle cache is only enabled for storage types where it is 
known to work, and getting that check wrong can lead to stability/performance 
issues.

If SFS is its own device name with its own set of threads, then it will lose 
those distinctions and there will be bugs.
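
To make the "map down" idea concrete, here is a minimal sketch of resolving 
the underlying scheme so that per-storage decisions (thread pool, file handle 
cache) can key off it. It assumes that every SFS URI uses a scheme of the form 
"sfs+<underlying>", as in the "sfs+hdfs" error above. UnderlyingScheme and 
kSfsPrefix are hypothetical names for illustration, not existing Impala code.

```cpp
#include <string>

// Assumed wrapper prefix, read off the "sfs+hdfs" scheme quoted in this review.
static const std::string kSfsPrefix = "sfs+";

// Returns the scheme of 'uri' with any leading "sfs+" wrapper removed,
// e.g. "sfs+s3a://bucket/key" yields "s3a". Returns "" if 'uri' has no "://".
std::string UnderlyingScheme(const std::string& uri) {
  const std::string::size_type sep = uri.find("://");
  if (sep == std::string::npos) return "";
  std::string scheme = uri.substr(0, sep);
  // Strip the wrapper only when the scheme actually starts with it.
  if (scheme.compare(0, kSfsPrefix.size(), kSfsPrefix) == 0) {
    scheme.erase(0, kSfsPrefix.size());
  }
  return scheme;
}
```

With something like this, an SFS path would follow the same branch as a plain 
path on the same storage type instead of needing its own device queue.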

If I'm understanding SFS correctly, then the Impala backend might not need to 
know about SFS at all. If the frontend knows that it is reading a single file 
table, it can convert the SFS filename to the real underlying filename 
before sending it to the backend. The backend code then doesn't need any 
special changes for the read path.
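
As a rough illustration of that conversion, the sketch below strips the SFS 
wrapper from a location string. It assumes that an SFS location is the 
underlying URI with "sfs+" prepended to the scheme and "/#SINGLEFILE#" 
appended, which is only what the example URI in this review suggests. The 
real frontend is Java, so this C++ version is just to show the logic; 
ToUnderlyingFileUri and both constants are hypothetical names.

```cpp
#include <string>

// Both markers are assumptions taken from the example location in this review.
static const std::string kSfsSchemePrefix = "sfs+";
static const std::string kSingleFileSuffix = "/#SINGLEFILE#";

// If 'uri' looks like an SFS single-file location, writes the underlying file
// URI to 'out' and returns true. Otherwise leaves 'out' untouched and returns
// false, so callers can fall through to the normal path handling.
bool ToUnderlyingFileUri(const std::string& uri, std::string* out) {
  const std::string::size_type p = kSfsSchemePrefix.size();
  const std::string::size_type s = kSingleFileSuffix.size();
  if (uri.size() < p + s) return false;
  if (uri.compare(0, p, kSfsSchemePrefix) != 0) return false;
  if (uri.compare(uri.size() - s, s, kSingleFileSuffix) != 0) return false;
  // Keep everything between the wrapper prefix and the single-file suffix.
  *out = uri.substr(p, uri.size() - p - s);
  return true;
}
```

If the frontend did the equivalent of this once, at analysis or planning time, 
the scan ranges sent to the backend would carry ordinary file paths.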



--
To view, visit http://gerrit.cloudera.org:8080/17878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I32be936243aa4c8320f5d06d2b7fbf98822f82e7
Gerrit-Change-Number: 17878
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Comment-Date: Wed, 27 Oct 2021 23:32:07 +0000
Gerrit-HasComments: Yes
