[Impala-ASF-CR] IMPALA-10934: Enable table definition over a single file

Joe McDonnell (Code Review) Wed, 27 Oct 2021 16:36:22 -0700

Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17878 )

Change subject: IMPALA-10934: Enable table definition over a single file
......................................................................

Patch Set 2:

(1 comment)

Is this expected to work on my dev setup? When I run:
create table sfstest like functional.alltypes location
"sfs+hdfs://localhost:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt/#SINGLEFILE#";

I get:
ERROR: AnalysisException: No FileSystem for scheme "sfs+hdfs"
CAUSED BY: UnsupportedFileSystemException: No FileSystem for scheme "sfs+hdfs"
Maybe it needs a newer version of something.

A few questions:
1. If a user creates a table based on a single file and then drops the
table, what happens? Does the external vs. managed distinction continue to
apply? (i.e., could dropping a managed table delete the underlying file?)
2. I'm assuming there are very limited operations that we can do for this
table. No inserts, no loads, etc. Is that right? What errors do these throw?
3. If the create statement specifies a schema that has partitioning (which
would be subdirectories), do we throw an error?

For Impala, I'm wondering if it is possible to keep the SFS code limited to a
small piece of the frontend. Basically, detect that a table is SFS (and thus
some statements are not allowed) and then convert to the underlying filename
and go from there with the rest of the code not needing knowledge of SFS.
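
A rough sketch of that idea, written in Python only for brevity. The prefix and suffix constants are inferred from the location string in the example above; the function names and the set of blocked statements are hypothetical and not taken from the patch:

```python
# Inferred from the example location "sfs+hdfs://.../090901.txt/#SINGLEFILE#".
SFS_SCHEME_PREFIX = "sfs+"
SFS_SUFFIX = "/#SINGLEFILE#"

# Hypothetical: statement kinds that would be rejected for single-file tables.
BLOCKED_STATEMENTS = frozenset({"INSERT", "LOAD"})


def is_sfs_location(location):
    """Return True if the location looks like a single-file (SFS) URI."""
    return location.startswith(SFS_SCHEME_PREFIX) and location.endswith(SFS_SUFFIX)


def to_underlying_location(location):
    """Strip the SFS scheme prefix and marker suffix; leave other locations unchanged."""
    if not is_sfs_location(location):
        return location
    return location[len(SFS_SCHEME_PREFIX):-len(SFS_SUFFIX)]


def check_statement_allowed(statement_kind, location):
    """Raise ValueError if the statement kind is blocked for an SFS table location."""
    if is_sfs_location(location) and statement_kind.upper() in BLOCKED_STATEMENTS:
        raise ValueError(
            f"{statement_kind} is not supported on a single-file table: {location}"
        )
```

With something like this done once during analysis, everything downstream would only ever see the plain hdfs:// (or s3a://, etc.) path.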

http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc@142
PS2, Line 142: // The maximum number of SFS I/O threads.
             : DEFINE_int32(num_sfs_io_threads, 16, "Number of SFS I/O threads");

SFS maps down to some other storage type, and it should use the logic for
that underlying storage type. It's important for SFS+S3 to map down to S3 and
SFS+Ozone to map down to Ozone, because we treat S3 differently from Ozone. For
example, the file handle cache is only enabled for storage types that are known
to work with it, and getting that check wrong can lead to stability/performance
issues.

If SFS is its own device name with its own set of threads, then it will lose
those distinctions and there will be bugs.
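
To illustrate the mapping I mean, here is a minimal sketch in Python (for brevity only). The scheme-to-storage table and the function names are hypothetical placeholders, not Impala's actual classification code:

```python
SFS_SCHEME_PREFIX = "sfs+"

# Hypothetical lookup from Hadoop URI scheme to the storage kind whose
# I/O settings (thread pool, file handle cache policy, ...) should apply.
STORAGE_KIND_BY_SCHEME = {
    "hdfs": "HDFS",
    "s3a": "S3",
    "o3fs": "OZONE",
    "ofs": "OZONE",
}


def underlying_scheme(location):
    """Return the URI scheme with any leading 'sfs+' wrapper removed."""
    scheme = location.split("://", 1)[0].lower()
    if scheme.startswith(SFS_SCHEME_PREFIX):
        scheme = scheme[len(SFS_SCHEME_PREFIX):]
    return scheme


def storage_kind(location):
    """Classify a location by its underlying storage, never as 'SFS' itself."""
    return STORAGE_KIND_BY_SCHEME.get(underlying_scheme(location), "UNKNOWN")
```

The point is that an sfs+s3a:// location and a plain s3a:// location end up in the same bucket, so per-storage decisions stay consistent.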

If I'm understanding SFS correctly, then the Impala backend might not need to
know about SFS at all. If the frontend knows that it is reading a single-file
table, it can convert the SFS filename to the actual underlying file before
sending it to the backend. The backend code then doesn't need any special
changes for the read path.

--
To view, visit http://gerrit.cloudera.org:8080/17878
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I32be936243aa4c8320f5d06d2b7fbf98822f82e7
Gerrit-Change-Number: 17878
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Comment-Date: Wed, 27 Oct 2021 23:32:07 +0000
Gerrit-HasComments: Yes
