cbevard1 commented on PR #5563:
URL: https://github.com/apache/hadoop/pull/5563#issuecomment-1517915320
> minor: checkstyle unhappy about line length...please keep at 100 chars or less

No problem, will do.
> How about adding a new test case or modifying testRegularUpload() to create a file with a name > 256 chars just see what happens?
Initially I had some code in the S3ADataBlock to trim the key if the file name was nearing 255 chars. But after testing what would happen when a massive S3 key was passed to `File.createTempFile()`, I noticed that it automatically truncates the end of the prefix so that the name, with the appended random number and suffix (".tmp" if one wasn't specified), fits within the FS's char limit. I'll add a unit test like you suggested so we can detect if `createTempFile` ever stops truncating names that exceed the FS's max length.
Here's a snippet from the javadocs for `createTempFile()` that describes how prefixes are handled:
> "To create the new file, the prefix and the suffix may first be adjusted
to fit the limitations of the underlying platform. If the prefix is too long
then it will be truncated, but its first three characters will always be
preserved."
> Maybe the solution is to do some cutting down of paths such that first few and final chars are always preserved. along with span ID that should be good, though it does depend on filenames generated...does accumulo generate sufficiently unique ones that the last, say, 128 chars will be something you can map to an upload?
With Accumulo, it's the WALs that are important to recover, and they're named with a UUID, so they're effectively unique even without a prefix. The full WAL key is built as `bucket+instance_volume(folder/prefix)+tserver_hostname+UUID`, so avoiding the 255 char limit is pretty doable if you keep your bucket and instance volume names short. The longest file name I've seen so far in my testing, with the spanId now included, was 175 chars. If I removed the S3 key prefix from the name and kept just the UUID, that would be a little safer for Accumulo, but it could cause issues with other systems where the prefix is more important contextually than the name. I lean towards leaving the naming convention as it is (s3ablock-PARTNUM-SPAN_ID-ESCAPED_S3_KEY-RANDOM_NUM.tmp).
Even if the UUID were truncated, just a few of its chars along with the part number will almost always be enough to uniquely match it to a pending Multipart Upload. If the key gets trimmed so far that the recovery information you need can't be derived from the file name, then there's the logs and span ID to fall back on.
I'll get that test added and correct the style issue. Let me know if you'd like me to make any changes to the file name that's generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]