[ 
https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236271#comment-16236271
 ] 

Steve Loughran commented on HDFS-7240:
--------------------------------------

Anu, I've got some more comments. Given the size of this JIRA and the number of 
watchers, I'm going to suggest a low-level "merge HDFS-7240" JIRA where we 
can discuss the low-level code details, and attach a .patch of the entire thing 
for Jenkins/Yetus to handle. This is how we've done the s3guard work and it 
helps split code issues from more strategic things like Konstantin's.

For large-scale tests, make sure you have tests which scale and a test-runner 
timeout designed to support multi-hour runs. AWS S3 now supports multi-TB 
files through multipart uploads; we run multi-GB long-haul uploads as part 
of the release process.

Now, more client-side comments. I think this is my client-side review done, and 
it's more at the HTTP REST API level than anything else.


h3. OzoneFileSystem

* Implement {{getDefaultBlockSize()}}; add a config option to let people set it, 
with a sensible default like 64 or 128 MB (see the sketch after this list).
* You could implement {{copyFromLocalFile}} and {{copyToLocalFile}} trivially 
using bucket.putKey(dst, path) & bucket.getKey(). This lines you up for 
HADOOP-14766, which is a high-performance upload from the local FS to a store.
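
A minimal sketch of the block-size idea; the config key {{ozone.fs.block.size}} 
and the 128 MB default are my own placeholders, not existing Ozone settings:

{code:java}
// Hypothetical sketch for OzoneFileSystem: a configurable default block size.
// The key "ozone.fs.block.size" and the 128 MB default are illustrative only.
public static final String OZONE_FS_BLOCK_SIZE_KEY = "ozone.fs.block.size";
public static final long OZONE_FS_BLOCK_SIZE_DEFAULT = 128 * 1024 * 1024L;

@Override
public long getDefaultBlockSize() {
  // getConf() comes from org.apache.hadoop.fs.FileSystem; getLongBytes()
  // accepts suffixed values such as "64m" or "128m"
  return getConf().getLongBytes(OZONE_FS_BLOCK_SIZE_KEY,
      OZONE_FS_BLOCK_SIZE_DEFAULT);
}
{code}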


h3. org.apache.hadoop.ozone.web.ozShell.Shell

# {{dispatch}} always returns 1, so it doesn't differentiate success (0) from 
any other error.
# {{run(String[] args)}} doesn't check for {{parseHandler}} returning null; it 
will NPE on a parse error, when normally you'd want to print a usage message.
# Shell bucket options should return an explicit usage error.


Have a look at what we've done in 
{{org.apache.hadoop.fs.s3a.s3guard.S3GuardTool}} w.r.t. raising exceptions:
* Any tool (for you: handler) can raise an UnknownOptionException, which triggers 
the usage() command with no stack trace printed.
* Any {{ExitUtil.ExitException}} sets the exit code of the process.
* We use {{ExitUtil.terminate(status, text)}} to exit the shell in main(); this 
can be turned off, so that tests can invoke the full CLI and verify failure modes.
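
For illustration, a rough sketch of that pattern applied here; {{ExitUtil}} is 
the existing Hadoop class, but the {{Shell().run(String[])}} entry point on the 
Ozone side is an assumption:

{code:java}
// Sketch only: S3GuardTool-style exit handling for the Ozone shell's main().
// The Shell().run(String[]) entry point is assumed, not the actual signature.
import org.apache.hadoop.util.ExitUtil;

public static void main(String[] args) {
  int status;
  try {
    status = new Shell().run(args);      // 0 on success, non-zero on failure
  } catch (ExitUtil.ExitException e) {
    // raised by handlers for usage/option errors: carries its own exit code,
    // and no stack trace is printed
    System.err.println(e.getMessage());
    status = e.status;
  } catch (Exception e) {
    e.printStackTrace();
    status = 1;
  }
  // In tests, ExitUtil.disableSystemExit() turns this into an ExitException
  // the test can catch, so the full CLI can be invoked and asserted on.
  ExitUtil.terminate(status);
}
{code}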


Tests: I don't see any CLI tests. Look at {{TestS3GuardCLI}} to see how we do 
CLI testing: looking for those ExitExceptions and asserting on the return value, 
as well as grabbing all the string output (which can be printed to a string as 
well as stdout) and checking that.
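
A sketch of what such a test could look like, assuming the run(String[]) entry 
point above raises {{ExitUtil.ExitException}} on bad arguments; 
{{LambdaTestUtils}} and {{ExitUtil}} are existing Hadoop classes, the rest is 
illustrative:

{code:java}
// Illustrative CLI test: expects Shell.run() to raise ExitUtil.ExitException
// on an unknown command, and asserts on the exit code it carries.
import org.apache.hadoop.test.LambdaTestUtils;
import org.apache.hadoop.util.ExitUtil;
import org.junit.Assert;
import org.junit.Test;

public class TestOzoneShellCLI {

  @Test
  public void testUnknownCommandFails() throws Exception {
    ExitUtil.disableSystemExit();
    ExitUtil.ExitException e = LambdaTestUtils.intercept(
        ExitUtil.ExitException.class,
        () -> new Shell().run(new String[]{"no-such-command"}));
    // assert on the exit code, not just that something was thrown
    Assert.assertNotEquals(0, e.status);
  }
}
{code}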

h3. OzoneRestClient

Need to plan for a common failure mode being a wrong endpoint, possibly returning 
plaintext or HTML error messages. Causes include client config and proxy issues. 
Also failures where the response is cut off partway through the read. 
Content-Length is your friend here.

* All HTTP requests MUST verify the content-type of the response. Otherwise a GET 
of an HTML page will return 200 and trigger a parse failure, when it's probably 
"you just got the URL wrong" (see the sketch after this list).
* Error handling should cope with content types text/plain or text/html and build 
a string from the body. Why? Jetty &c can raise their own exceptions and they 
will return text.
* Probably good to print the URL being used on HTTP failures, as it helps 
debug foundational config issues like "wrong endpoint".
* Anything downloading to a string before parsing should look for a 
Content-Length header, and, if set (1) verify that it's within a sensible range 
& (2) use it to size the download. Actually, EntityUtils.toString(entity) does 
that, but it limits the size of a response to 4KB for that reason. Are you 
confident that all responses parsed that way will be <= 4KB long? I'd write 
tests there.
* You can parse JSON straight off an InputStream: do that for any operation 
which can return large amounts of data.
* Why does {{setEndPointURI(URI)}} not just use {{Preconditions.checkArgument}} 
and throw an IllegalArgumentException?
* For consistency, {{setEndPoint(String clusterFQDN)}} needs a 
{{Preconditions.checkArgument(!Strings.isNullOrEmpty(clusterFQDN))}}.
* delete(). Allow for the full set of delete responses (200, 204). Less 
critical against your own endpoint, but still best practice.
* putKey. It's very inefficient to generate the checksum before the upload, as 
it means a full scan of the data before the upload. This won't scale efficiently 
to multi-GB uploads. What about calculating it during the PUT and having it 
returned by the endpoint, with the client verifying the resulting value? This 
does mean that a corrupted PUT will overwrite the destination data.
* If you do use the checksum, verify that an invalid checksum on PUT returns an 
exception which is handled in OzoneBucket.executePutKey(), that is: 400, 422 or 
whatever is sent back by the web engine. Of course, an invalid checksum must 
trigger a failure.
* listVolumes: what's the max # of volumes? Can all volumes be returned in a 
single 4KB payload?
* getKey: verify the content type; require Content-Length and use it when reading 
the file. Otherwise you won't pick up a broken GET.
* Add a LOG.debug() before every HTTP request for debugging of live systems.
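
To make the content-type and content-length points concrete, a rough sketch of 
the kind of response validation meant above, using the Apache HttpClient types 
already in play; the method name and the 4KB bound are illustrative:

{code:java}
// Illustrative response validation: reject non-JSON bodies and missing or
// oversize Content-Length values before attempting to parse. Names are
// hypothetical.
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;

private static final long MAX_BODY_LENGTH = 4 * 1024;  // the 4KB bound cited above

static HttpEntity validateJsonResponse(HttpResponse response, String url)
    throws IOException {
  HttpEntity entity = response.getEntity();
  if (entity == null) {
    throw new IOException("No response body from " + url);
  }
  String contentType = entity.getContentType() == null
      ? "" : entity.getContentType().getValue();
  if (!contentType.startsWith("application/json")) {
    // a text/html or text/plain body usually means "wrong endpoint", a proxy,
    // or a Jetty error page; include the URL to help debug config problems
    throw new IOException("Unexpected content type \"" + contentType
        + "\" from " + url);
  }
  long length = entity.getContentLength();
  if (length < 0 || length > MAX_BODY_LENGTH) {
    throw new IOException("Missing or oversize Content-Length (" + length
        + ") from " + url);
  }
  return entity;
}
{code}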


I see handling pooled connections is on the TODO list. Be good to have, 
obviously. It will make a significant difference on metadata ops, List etc. The 
big perf killer here is getKeyInfo, which is executed 1 or 2 times in 
OzoneFileSystem.getFileStatus(), and that is used throughout the Hadoop stack 
on the assumption it's a low-cost operation. Here it involves at worst setting 
up two HTTP connections; at least they are not HTTPS connections, which 
are much more expensive.
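
For reference, a minimal sketch of what a pooled client could look like with the 
Apache HttpClient the REST client already uses; the pool sizes are arbitrary:

{code:java}
// Sketch of connection pooling: build one client with a pooling connection
// manager and reuse it across metadata operations, instead of opening a fresh
// connection per request. Pool sizes are arbitrary illustrations.
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

static CloseableHttpClient newPooledClient() {
  PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
  cm.setMaxTotal(50);             // total connections across all routes
  cm.setDefaultMaxPerRoute(20);   // connections per endpoint
  return HttpClients.custom()
      .setConnectionManager(cm)
      .build();
}
{code}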


h3. Ozone Bucket

* OzoneBucket.executeListKeys: again, this needs to move to a direct JSON parse 
to handle responses > 4KB (see the sketch after this list).
* OzoneBucket.listKeys: the javadocs should indicate "number of entries to 
return" to distinguish it from the response length in chars.
* OzoneBucket.executeGetKeyInfo. 
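
A sketch of the direct parse with Jackson; the {{ListKeys}} response type here 
is a placeholder for whatever POJO the listing actually deserializes to:

{code:java}
// Illustrative streaming parse: read the listing straight off the HTTP
// entity's InputStream instead of buffering the whole body into a String.
// ListKeys stands in for the real Ozone response class.
import java.io.IOException;
import java.io.InputStream;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.HttpEntity;

private static final ObjectMapper MAPPER = new ObjectMapper();

static ListKeys parseListKeys(HttpEntity entity) throws IOException {
  try (InputStream in = entity.getContent()) {
    // Jackson parses incrementally, so the response size is no longer bounded
    // by a fixed string buffer
    return MAPPER.readValue(in, ListKeys.class);
  }
}
{code}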

h3. REST protocol itself

* I'd have expected KeyInfo to be a HEAD request with the fields returned as 
different headers to parse. That way the same headers can be returned on a GET, 
and it's consistent with the RESTful way. Many of the keyinfo fields are standard 
HTTP headers anyway (Content-Length, checksum, last modified).
* And it's potentially useful to be able to support custom headers in the future.

h3. Testing with TestOzoneRpcClient

* Add a rule for timeouts; critical here (see the sketch below).
* Parameterize it for both REST and RPC protocols, guaranteeing equivalent 
behaviour and identifying regressions.
* TestOzoneRpcClient should use LambdaTestUtils.intercept() to catch the IOE and 
check the message. It will (correctly) fail when an operation doesn't raise an 
exception, and include the output of the operation instead: 
{{LambdaTestUtils.intercept(IOException.class, "VOLUME_ALREADY_EXISTS", () -> 
store.createVolume(volumeName))}}

Tests to add:

* Add a test where the OzoneRestClient is pointed at a MiniDFSCluster RPC 
endpoint, e.g. http://localhost:8088/, to verify it fails meaningfully.
* Add a test where it's pointed at the MiniDFSCluster web UI; again, expect 
failure.
* Create a volume with an odd name like '%20' or '&'; see what happens.
* Create lots of volumes with long names. List them.
* Create lots of long keys; verify that listing works.




> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>            Priority: Major
>         Attachments: HDFS-7240.001.patch, HDFS-7240.002.patch, 
> HDFS-7240.003.patch, HDFS-7240.003.patch, HDFS-7240.004.patch, 
> Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf, ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a 
> generic storage layer. Using the Block Pool abstraction, new kinds of 
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode 
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.


