[ https://issues.apache.org/jira/browse/HDFS-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129108#comment-13129108 ]
Alejandro Abdelnur commented on HDFS-2178:
------------------------------------------

*On Sanjay's create and append:*

You are correct, an HDFS proxy deployment does not need to redirect (to a DN); the proxy handles the request itself. Still, for authentication purposes a probe request should be made before attempting to upload data. Because of this, the create & append requests are identical in the hdfs-proxy (Hoop) and in the built-in (NN & DN HTTP serving) modes. In the hdfs-proxy case the probe is for authentication only; in the built-in case it is for both authentication and potential redirection. This means that we can have the exact same API for both hdfs-proxy and built-in modes. The use of 100-continue is still an open issue; more on this at the end of this comment.

*On Sanjay's comment on 'some thoughts of webhdfs & hoop':*

* Support for trusted proxies (doAs functionality) makes sense in the case of hdfs-proxy, and Hoop already supports it. This covers server-side apps that need/want HTTP access to HDFS and act on behalf of other users, for example somebody using the Java API to access HDFS via hdfs-proxy inside a doAs block.
* Support for delegation tokens to access hdfs-proxy makes sense, for example when using distcp via hdfs-proxy. In this case, delegation tokens should work across clusters (this may not be supported today, but IMO it should eventually work).
* You mention code/param/return clean up. What kind of clean up are you referring to?

*On Sanjay's 'As we move forward':*

* What subset of the webhdfs API makes sense for a proxy? IMO, they should be identical; a user should not see a difference between accessing a built-in and an hdfs-proxy HTTP setup.
* Regarding a 'pure proxy': this would be more like a reverse proxy, and then all URLs would have to be relative or resolved with knowledge of the reverse proxy. IMO, an hdfs-proxy on its own has its merits.
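
To illustrate the probe-then-upload flow for create & append described above, here is a minimal client-side sketch in Java. It only models the decision a client takes after the data-less probe request. The status codes (200, 307, 401/403), the host names, the ports and the `op=create` query parameter are all assumptions made for illustration; they are not taken from the attached API doc.

```java
import java.net.URI;

/** Sketch: where should the client send the data after the probe request? */
public class ProbeThenUpload {

    /**
     * Returns the URI the data should be uploaded to, given the probe response.
     * Assumed (hypothetical) semantics:
     *   200      -> authenticated, upload to the same endpoint (hdfs-proxy mode)
     *   307      -> authenticated, upload to the Location target (built-in mode, NN -> DN)
     *   401, 403 -> authentication/authorization failed, do not upload
     */
    public static URI uploadTarget(URI probed, int status, String location) {
        if (status == 401 || status == 403) {
            throw new SecurityException("probe rejected with HTTP " + status);
        }
        if (status == 307) {
            if (location == null) {
                throw new IllegalStateException("redirect without Location header");
            }
            // resolve() also copes with a relative Location value
            return probed.resolve(location);
        }
        if (status == 200) {
            return probed;
        }
        throw new IllegalStateException("unexpected probe status: " + status);
    }

    public static void main(String[] args) {
        // Hypothetical endpoints, for illustration only.
        URI proxy = URI.create("http://proxy.example:14000/dir/file?op=create");
        URI namenode = URI.create("http://namenode.example:50070/dir/file?op=create");

        // hdfs-proxy mode: the probe is for authentication only.
        System.out.println(uploadTarget(proxy, 200, null));

        // built-in mode: the probe authenticates and may redirect to a DN.
        System.out.println(uploadTarget(namenode, 307,
                "http://datanode.example:50075/dir/file?op=create"));
    }
}
```

In both modes the client code is the same: it sends the probe, asks `uploadTarget` where to go, and then uploads the data there, which is the point of keeping the API identical.
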
*Open issues:*

1. *Use of 100-CONTINUE for create & append*: it seems not all client HTTP libraries handle this (JDK HttpURLConnection, to start with). Also, the servlet API does not provide support for it; some servlet containers seem to handle it, but either in a non-standard way (http://jira.codehaus.org/browse/JETTY-341) or in a way that never reaches the servlet (http://stackoverflow.com/questions/848378/sending-100-continue-using-java-servlet-api). Because of this I'm inclined to use a handle request as shown in the attached API doc.
2. *Are we OK with the attached API* (except for the discussion on #1)?
3. *Codebase*: Hoop was using TestNG for testcases and non-Apache package names. I've been working on refactoring it to work with JUnit, to rename the packages, and to organize the code in a way that fits the current source layout. In the meantime, for webhdfs (built-in HTTP) some code from Hoop has been cloned, modified and integrated into HDFS. This code has changed significantly, so integrating it with Hoop will require some serious rewriting of Hoop. Given the current timeframe (we are shooting for 0.23), should we add Hoop as a separate module to have hdfs-proxy-like support and later see how to merge the code?

> Contributing Hoop to HDFS, replacement for HDFS proxy with read/write capabilities
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-2178
>                 URL: https://issues.apache.org/jira/browse/HDFS-2178
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>   Affects Versions: 0.23.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.23.0
>
>         Attachments: HDFSoverHTTP-API.html, HdfsHttpAPI.pdf
>
>
> We'd like to contribute Hoop to Hadoop HDFS as a replacement (an improvement)
> for HDFS Proxy.
> Hoop provides access to all Hadoop Distributed File System (HDFS) operations
> (read and write) over HTTP/S.
> The Hoop server component is a REST HTTP gateway to HDFS supporting all file
> system operations. It can be accessed using standard HTTP tools (e.g. curl
> and wget), HTTP libraries from different programming languages (e.g. Perl,
> JavaScript) as well as using the Hoop client. The Hoop server component is a
> standard Java web-application and it has been implemented using Jersey
> (JAX-RS).
> The Hoop client component is an implementation of the Hadoop FileSystem client
> that allows using the familiar Hadoop filesystem API to access HDFS data
> through a Hoop server.
> Repo: https://github.com/cloudera/hoop
> Docs: http://cloudera.github.com/hoop
> Blog: http://www.cloudera.com/blog/2011/07/hoop-hadoop-hdfs-over-http/
> Hoop is a Maven based project that depends on Hadoop HDFS and Alfredo (for
> Kerberos HTTP SPNEGO authentication).
> To make the integration easy, HDFS Mavenization (HDFS-2096) would have to be
> done first, as well as the Alfredo contribution (HADOOP-7119).