+1 on the merge. We've been using it on the trunk of ORC for a while. It will be great to have it released by Hadoop.
.. Owen

On Thu, Mar 1, 2018 at 10:31 AM, Vinayakumar B <vinayakum...@apache.org> wrote:
> Definitely this would be a great addition. Kudos to everyone's
> contributions.
>
> I am not a C++ expert, so I cannot vote on the code.
>
> -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as
> a drop-in replacement for clients that only need read support (until
> libhdfs++ also supports writes).
>
> Wouldn't it be nice to have write support as well before the merge?
> If everyone feels it's okay to have read-only for now, I am okay anyway.
>
> On 1 Mar 2018 11:35 pm, "Jim Clampffer" <james.clampf...@gmail.com> wrote:
>
> > Thanks for the feedback, Chris and Kai!
> >
> > Chris, do you mean potentially landing this in its current state and
> > handling some of the rough edges after? I could see this working just
> > because there's no impact on any existing code.
> >
> > With regards to your questions, Kai:
> > There isn't a good doc for the internal architecture yet; I just
> > reassigned HDFS-9115 to myself to handle that. Are there any specific
> > areas you'd like to know about so I can prioritize those?
> > Here are some header files that include a lot of comments and should
> > help out for now:
> > -hdfspp.h - main header for the C++ API
> > -filesystem.h and filehandle.h - describe some rules about object
> > lifetimes and threading from the API point of view (most classes have
> > comments describing any restrictions on threading, locking, and
> > lifecycle).
> > -rpc_engine.h and rpc_connection.h - begin getting into the async RPC
> > implementation.
> >
> > 1) Yes, it's a reimplementation of the entire client in C++. Using
> > libhdfs3 as a reference helps a lot here, but it's still a lot of work.
> > 2) EC isn't supported now, though that'd be great to have, and I agree
> > that it's going to take a lot of effort to implement.
> > Right now if you tried to read an EC file I think you'd get some
> > unhelpful error out of the block reader, but I don't have an EC-enabled
> > cluster set up to test. Adding an explicit "not supported" message
> > would be straightforward.
> > 3) libhdfs++ reuses all of the minidfscluster tests that libhdfs
> > already had, so we get consistency checks on the C API. There are a few
> > new tests that also get run on both libhdfs and libhdfs++ and make sure
> > the expected output is the same too.
> > 4) I agree, I just haven't had a chance to look into the distribution
> > build to see how to do it. HDFS-9465 is tracking this.
> > 5) Not yet (HDFS-8765).
> >
> > Regards,
> > James
> >
> > On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai...@alibaba-inc.com>
> > wrote:
> >
> > > The work sounds solid and great! +1 to have this.
> > >
> > > Is there any quick doc to take a glance at? Some quick questions to
> > > get familiar with it:
> > > 1. It seems the client is implemented entirely in C++ without any
> > > Java code (so no JVM overhead), which means lots of work rewriting
> > > the HDFS client. Right?
> > > 2. I guess the erasure coding feature isn't supported, as it'd
> > > involve significant development, right? If so, what will it say when
> > > reading an erasure-coded file?
> > > 3. Is there any building/testing mechanism to enforce consistency
> > > between the C++ part and the Java part?
> > > 4. I thought the public header and lib should be exported when
> > > building the distribution package; otherwise it is hard to use the
> > > new C API.
> > > 5. Is short-circuit read supported?
> > >
> > > Thanks.
> > >
> > > Regards,
> > > Kai
> > >
> > > ------------------------------------------------------------------
> > > From: Chris Douglas <cdoug...@apache.org>
> > > Sent: Thursday, March 1, 2018, 05:08
> > > To: Jim Clampffer <james.clampf...@gmail.com>
> > > Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
> > > Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
> > >
> > > +1
> > >
> > > Let's get this done. We've had many false starts on a native HDFS
> > > client. This is a good base to build on. -C
> > >
> > > On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
> > > <james.clampf...@gmail.com> wrote:
> > > > Hi everyone,
> > > >
> > > > I'd like to start a thread to discuss merging HDFS-8707, aka
> > > > libhdfs++, into trunk. I originally sent a similar email out last
> > > > October, but it sounds like it was buried by discussions about
> > > > other feature merges that were going on at the time.
> > > >
> > > > libhdfs++ is an HDFS client written in C++ designed to be used in
> > > > applications that are written in non-JVM-based languages. In its
> > > > current state it supports Kerberos-authenticated reads from HDFS
> > > > and has been used in production clusters for over a year, so it has
> > > > had a significant amount of burn-in time. The HDFS-8707 branch has
> > > > been around for about 2 years now, so I'd like to know people's
> > > > thoughts on what it would take to merge the current branch, with
> > > > writes and encrypted reads handled in a new one.
> > > >
> > > > Current notable features:
> > > > -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to
> > > > serve as a drop-in replacement for clients that only need read
> > > > support (until libhdfs++ also supports writes).
> > > > -An asynchronous C++ API with synchronous shims on top if the
> > > > client application wants to do blocking operations.
> > > > Internally a single thread (optionally more) uses select/epoll by
> > > > way of boost::asio to watch thousands of sockets without the
> > > > overhead of spawning threads to emulate async operation.
> > > > -Kerberos/SASL authentication support.
> > > > -HA namenode support.
> > > > -A set of utility programs that mirror the HDFS CLI utilities,
> > > > e.g. "./hdfs dfs -chmod". The major benefit of these is that tool
> > > > startup time is ~3 orders of magnitude faster (<1ms vs hundreds of
> > > > ms) and they occupy a lot less memory since they aren't dealing
> > > > with the JVM. This makes it possible to do things like write a
> > > > simple bash script that stats a file, applies some rules to the
> > > > result, and decides if it should move it, in a way that scales to
> > > > thousands of files without being penalized with O(N) JVM startups.
> > > > -Cancelable reads. This has proven to be very useful in multiuser
> > > > applications that (pre)fetch large blocks of data but need to
> > > > remain responsive for interactive users. Rather than waiting for a
> > > > large and/or slow read to finish, the read returns immediately and
> > > > the associated resources (buffer, file descriptor) become available
> > > > for the rest of the application to use.
> > > >
> > > > There are a couple of known issues: the doc build isn't integrated
> > > > with the rest of Hadoop, and the public API headers aren't being
> > > > exported when building a distribution. A short-term solution for
> > > > the missing docs is to go through the libhdfs(3) compatible API and
> > > > use the libhdfs docs. Other than a few modifications to the pom
> > > > files to integrate the build, the changes are isolated to a new
> > > > directory, so the chance of causing any regressions in the rest of
> > > > the code is minimal.
> > > >
> > > > Please share your thoughts, thanks!