+1 on the merge. We've been using it on the trunk of ORC for a while. It will be great to have it released by Hadoop.
.. Owen

On Thu, Mar 1, 2018 at 10:31 AM, Vinayakumar B <vinayakum...@apache.org> wrote:
> Definitely this would be a great addition. Kudos to everyone's
> contributions.
>
> I am not a C++ expert, so I cannot vote on the code.
>
> -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as
> a drop-in replacement for clients that only need read support (until
> libhdfs++ also supports writes).
>
> Wouldn't it be nice to have write support as well before the merge?
> If everyone feels it's okay to have read-only for now, I am okay anyway.
>
> On 1 Mar 2018 11:35 pm, "Jim Clampffer" <james.clampf...@gmail.com> wrote:
>
> > Thanks for the feedback, Chris and Kai!
> >
> > Chris, do you mean potentially landing this in its current state and
> > handling some of the rough edges after? I could see this working just
> > because there's no impact on any existing code.
> >
> > With regards to your questions, Kai:
> > There isn't a good doc for the internal architecture yet; I just
> > reassigned HDFS-9115 to myself to handle that. Are there any specific
> > areas you'd like to know about so I can prioritize those?
> > Here are some header files that include a lot of comments and should
> > help out for now:
> > -hdfspp.h - main header for the C++ API
> > -filesystem.h and filehandle.h - describe some rules about object
> > lifetimes and threading from the API point of view (most classes have
> > comments describing any restrictions on threading, locking, and
> > lifecycle).
> > -rpc_engine.h and rpc_connection.h - begin getting into the async RPC
> > implementation.
> >
> > 1) Yes, it's a reimplementation of the entire client in C++. Using
> > libhdfs3 as a reference helps a lot here, but it's still a lot of work.
> > 2) EC isn't supported now, though that'd be great to have, and I agree
> > that it's going to take a lot of effort to implement.
> > Right now if you tried to read an EC file I think you'd get some
> > unhelpful error out of the block reader, but I don't have an EC-enabled
> > cluster set up to test. Adding an explicit "not supported" message
> > would be straightforward.
> > 3) libhdfs++ reuses all of the minidfscluster tests that libhdfs
> > already had, so we get consistency checks on the C API. There are a few
> > new tests that also get run on both libhdfs and libhdfs++ and make sure
> > the expected output is the same too.
> > 4) I agree, I just haven't had a chance to look into the distribution
> > build to see how to do it. HDFS-9465 is tracking this.
> > 5) Not yet (HDFS-8765).
> >
> > Regards,
> > James
> >
> > On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai...@alibaba-inc.com>
> > wrote:
> >
> > > The work sounds solid and great! +1 to have this.
> > >
> > > Is there any quick doc to take a glance at? Some quick questions to
> > > get familiar with it:
> > > 1. It seems the client is implemented entirely in C++ without any
> > > Java code (so no JVM overhead), which means lots of work rewriting
> > > the HDFS client. Right?
> > > 2. I guess the erasure coding feature isn't supported, as it'd
> > > involve significant development, right? If so, what will it say when
> > > reading an erasure-coded file?
> > > 3. Is there any building/testing mechanism to enforce consistency
> > > between the C++ part and the Java part?
> > > 4. I thought the public header and lib should be exported when
> > > building the distribution package; otherwise it is hard to use the
> > > new C API.
> > > 5. Is short-circuit read supported?
> > >
> > > Thanks.
> > >
> > > Regards,
> > > Kai
> > >
> > > ------------------------------------------------------------------
> > > From: Chris Douglas <cdoug...@apache.org>
> > > Sent: Thursday, March 1, 2018, 05:08
> > > To: Jim Clampffer <james.clampf...@gmail.com>
> > > Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
> > > Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
> > >
> > > +1
> > >
> > > Let's get this done. We've had many false starts on a native HDFS
> > > client. This is a good base to build on. -C
> > >
> > > On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
> > > <james.clampf...@gmail.com> wrote:
> > > > Hi everyone,
> > > >
> > > > I'd like to start a thread to discuss merging HDFS-8707, aka
> > > > libhdfs++, into trunk. I originally sent a similar email out last
> > > > October, but it sounds like it was buried by discussions about
> > > > other feature merges that were going on at the time.
> > > >
> > > > libhdfs++ is an HDFS client written in C++ designed to be used in
> > > > applications that are written in non-JVM-based languages. In its
> > > > current state it supports Kerberos-authenticated reads from HDFS
> > > > and has been used in production clusters for over a year, so it has
> > > > had a significant amount of burn-in time. The HDFS-8707 branch has
> > > > been around for about 2 years now, so I'd like to know people's
> > > > thoughts on what it would take to merge the current branch, with
> > > > writes and encrypted reads handled in a new one.
> > > >
> > > > Current notable features:
> > > > -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to
> > > > serve as a drop-in replacement for clients that only need read
> > > > support (until libhdfs++ also supports writes).
> > > > -An asynchronous C++ API with synchronous shims on top if the
> > > > client application wants to do blocking operations.
> > > > Internally a single thread (optionally more) uses select/epoll by
> > > > way of boost::asio to watch thousands of sockets without the
> > > > overhead of spawning threads to emulate async operation.
> > > > -Kerberos/SASL authentication support.
> > > > -HA namenode support.
> > > > -A set of utility programs that mirror the HDFS CLI utilities,
> > > > e.g. "./hdfs dfs -chmod". The major benefit of these is that tool
> > > > startup time is ~3 orders of magnitude faster (<1ms vs hundreds of
> > > > ms) and they occupy a lot less memory since they aren't dealing
> > > > with the JVM. This makes it possible to do things like write a
> > > > simple bash script that stats a file, applies some rules to the
> > > > result, and decides if it should move it, in a way that scales to
> > > > thousands of files without being penalized with O(N) JVM startups.
> > > > -Cancelable reads. This has proven to be very useful in multiuser
> > > > applications that (pre)fetch large blocks of data but need to
> > > > remain responsive for interactive users. Rather than waiting for a
> > > > large and/or slow read to finish, the read returns immediately and
> > > > the associated resources (buffer, file descriptor) become available
> > > > for the rest of the application to use.
> > > >
> > > > There are a couple of known issues: the doc build isn't integrated
> > > > with the rest of Hadoop, and the public API headers aren't being
> > > > exported when building a distribution. A short-term solution for
> > > > the missing docs is to go through the libhdfs(3) compatible API and
> > > > use the libhdfs docs. Other than a few modifications to the pom
> > > > files to integrate the build, the changes are isolated to a new
> > > > directory, so the chance of causing any regressions in the rest of
> > > > the code is minimal.
> > > >
> > > > Please share your thoughts, thanks!