> There's actually an open ticket somewhere to make distcp do this using the
> new concat() API in the NameNode.

Where can I find that "open ticket"?

> concat() allows several files to be combined into one file at the metadata
> level, so long as a number of restrictions are met. The work hasn't been
> done yet, but the concat() call is there and waiting for a user.

Well, this sounds good when you have many small files: you concat() them into
a big one. But I am talking about splitting a big file into blocks and copying
a few blocks in parallel.
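(For reference, a minimal sketch of what a caller of that concat() call might
look like, assuming the DistributedFileSystem.concat(target, sources) entry
point is the client-side wrapper for the NameNode API Todd refers to. The
paths below are made up, and the NameNode-enforced restrictions may well
reject a real call; this is only an illustration, not the distcp work itself.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ConcatParts {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("concat() is an HDFS-only call");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hypothetical paths: part files written in parallel, merged afterwards.
    Path target = new Path("/copied/bigfile.part0");   // grows into the final file
    Path[] others = {
        new Path("/copied/bigfile.part1"),
        new Path("/copied/bigfile.part2")
    };

    // Metadata-only merge on the NameNode: the blocks of part1/part2 are
    // appended to part0's block list; no data is rewritten on the DataNodes.
    dfs.concat(target, others);
  }
}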
On Tue, Jul 6, 2010 at 3:01 AM, Todd Lipcon <[email protected]> wrote:

> On Mon, Jul 5, 2010 at 5:08 AM, elton sky <[email protected]> wrote:
>
> > Segel, Jay
> > Thanks for the reply!
> >
> > > Your parallelism comes from multiple tasks running on different nodes
> > > within the cloud. By default you get one map/reduce job per block. You
> > > can write your own splitter to increase this and then get more
> > > parallelism.
> >
> > Sounds like an elegant solution. We can modify 'distcp', using a simple
> > MR job, to make it based on blocks rather than files.
>
> There's actually an open ticket somewhere to make distcp do this using the
> new concat() API in the NameNode. concat() allows several files to be
> combined into one file at the metadata level, so long as a number of
> restrictions are met. The work hasn't been done yet, but the concat() call
> is there and waiting for a user.
>
> -Todd
>
> > > in practice, you very rarely know how big your output is going to be
> > > before it's produced, so this doesn't really work
> >
> > I think you got the point of why Yahoo made this design decision.
> > Multithreading is only applicable when you know the size of the file,
> > like copying existing files, so you can split them and feed them to
> > different threads.
> >
> > On Sat, Jul 3, 2010 at 1:24 AM, Jay Booth <[email protected]> wrote:
> >
> > > Yeah, a good way to think of it is that parallelism is achieved at the
> > > application level.
> > >
> > > On the input side, you can process multiple files in parallel, or one
> > > file in parallel by logically splitting it and opening multiple
> > > readers of the same file at multiple points. Each of these readers is
> > > single threaded, because, well, you're returning a stream of bytes in
> > > order. It's inherently serial.
> > >
> > > On the reduce side, multiple reduces run, writing to multiple files in
> > > the same directory. Again, you can't really write to a single file in
> > > parallel effectively -- you can't write byte 26 before byte 25,
> > > because the file's not that long yet.
> > >
> > > Theoretically, maybe you could have all reduces write to the same file
> > > by allocating some amount of space ahead of time and writing to the
> > > blocks in parallel - in practice, you very rarely know how big your
> > > output is going to be before it's produced, so this doesn't really
> > > work. Multiple files in the same directory achieves the same goal much
> > > more elegantly, without exposing a bunch of internal details of the
> > > filesystem to user space.
> > >
> > > Does that make sense?
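(A rough sketch of what Jay's "multiple readers of the same file at multiple
points" looks like in client code. The chunking at block-size boundaries and
the one-thread-per-chunk layout are assumptions made for illustration; each
chunk is still consumed as an ordinary serial stream, which is exactly his
point.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockRead {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    final Path file = new Path(args[0]);

    FileStatus stat = fs.getFileStatus(file);
    final long blockSize = stat.getBlockSize();
    final long fileLen = stat.getLen();

    int chunks = (int) ((fileLen + blockSize - 1) / blockSize);
    Thread[] readers = new Thread[chunks];
    for (int i = 0; i < chunks; i++) {
      final long start = (long) i * blockSize;
      final long end = Math.min(start + blockSize, fileLen);
      readers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            FSDataInputStream in = fs.open(file);  // one independent stream per chunk
            try {
              in.seek(start);                      // jump to this chunk's offset
              byte[] buf = new byte[64 * 1024];
              long pos = start;
              while (pos < end) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, end - pos));
                if (n < 0) break;
                pos += n;                          // consume buf[0..n) here
              }
            } finally {
              in.close();
            }
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
      readers[i].start();
    }
    for (Thread t : readers) t.join();
  }
}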
> > > On Fri, Jul 2, 2010 at 9:26 AM, Segel, Mike <[email protected]> wrote:
> > >
> > > > Actually they also listen here, and this is a basic question...
> > > >
> > > > I'm not an expert, but how does having multiple threads really help
> > > > this problem?
> > > >
> > > > I'm assuming you're talking about a map/reduce job and not some
> > > > specific client code which is being run on a client outside of the
> > > > cloud/cluster...
> > > >
> > > > I wasn't aware that you could easily synchronize threads running on
> > > > different JVMs. ;-)
> > > >
> > > > Your parallelism comes from multiple tasks running on different
> > > > nodes within the cloud. By default you get one map/reduce job per
> > > > block. You can write your own splitter to increase this and then get
> > > > more parallelism.
> > > >
> > > > HTH
> > > >
> > > > -Mike
> > > >
> > > > -----Original Message-----
> > > > From: Hemanth Yamijala [mailto:[email protected]]
> > > > Sent: Friday, July 02, 2010 2:56 AM
> > > > To: [email protected]
> > > > Subject: Re: Why single thread for HDFS?
> > > >
> > > > Hi,
> > > >
> > > > Can you please post this on [email protected]? I suspect
> > > > the most qualified people to answer this question would all be on
> > > > that list.
> > > >
> > > > Hemanth
> > > >
> > > > On Fri, Jul 2, 2010 at 11:43 AM, elton sky <[email protected]> wrote:
> > > > > I guess this question was ignored, so I just post it again.
> > > > >
> > > > > From my understanding, HDFS uses a single thread to do reads and
> > > > > writes. Since a file is composed of many blocks, and each block is
> > > > > stored as a file in the underlying FS, we can do some parallelism
> > > > > on a per-block basis. When reading across multiple blocks, threads
> > > > > can be used to read all the blocks. When writing, we can calculate
> > > > > the offset of each block and write to all of them simultaneously.
> > > > >
> > > > > Is this right?
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
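(Finally, a sketch of the "one split per block" idea that comes up twice above,
for a block-based distcp and for Mike's custom-splitter suggestion. The class
PerBlockSplitter and the method splitsForFile are made-up names, not part of
Hadoop; the sketch only enumerates a file's block boundaries and emits one
FileSplit per block so a copy job could work on the blocks of a single large
file in parallel.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerBlockSplitter {
  /** Build one FileSplit per HDFS block of the given file (illustrative only). */
  public static List<FileSplit> splitsForFile(FileSystem fs, Path file)
      throws IOException {
    FileStatus stat = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());

    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (BlockLocation b : blocks) {
      // Each split covers exactly one block and carries the hosts holding
      // that block, so a map task copying it can be scheduled near the data.
      splits.add(new FileSplit(file, b.getOffset(), b.getLength(), b.getHosts()));
    }
    return splits;
  }
}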
