> There's actually an open ticket somewhere to make distcp do this using the
> new concat() API in the NameNode.

Where can I find that "open ticket"?

> concat() allows several files to be combined into one file at the metadata
> level, so long as a number of restrictions are met. The work hasn't been
> done yet, but the concat() call is there and waiting for a user.

Well, this sounds good when you have many small files: you concat() them into
a big one. But I am talking about splitting a big file into blocks and copying
a few blocks in parallel.
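(For reference, a minimal sketch of what a caller of that concat() call might
look like, assuming the DistributedFileSystem.concat(target, sources) entry
point is the client-side wrapper for the NameNode API Todd refers to. The
paths below are made up, and the NameNode-enforced restrictions may well
reject a real call; this is only an illustration, not the distcp work itself.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ConcatParts {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("concat() is an HDFS-only call");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hypothetical paths: part files written in parallel, merged afterwards.
    Path target = new Path("/copied/bigfile.part0");   // grows into the final file
    Path[] others = {
        new Path("/copied/bigfile.part1"),
        new Path("/copied/bigfile.part2")
    };

    // Metadata-only merge on the NameNode: the blocks of part1/part2 are
    // appended to part0's block list; no data is rewritten on the DataNodes.
    dfs.concat(target, others);
  }
}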
On Tue, Jul 6, 2010 at 3:01 AM, Todd Lipcon <[email protected]> wrote:

> On Mon, Jul 5, 2010 at 5:08 AM, elton sky <[email protected]> wrote:
>
> > Segel, Jay
> > Thanks for the reply!
> >
> > > Your parallelism comes from multiple tasks running on different nodes
> > > within the cloud. By default you get one map/reduce job per block. You
> > > can write your own splitter to increase this and then get more
> > > parallelism.
> >
> > Sounds like an elegant solution. We can modify 'distcp', using a simple
> > MR job, to make it based on blocks rather than files.
>
> There's actually an open ticket somewhere to make distcp do this using the
> new concat() API in the NameNode. concat() allows several files to be
> combined into one file at the metadata level, so long as a number of
> restrictions are met. The work hasn't been done yet, but the concat() call
> is there and waiting for a user.
>
> -Todd
>
> > > in practice, you very rarely know how big your output is going to be
> > > before it's produced, so this doesn't really work
> >
> > I think you got the point of why Yahoo made this design decision.
> > Multithreading is only applicable when you know the size of the file,
> > like copying existing files, so you can split them and feed them to
> > different threads.
> >
> > On Sat, Jul 3, 2010 at 1:24 AM, Jay Booth <[email protected]> wrote:
> >
> > > Yeah, a good way to think of it is that parallelism is achieved at the
> > > application level.
> > >
> > > On the input side, you can process multiple files in parallel, or one
> > > file in parallel by logically splitting it and opening multiple
> > > readers of the same file at multiple points. Each of these readers is
> > > single threaded, because, well, you're returning a stream of bytes in
> > > order. It's inherently serial.
> > >
> > > On the reduce side, multiple reduces run, writing to multiple files in
> > > the same directory. Again, you can't really write to a single file in
> > > parallel effectively -- you can't write byte 26 before byte 25,
> > > because the file's not that long yet.
> > >
> > > Theoretically, maybe you could have all reduces write to the same file
> > > by allocating some amount of space ahead of time and writing to the
> > > blocks in parallel - in practice, you very rarely know how big your
> > > output is going to be before it's produced, so this doesn't really
> > > work. Multiple files in the same directory achieves the same goal much
> > > more elegantly, without exposing a bunch of internal details of the
> > > filesystem to user space.
> > >
> > > Does that make sense?
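(A rough sketch of what Jay's "multiple readers of the same file at multiple
points" looks like in client code. The chunking at block-size boundaries and
the one-thread-per-chunk layout are assumptions made for illustration; each
chunk is still consumed as an ordinary serial stream, which is exactly his
point.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockRead {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    final Path file = new Path(args[0]);

    FileStatus stat = fs.getFileStatus(file);
    final long blockSize = stat.getBlockSize();
    final long fileLen = stat.getLen();

    int chunks = (int) ((fileLen + blockSize - 1) / blockSize);
    Thread[] readers = new Thread[chunks];
    for (int i = 0; i < chunks; i++) {
      final long start = (long) i * blockSize;
      final long end = Math.min(start + blockSize, fileLen);
      readers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            FSDataInputStream in = fs.open(file);  // one independent stream per chunk
            try {
              in.seek(start);                      // jump to this chunk's offset
              byte[] buf = new byte[64 * 1024];
              long pos = start;
              while (pos < end) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, end - pos));
                if (n < 0) break;
                pos += n;                          // consume buf[0..n) here
              }
            } finally {
              in.close();
            }
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
      readers[i].start();
    }
    for (Thread t : readers) t.join();
  }
}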
> > > On Fri, Jul 2, 2010 at 9:26 AM, Segel, Mike <[email protected]> wrote:
> > >
> > > > Actually they also listen here, and this is a basic question...
> > > >
> > > > I'm not an expert, but how does having multiple threads really help
> > > > this problem?
> > > >
> > > > I'm assuming you're talking about a map/reduce job and not some
> > > > specific client code which is being run on a client outside of the
> > > > cloud/cluster...
> > > >
> > > > I wasn't aware that you could easily synchronize threads running on
> > > > different JVMs. ;-)
> > > >
> > > > Your parallelism comes from multiple tasks running on different
> > > > nodes within the cloud. By default you get one map/reduce job per
> > > > block. You can write your own splitter to increase this and then get
> > > > more parallelism.
> > > >
> > > > HTH
> > > >
> > > > -Mike
> > > >
> > > > -----Original Message-----
> > > > From: Hemanth Yamijala [mailto:[email protected]]
> > > > Sent: Friday, July 02, 2010 2:56 AM
> > > > To: [email protected]
> > > > Subject: Re: Why single thread for HDFS?
> > > >
> > > > Hi,
> > > >
> > > > Can you please post this on [email protected]? I suspect
> > > > the most qualified people to answer this question would all be on
> > > > that list.
> > > >
> > > > Hemanth
> > > >
> > > > On Fri, Jul 2, 2010 at 11:43 AM, elton sky <[email protected]> wrote:
> > > > > I guess this question was ignored, so I just post it again.
> > > > >
> > > > > From my understanding, HDFS uses a single thread to do reads and
> > > > > writes. Since a file is composed of many blocks, and each block is
> > > > > stored as a file in the underlying FS, we can do some parallelism
> > > > > on a per-block basis. When reading across multiple blocks, threads
> > > > > can be used to read all the blocks. When writing, we can calculate
> > > > > the offset of each block and write to all of them simultaneously.
> > > > >
> > > > > Is this right?
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
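(Finally, a sketch of the "one split per block" idea that comes up twice above,
for a block-based distcp and for Mike's custom-splitter suggestion. The class
PerBlockSplitter and the method splitsForFile are made-up names, not part of
Hadoop; the sketch only enumerates a file's block boundaries and emits one
FileSplit per block so a copy job could work on the blocks of a single large
file in parallel.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerBlockSplitter {
  /** Build one FileSplit per HDFS block of the given file (illustrative only). */
  public static List<FileSplit> splitsForFile(FileSystem fs, Path file)
      throws IOException {
    FileStatus stat = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());

    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (BlockLocation b : blocks) {
      // Each split covers exactly one block and carries the hosts holding
      // that block, so a map task copying it can be scheduled near the data.
      splits.add(new FileSplit(file, b.getOffset(), b.getLength(), b.getHosts()));
    }
    return splits;
  }
}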
