Brian, I saw that Stuart here <http://stuartsierra.com/2008/04/24/a-million-little-files> mentions slow writes to SequenceFile. If so, I will either use his tar approach or try to parallelize it if I can.
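
To put the numbers in this thread side by side, here is a rough sketch, not a benchmark. It compares a measured rate with the roughly 5MB/s floor that Brian suggests below for a plain hadoop fs -put. The helper names (throughput_mb_per_s, timed, below_floor) are made up for illustration, and treating a "Meg" as 1024 * 1024 bytes is an assumption.

```python
import time

MB = 1024 * 1024  # bytes per megabyte (assumption: binary megabytes)


def throughput_mb_per_s(num_bytes, seconds):
    """Convert a byte count and an elapsed time into MB/s."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_bytes / MB / seconds


def timed(action):
    """Run a zero-argument callable and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def below_floor(rate_mb_per_s, floor_mb_per_s=5.0):
    """True if the rate is under the rule-of-thumb floor (5MB/s by default)."""
    return rate_mb_per_s < floor_mb_per_s


# The SequenceFile case reported in this thread: about 1 second per megabyte.
sequencefile_rate = throughput_mb_per_s(10 * MB, 10.0)
suspicious = below_floor(sequencefile_rate)
```

To time a real transfer, timed could be given a callable that runs the upload (for example through subprocess.run), and the file size in bytes would then go to throughput_mb_per_s together with the elapsed time.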
On Tue, Feb 10, 2009 at 11:14 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

> On Feb 10, 2009, at 11:09 PM, Mark Kerzner wrote:
>
>> Brian, large files using command-line hadoop go fast, so it is something
>> about my computer or network. I won't worry about this now, especially in
>> light of Amit reporting fast writes and reads.
>
> You're creating files using SequenceFile, right? It might be that the
> creation of the sequence file is the portion which is slow, not the network
> I/O.
>
> I don't have much knowledge about optimization of SequenceFile creation. I
> assume that you'll want to start by tweaking compression on and off.
> Additionally, Jeff (I think) pointed to a Hadoop Archive file, which also
> might be an alternative for your system. I don't know enough to give you a
> set of pros and cons, just enough to mention it as an alternative to
> experiment with.
>
> Sorry I'm not useful here...
>
> Brian
>
>> Mark
>>
>> On Tue, Feb 10, 2009 at 5:00 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>>
>>> On Feb 10, 2009, at 4:53 PM, Mark Kerzner wrote:
>>>
>>>> Brian, I have a similar question: why does transfer from a local filesystem
>>>> to SequenceFile takes so long (about 1 second per Meg)?
>>>
>>> Hey Mark,
>>>
>>> I saw your question about speed the other day ... unfortunately, I didn't
>>> have any specific advice so I stayed quiet :)
>>>
>>> In a correctly configured cluster, performance is mostly limited by
>>> available hardware. If it's obvious that performance is well below hardware
>>> limits (such as in your case), it's usually (a) you're not generating files
>>> fast enough or (b) something is configured wrong.
>>>
>>> Have you just tried hadoop fs -put .... for some large file hanging around
>>> locally? If that doesn't go more than 5MB/s or so (when your hardware can
>>> obviously do such a rate), then there's probably a configuration issue.
>>>
>>> Brian
>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>> On Tue, Feb 10, 2009 at 4:46 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>>>>
>>>>> On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could someone help me to find some real Figures (transfer rate) about
>>>>>> Hadoop File transfer from local filesystem to HDFS, S3 etc and among
>>>>>> Storage Systems (HDFS to S3 etc)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Wasim
>>>>>
>>>>> What are you looking for? Maximum possible transfer rate? Maximum
>>>>> possible transfer rate per client? Generally, if you're using the Java
>>>>> client, transfer rate to/from HDFS is limited by the hardware you have and
>>>>> the network connection (if you have 1Gbps per client).
>>>>>
>>>>> I could give you a graph showing a peak of 9Gbps from our Hadoop instance
>>>>> to the WAN, but that's not very interesting if you don't have a 10Gbps
>>>>> pipe...
>>>>>
>>>>> Brian