Just while I am on this thread of performance, I thought I would continue...
I am moving a 200-million-record table (mostly varchar) from a MyISAM MySQL
database to Hadoop, and the first MapReduce tests look pretty positive. A full
table scan in MySQL takes 17 minutes on a $20,000 fairly heavyweight server
with 64G of memory (sorry, I don't know the exact CPU/disk specs offhand),
while a count that can use an indexed INTEGER column (index pegged in memory)
takes 3 minutes. After exporting the table and importing it into HDFS as a
tab-delimited file (100GB) on the cluster described below in this thread, the
same scans take 4 minutes with a custom MR job. Actually I have two
200-million-record tables, and a full scan requiring a join of both tables in
MySQL takes around 30 minutes, but I joined them in the export before the
import into Hadoop. So some of my ad hoc reports have just come down from 30
minutes to 4 minutes (and I will use Hive for this).

Tim
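
P.S. In case it is useful to anyone, the export/join/import boils down to
something like the following. This is only a sketch -- the table, column and
path names are made up for illustration, not my real schema:

  # Join the two tables on the MySQL side and dump tab-delimited text.
  # --batch prints tab-separated rows; --skip-column-names drops the header.
  mysql --batch --skip-column-names -u someuser -p mydb \
    -e "SELECT a.*, b.* FROM table_a a JOIN table_b b ON a.id = b.a_id" \
    > /data/export/joined.tsv

  # Copy the flat file into HDFS for the MR jobs (and later Hive) to scan:
  hadoop fs -put /data/export/joined.tsv /user/tim/joined.tsv
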
On Mon, Oct 19, 2009 at 4:14 PM, Usman Waheed <[email protected]> wrote:
> Tim,
>
> I have 4 nodes (Quad Core 2.00GHz, 8GB RAM, 4x1TB disks), where one is the
> master+datanode and the rest are datanodes.
>
> Job: sort 40GB of random data
>
> With the following current configuration settings:
>
> io.sort.factor: 10
> io.sort.mb: 100
> io.file.buffer.size: 65536
> mapred.child.java.opts: -Xmx200M
> dfs.datanode.handler.count: 3
> 2 mappers
> 2 reducers
>
> Time taken: 28 minutes
>
> Still testing with more config changes; I will send the results out.
>
> -Usman
>
>> Hi all,
>>
>> I thought I would post the findings of my tuning tests running the
>> sort benchmark.
>>
>> This is all based on 10 machines (1 as master and 9 as DN/TT), each a
>> Dell R300: 2.83GHz quad-core (2x6MB cache, 1 processor), 8G RAM and
>> 2x500G SATA drives.
>>
>> --- Vanilla installation ---
>> 2M 2R: 36 mins
>> 4M 4R: 36 mins (yes, the same)
>>
>> --- Tuned according to Cloudera (http://tinyurl.com/ykupczu) ---
>> io.sort.factor: 20 (mapred-site.xml)
>> io.sort.mb: 200 (mapred-site.xml)
>> io.file.buffer.size: 65536 (core-site.xml)
>> mapred.child.java.opts: -Xmx512M (mapred-site.xml)
>>
>> 2M 2R: 33.5 mins
>> 4M 4R: 29 mins
>> 8M 8R: 41 mins
>>
>> --- Increasing the task memory a little ---
>> io.sort.factor: 20
>> io.sort.mb: 200
>> io.file.buffer.size: 65536
>> mapred.child.java.opts: -Xmx1G
>>
>> 2M 2R: 29 mins (adding dfs.datanode.handler.count=8 resulted in 30 mins)
>> 4M 4R: 29 mins (yes, the same)
>>
>> --- Increasing sort memory ---
>> io.sort.factor: 32
>> io.sort.mb: 320
>> io.file.buffer.size: 65536
>> mapred.child.java.opts: -Xmx1G
>>
>> 2M 2R: 31 mins (yes, longer than with the smaller sort settings)
>>
>> I am going to stick with the following for now and get back to work:
>> io.sort.factor: 20
>> io.sort.mb: 200
>> io.file.buffer.size: 65536
>> mapred.child.java.opts: -Xmx1G
>> dfs.datanode.handler.count: 8
>> 4 mappers
>> 4 reducers
>>
>> Hope that helps someone. How did your tuning go, Usman?
>>
>> Tim
>>
>> On Fri, Oct 16, 2009 at 10:41 PM, tim robertson
>> <[email protected]> wrote:
>>>
>>> No worries Usman, I will try and do the same on Monday.
>>>
>>> Thanks, Todd, for the clarification.
>>>
>>> Tim
>>>
>>> On Fri, Oct 16, 2009 at 5:30 PM, Usman Waheed <[email protected]> wrote:
>>>>
>>>> Hi Tim,
>>>>
>>>> I have been swamped with some other stuff, so I did not get a chance
>>>> to run further tests on my setup. I will send them out early next
>>>> week so we can compare.
>>>>
>>>> Cheers,
>>>> Usman
>>>>
>>>>> On Fri, Oct 16, 2009 at 4:01 AM, tim robertson
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Adding the following to core-site.xml, mapred-site.xml and
>>>>>> hdfs-site.xml (based on the Cloudera guidelines:
>>>>>> http://tinyurl.com/ykupczu):
>>>>>>
>>>>>> io.sort.factor: 15 (mapred-site.xml)
>>>>>> io.sort.mb: 150 (mapred-site.xml)
>>>>>> io.file.buffer.size: 65536 (core-site.xml)
>>>>>> dfs.datanode.handler.count: 3 (hdfs-site.xml -- actually the default)
>>>>>>
>>>>>> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh).
>>>>>>
>>>>>> Using 2 mappers and 2 reducers, can someone please help me with the
>>>>>> maths as to why my jobs are failing with "Error: Java heap space" in
>>>>>> the maps? (The same runs fine with io.sort.factor of 10 and
>>>>>> io.sort.mb of 100.)
>>>>>>
>>>>>> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
>>>>>> Plus the 2 daemons on the node at 1G each = 1.8G
>>>>>> Plus Xmx of 1G for each Hadoop daemon task = 5.8G
>>>>>>
>>>>>> The machines have 8G in them. Obviously my maths is screwy
>>>>>> somewhere...
>>>>>
>>>>> Hi Tim,
>>>>>
>>>>> Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE
>>>>> parameter is for the daemons, not the tasks. If you bump up io.sort.mb
>>>>> you also have to bump up the -Xmx argument in mapred.child.java.opts
>>>>> to give the actual tasks more RAM.
>>>>>
>>>>> -Todd
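>>>>>
>>>>> P.S. To make that concrete: something like the following per job,
>>>>> assuming the job runs through ToolRunner so the generic -D options are
>>>>> picked up (the jar, class and path names here are placeholders):
>>>>>
>>>>>   hadoop jar your-job.jar YourJob \
>>>>>     -D io.sort.mb=200 \
>>>>>     -D mapred.child.java.opts=-Xmx512m \
>>>>>     input output
>>>>>
>>>>> or set both permanently as properties in mapred-site.xml.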
>>>>>> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, 15 Oct 2009 11:32:35 +0200
>>>>>>> Usman Waheed <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>
>>>>>>>> Some changes have been applied to the cluster based on the
>>>>>>>> documentation (URL) you noted below.
>>>>>>>
>>>>>>> I would also like to know what settings people are tuning at the
>>>>>>> operating system level. The blog post mentioned here does not say
>>>>>>> much about that, except for the fileno changes.
>>>>>>>
>>>>>>> We got about 3x the read performance when running TestDFSIO by
>>>>>>> mounting our ext3 filesystems with the noatime option. I saw that
>>>>>>> mentioned in the slides from some Cloudera presentation.
>>>>>>>
>>>>>>> (For those who don't know, the noatime option turns off the
>>>>>>> recording of access times on files. That's a horrible performance
>>>>>>> killer, since it means every read of a file also requires the kernel
>>>>>>> to do a write. These writes are probably queued up, but still, if
>>>>>>> you don't need the atime (very few applications do), turn it off!)
>>>>>>>
>>>>>>> Have people been experimenting with different filesystems, or are
>>>>>>> most of us running on top of ext3?
>>>>>>>
>>>>>>> How about mounting ext3 with "data=writeback"? That's rumoured to
>>>>>>> give the best throughput and could help with write performance.
>>>>>>> From mount(8):
>>>>>>>
>>>>>>>   writeback
>>>>>>>     Data ordering is not preserved - data may be written into the
>>>>>>>     main file system after its metadata has been committed to the
>>>>>>>     journal. This is rumoured to be the highest throughput option.
>>>>>>>     It guarantees internal file system integrity, however it can
>>>>>>>     allow old data to appear in files after a crash and journal
>>>>>>>     recovery.
>>>>>>>
>>>>>>> How would the HDFS consistency checks cope with old data appearing
>>>>>>> in the underlying files after a system crash?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> \EF
>>>>>>> --
>>>>>>> Erik Forsberg <[email protected]>
>>>>>>> Developer, Opera Software - http://www.opera.com/
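>>>>>>>
>>>>>>> P.S. For anyone who wants to try this, the mount options would go
>>>>>>> in /etc/fstab roughly like the line below -- the device and mount
>>>>>>> point are only examples, adjust them for your own data disks:
>>>>>>>
>>>>>>>   # ext3 data disk: no atime updates, writeback journaling for
>>>>>>>   # throughput (at the cost of possibly stale data after a crash)
>>>>>>>   /dev/sdb1  /data1  ext3  defaults,noatime,data=writeback  0  2
>>>>>>>
>>>>>>>   # noatime alone can also be applied to a mounted filesystem:
>>>>>>>   mount -o remount,noatime /data1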
