scaling experiments on a static cluster?
Hi Hadoop mavens-

I'm hoping someone out there will have a quick solution for me. I'm trying to run some very basic scaling experiments for a rapidly approaching paper deadline on a Hadoop 0.16.0 cluster that has ~20 nodes with 2 procs/node. Ideally, I would want to run my code on clusters of different sizes (1, 2, 4, 8, 16 nodes) or some such thing. The problem is that I am not able to reconfigure the cluster (in the long run, i.e., before a final version of the paper, I assume this will be possible, but for now it's not). Setting the number of mappers/reducers does not seem to be a viable option, at least not in the trivial way, since the physical layout of the input files makes Hadoop run a different number of tasks than I request (most of my jobs consist of multiple MR steps; the initial one always runs on a relatively small data set, which fits into a single block, so the Hadoop framework does honor my task number request on the first job-- but during the later ones it does not).

My questions:

1) Can I get around this limitation programmatically? I.e., is there a way to tell the framework to only use a subset of the nodes for DFS / mapping / reducing?
2) If not, what statistics would be good to report if I can only have two data points -- a legacy single-core implementation of the algorithms and a MapReduce version running on the full cluster?

Thanks for any suggestions!
Chris
Re: scaling experiments on a static cluster?
Thanks-- that should work. I'll follow up with the cluster administrators to see if I can get this to happen. To rebalance the file storage, can I just set the replication factor using hadoop dfs?

Chris

On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning [EMAIL PROTECTED] wrote:

> What about just taking down half of the nodes and then loading your data
> into the remainder? Should take about 20 minutes each time you remove
> nodes but only a few seconds each time you add some. Remember that you
> need to reload the data each time (or rebalance it if growing the
> cluster) to get realistic numbers. My suggested procedure would be to
> take all but 2 nodes down, and then
> - run test
> - double number of nodes
> - rebalance file storage
> - lather, rinse, repeat
>
> On 3/12/08 3:28 PM, Chris Dyer [EMAIL PROTECTED] wrote:
> [original message snipped]
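[On the rebalancing step: "hadoop dfs -setrep -R <factor> <path>" changes the replication factor recursively, and one commonly suggested trick is to raise the factor and then lower it again so that some block copies land on the newly added nodes. A programmatic sketch of the same idea follows; the path and factors are illustrative, and replication is a per-file property, so a real version would walk the directory tree.]

    // Sketch: nudge HDFS into spreading data onto newly added nodes by
    // raising the replication factor of a file and then lowering it again.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Respread {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/chris/input/part-00000"); // hypothetical file
        fs.setReplication(file, (short) 6); // extra copies can land on new nodes
        // ... wait for re-replication to complete (watch 'hadoop fsck') ...
        fs.setReplication(file, (short) 3); // trim back to the normal factor
      }
    }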
Re: runtime exceptions not killing job
I've noticed this behavior as well in 0.16.0 with RuntimeExceptions in general.

Chris

On Mon, Mar 17, 2008 at 6:14 PM, Matt Kent [EMAIL PROTECTED] wrote:

> I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1,
> if a map or reduce task threw a runtime exception such as an NPE, the
> task, and ultimately the job, would fail in short order. I was running a
> job on my local 0.16.1 cluster today, and when the reduce tasks started
> throwing NPEs, the tasks just hung. Eventually they timed out and were
> killed, but is this expected behavior in 0.16.1? I'd prefer the job to
> fail quickly if NPEs are being thrown.
>
> Matt
>
> --
> Matt Kent
> Co-Founder
> Persai
> 1221 40th St #113
> Emeryville, CA 94608
> [EMAIL PROTECTED]
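[For anyone trying to reproduce this, a minimal sketch of a mapper that throws an NPE on every record -- the types are arbitrary, and the generic signatures may need adjusting for your release. On 0.14.1 this reportedly failed the job quickly; on 0.16.1 the tasks hang until the timeout kills them.]

    // Sketch: a mapper that NPEs on its first record, to reproduce the
    // hanging-task behavior described above.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NpeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String s = null;
        output.collect(new Text(s.toUpperCase()), value); // NPE here
      }
    }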
using a set of MapFiles - getting the right partition
Hi all--

I would like to have a reducer generate a MapFile so that in later processes I can look up the values associated with a few keys without processing an entire sequence file. However, if I have N reducers, I will generate N different map files, so to pick the right map file I will need to use the same partitioner as was used when partitioning the keys to reducers (the reducer I have running emits one value for each key it receives and no others). Should this be done manually, i.e., something like readers[partitioner.getPartition(...)], or is there another recommended method?

Eventually, I'm going to migrate to using HBase to store the key/value pairs (since I'd like to take advantage of HBase's ability to cache common pairs in memory for faster retrieval), but I'm interested in seeing what the performance is like just using MapFiles.

Thanks,
Chris
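[For what it's worth, the old mapred API has helpers that do exactly this routing: MapFileOutputFormat.getReaders() opens the readers for a job's output directory, and MapFileOutputFormat.getEntry() applies a partitioner to pick the right one. A sketch -- the types and path are illustrative, and signatures may differ slightly across releases:]

    // Sketch: query the N MapFiles written by N reducers, using the same
    // partitioner the writing job used so the lookup hits the right file.
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class MapFileLookup {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/chris/mapfile-output"); // hypothetical path
        MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, dir, conf);

        // Must be the same partitioner class the writing job used.
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
        Text value = new Text();
        MapFileOutputFormat.getEntry(readers, partitioner, new Text("some-key"), value);
        System.out.println(value);

        for (MapFile.Reader r : readers) r.close();
      }
    }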
Re: Getting map output as final output by setting number of reduce to zero
Setting the number of reducers to zero has the advantage that no sorting of the intermediate values will occur, which can save considerable amounts of time if the output from the mapper is large.

Chris

On Wed, Apr 30, 2008 at 6:47 PM, jkupferman [EMAIL PROTECTED] wrote:

> You actually don't need to do that, you can just use an IdentityReducer.
> The IdentityReducer simply acts as a pass-through, so whatever is written
> out from your mapper is passed through it and then written out to file.
>
> vibhooti wrote:
>> Has anyone tried setting the number of reduces to zero and getting the
>> map's output as the final output? I tried doing the same but my map
>> output does not come to the specified output path for mapred. Let me
>> know if someone has already done that. I am not able to find out where
>> my map outputs are written.
>> http://hadoop.apache.org/core/docs/r0.16.3/mapred_tutorial.html says the
>> following:
>>
>> "Reducer NONE: It is legal to set the number of reduce-tasks to zero if
>> no reduction is desired. In this case the outputs of the map-tasks go
>> directly to the FileSystem, into the output path set by setOutputPath(Path)
>> (http://hadoop.apache.org/core/docs/r0.16.3/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath%28org.apache.hadoop.fs.Path%29).
>> The framework does not sort the map-outputs before writing them out to
>> the FileSystem."
>>
>> --
>> cheers,
>> Vibhooti
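[For completeness, a minimal map-only driver might look like this -- a sketch using the 0.16-era JobConf setters; the mapper class and paths are placeholders:]

    // Sketch: a map-only job. Zero reducers means map output goes straight
    // to the output path, with no shuffle and no sort.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnly {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnly.class);
        conf.setJobName("map-only");
        conf.setMapperClass(MyMapper.class); // hypothetical mapper
        conf.setNumReduceTasks(0);           // no sort, no shuffle, no reduce
        conf.setInputPath(new Path("in"));   // 0.16-era setters
        conf.setOutputPath(new Path("out"));
        JobClient.runJob(conf);
      }
    }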
Re: Know how many records remain?
Qin's question actually raises an issue-- it seems that a close() method which does not throw IOException and which does not provide the implementer with access to the OutputCollector object makes this important piece of functionality (from a client's perspective) hard to use. Does anyone feel strongly about altering the contract so that close() throws IOException and provides the implementer with the OutputCollector object?

On Wed, Aug 20, 2008 at 1:43 PM, Qin Gao [EMAIL PROTECTED] wrote:

> Thanks Chris, that's exactly what I am trying to do. It solves my problem.
>
> On Wed, Aug 20, 2008 at 4:36 PM, Chris Dyer [EMAIL PROTECTED] wrote:
>> Qin, since I can guess what you're trying to do with this (emit a bunch
>> of expected counts at the end of EM?), you can write output during the
>> call to close(). It involves having to store the output collector object
>> as a member of the class, but this is a way to do a final flush on the
>> object before it is destroyed.
>> Chris
>>
>> On Wed, Aug 20, 2008 at 7:02 PM, Qin Gao [EMAIL PROTECTED] wrote:
>>> Hi mailing,
>>> Is there any way to know whether the mapper is processing the last
>>> record assigned to this node, or to know how many records remain to be
>>> processed on this node?
>>> Qin
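[The pattern described above looks roughly like this as a sketch; the types and the accumulated count are arbitrary. Since close() in this API can't declare IOException, the final collect() has to wrap it in a RuntimeException, which is exactly the awkwardness being complained about.]

    // Sketch: cache the OutputCollector during map() and flush final
    // records in close(), e.g. accumulated expected counts at the end of EM.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FlushingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private OutputCollector<Text, LongWritable> out;
      private long count = 0;

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        this.out = output; // remember the collector for close()
        count++;           // accumulate instead of emitting per record
      }

      public void close() {
        if (out != null) {
          try {
            out.collect(new Text("total"), new LongWritable(count));
          } catch (IOException e) {
            throw new RuntimeException(e); // close() can't throw IOException
          }
        }
      }
    }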
Serving contents of large MapFiles/SequenceFiles from memory across many machines
Hi all-

One more question. I'm looking for a lightweight way to serve data stored as key-value pairs in a series of MapFiles or SequenceFiles. HBase/Hypertable offer a very robust, powerful solution to this problem with a bunch of extra features like updates and column types, etc., that I don't need at all. But I'm wondering if there might be something ultra-lightweight that someone has come up with for a very restricted (but important!) set of use cases.

Basically, I'd like to be able to load the entire contents of a key-value map file in DFS into memory across many machines in my cluster so that I can access any of it with ultra-low latencies. I don't need updates-- I just need ultra-fast queries into a very large hash map (actually, just an array would be sufficient). This would correspond approximately to the sstable functionality that BigTable is implemented on top of, but which is also useful for many, many things directly (refer to the BigTable paper or http://www.techworld.com/storage/features/index.cfm?featureid=3183).

This question may be better targeted to the HBase community; if so, please let me know. Has anyone else tried to deal with this?

Thanks--
Chris
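[As a single-machine baseline before anything distributed, slurping an entire MapFile into an in-memory HashMap is straightforward; a sketch assuming Text keys and values:]

    // Sketch: load a whole MapFile into a HashMap for low-latency lookups.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class InMemoryMap {
      public static Map<String, String> load(String dir) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        Map<String, String> map = new HashMap<String, String>();
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) { // stream every entry into RAM
          map.put(key.toString(), value.toString());
        }
        reader.close();
        return map;
      }
    }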
Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines
Memcached looks like it would be a reasonable solution for my problem, although it's not optimal since it doesn't support an easy way of initializing itself at startup, but I can work around that. This may be wishful thinking, but does anyone have any experience using the Hadoop job/task framework to launch supporting tasks like memcached processes? Is anyone else thinking about issues of scheduling other kinds of tasks (other than mappers and reducers) in Hadoop?

Thanks--
Chris

On Fri, Sep 19, 2008 at 2:53 PM, Alex Feinberg [EMAIL PROTECTED] wrote:

> Do any of CouchDB/Cassandra/other frameworks specifically do in-memory
> serving? I haven't found any that do this explicitly. For now I've been
> using memcached for that functionality (with the usual memcached
> caveats). Ehcache may be another memcache-like solution
> (http://ehcache.sourceforge.net/), but it also provides on-disk storage
> in addition to in-memory (thus avoiding the "if a machine goes down,
> data is lost" issue of memcached).
>
> On Fri, Sep 19, 2008 at 10:54 AM, James Moore [EMAIL PROTECTED] wrote:
>> On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer [EMAIL PROTECTED] wrote:
>>> I'm looking for a lightweight way to serve data stored as key-value
>>> pairs in a series of MapFiles or SequenceFiles.
>> Might be worth taking a look at CouchDB as well. Haven't used it
>> myself, so can't comment on how it might work for what you're
>> describing.
>> --
>> James Moore | [EMAIL PROTECTED]
>> Ruby and Ruby on Rails consulting
>> blog.restphone.com
>
> --
> Alex Feinberg
> Platform Engineer, SocialMedia Networks
streaming silently failing when executing binaries with unresolved dependencies
Hi all-

I am using streaming with some C++ mappers and reducers. One of the binaries I attempted to run this evening had a dependency on a shared library that did not exist on my cluster, so it failed during execution. However, the streaming framework didn't appear to recognize this failure: the job tracker indicated that the mapper returned success, but it did not produce any results. Has anyone else encountered this issue? Should I open a JIRA issue about this? I'm using Hadoop 0.17.2.

Thanks-
Chris
Re: Can anyone recommend an inter-language data file format?
I've been using protocol buffers to serialize the data and then encoding them in base64 so that I can treat them like text. This obviously isn't optimal, but I'm assuming this is only a short-term solution which won't be necessary when non-Java clients become first-class citizens of the Hadoop world.

Chris

On Mon, Nov 3, 2008 at 2:24 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:

> Protocol buffers, thrift?
>
> On 11/3/08 4:07 AM, Steve Loughran [EMAIL PROTECTED] wrote:
>> Zhou, Yunqing wrote:
>>> An embedded database cannot handle large-scale data; it's not very
>>> efficient. I have about 1 billion records. These records should be
>>> passed through some modules. I mean a data exchange format similar to
>>> XML but more flexible and efficient.
>>
>> - JSON
>> - CSV
>> - erlang-style records (name,value,value,value)
>> - RDF-triples in non-XML representations
>>
>> For all of these, you need to test with data that includes things like
>> high unicode characters and single and double quotes, to see how well
>> they get handled.
>>
>> You can actually append with XML by not having opening/closing document
>> tags -- just stream out the entries to the tail of the file:
>>
>>   <entry>...</entry>
>>
>> To read this in an XML parser, include it inside another XML file:
>>
>>   <?xml version="1.0"?>
>>   <!DOCTYPE file [ <!ENTITY log SYSTEM "log.xml"> ]>
>>   <file>&log;</file>
>>
>> I've done this for very big files; as long as you aren't trying to load
>> it in-memory into a DOM, things should work.
>>
>> --
>> Steve Loughran  http://www.1060.org/blogxter/publish/5
>> Author: Ant in Action  http://antbook.org/
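[The round trip described above looks roughly like this -- a sketch assuming a hypothetical protoc-generated class MyRecord and Apache commons-codec (which ships in Hadoop's lib directory) for the base64 step:]

    // Sketch: protobuf -> base64 text on the way out, and back again, so
    // records can ride through text-oriented tools unharmed.
    import org.apache.commons.codec.binary.Base64;

    public class ProtoCodec {
      // MyRecord is a hypothetical class generated by protoc.
      public static String encode(MyRecord record) {
        return new String(Base64.encodeBase64(record.toByteArray()));
      }

      public static MyRecord decode(String line) throws Exception {
        return MyRecord.parseFrom(Base64.decodeBase64(line.getBytes()));
      }
    }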
Re: Counters missing from jobdetails.jsp page?
I've noticed that the counters sometimes disappear from the jobdetails page, but the job context object that is returned at the end of the job always seems to have the correct values for my counters, which has been my primary concern...

-Chris

On Sat, Nov 22, 2008 at 3:26 PM, Arthur van Hoff [EMAIL PROTECTED] wrote:

> Does the absence of any response mean that I am the only one
> experiencing this?
>
> On Thu, Nov 20, 2008 at 5:24 PM, Arthur van Hoff [EMAIL PROTECTED] wrote:
>> Hi,
>> My map job has some user-defined counters. These are displayed
>> correctly after the job is finished, but while the job is running they
>> only show up intermittently on the jobdetails.jsp page. Occasionally I
>> do see them, but often they are missing until the first map job is
>> finished. Is this a known issue? Is there a way to programmatically
>> flush the counters every now and then, so that the jobtracker page
>> displays intermediate results for long-running jobs? Thanks.
>> --
>> Arthur van Hoff - Grand Master of Alphabetical Order
>> The Ellerdale Project, Menlo Park, CA
>> [EMAIL PROTECTED], 650-283-0842
>
> --
> Arthur van Hoff - Grand Master of Alphabetical Order
> The Ellerdale Project, Menlo Park, CA
> [EMAIL PROTECTED], 650-283-0842
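[For reference, reading the final counter values from the object runJob() returns looks like this -- a sketch; the enum and the elided job setup are illustrative:]

    // Sketch: define a counter, bump it in a task via
    // reporter.incrCounter(MyCounters.RECORDS_SEEN, 1), and read the final
    // value back from the RunningJob after completion.
    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterDemo {
      public enum MyCounters { RECORDS_SEEN } // hypothetical counter

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CounterDemo.class);
        // ... set mapper/reducer classes and input/output paths here ...
        RunningJob job = JobClient.runJob(conf); // blocks until done
        Counters counters = job.getCounters();
        long seen = counters.getCounter(MyCounters.RECORDS_SEEN);
        System.out.println("records seen: " + seen);
      }
    }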
Re: Run Map-Reduce multiple times
Hey Delip-

MapReduce doesn't really have any particular support for iterative algorithms. You just have to put a loop in the control program and set the output path from the previous iteration to be the input path of the next iteration. This at least lets you control whether you keep the results of intermediate iterations or erase them...

-Chris

On Mon, Dec 8, 2008 at 1:25 AM, Delip Rao [EMAIL PROTECTED] wrote:

> Hi,
> I need to run my map-reduce routines for several iterations so that the
> output of an iteration becomes the input to the next iteration. Is there
> a standard pattern to do this instead of calling JobClient.runJob() in a
> loop?
>
> Thanks,
> Delip
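[A sketch of that control loop; the paths, fixed iteration count, and elided job setup are illustrative:]

    // Sketch: chain MapReduce iterations by feeding each round's output
    // path in as the next round's input path.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        int maxIterations = 10; // or test a convergence criterion instead
        Path input = new Path("data/iter-0");
        for (int i = 1; i <= maxIterations; i++) {
          JobConf conf = new JobConf(IterativeDriver.class);
          conf.setJobName("iteration-" + i);
          // ... set mapper/reducer classes here ...
          Path output = new Path("data/iter-" + i);
          FileInputFormat.setInputPaths(conf, input); // previous round's output
          FileOutputFormat.setOutputPath(conf, output);
          JobClient.runJob(conf);
          input = output; // optionally delete older rounds here
        }
      }
    }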