Re: Pregel

2009-06-26 Thread Owen O'Malley
On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote: my guess, as good as anybody's, is that Pregel is to large graphs what Hadoop is to large datasets. I think it is much more likely a language that allows you to easily define fixed point algorithms. I would imagine a distributed

Re: A simple performance benchmark for Hadoop, Hive and Pig

2009-06-19 Thread Owen O'Malley
On Thu, Jun 18, 2009 at 9:29 PM, Zheng Shao zs...@facebook.com wrote: Yuntao Jia, our intern this summer, did a simple performance benchmark for Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper: A Comparison of Approaches to Large-Scale Data Analysis It should be noted

Re: Practical limit on emitted map/reduce values

2009-06-18 Thread Owen O'Malley
Keys and values can be large. They are certainly capped above by Java's 2GB limit on byte arrays. More practically, you will have problems running out of memory with keys or values of 100 MB. There is no restriction that a key/value pair fits in a single hdfs block, but performance would suffer.

Re: Practical limit on emitted map/reduce values

2009-06-18 Thread Owen O'Malley
On Jun 18, 2009, at 8:45 AM, Leon Mergen wrote: Could you perhaps elaborate on that 100 MB limit ? Is that due to a limit that is caused by the Java VM heap size ? If so, could that, for example, be increased to 512MB by setting mapred.child.java.opts to '-Xmx512m' ? A couple of points:

Re: multiple file input

2009-06-18 Thread Owen O'Malley
On Jun 18, 2009, at 10:56 AM, pmg wrote: Each line from FileA gets compared with every line from FileB1, FileB2 etc. etc. FileB1, FileB2 etc. are in a different input directory In the general case, I'd define an InputFormat that takes two directories, computes the input splits for each

Re: Anyway to sort keys before Reduce function in Hadoop ?

2009-06-17 Thread Owen O'Malley
On Wed, Jun 17, 2009 at 12:26 PM, Chuck Lam chuck@gmail.com wrote: an alternative is to create a new WritableComparator and then set it in the JobConf object with the method setOutputKeyComparatorClass(). You can use IntWritable.Comparator as a start. The important part of that is to

Re: MapContext.getInputSplit() returns nothing

2009-06-16 Thread Owen O'Malley
Sorry, I forget how much isn't clear to people who are just starting. FileInputFormat creates FileSplits. The serialization is very stable and can't be changed without breaking things. The reason that pipes can't stringify it is that the string form of input splits are ambiguous (and since

Re: MapContext.getInputSplit() returns nothing

2009-06-15 Thread Owen O'Malley
*Sigh* We need Avro for input splits. That is the expected behavior. It would be great if someone wrote a C++ FileInputSplit class that took a binary string and converted it back to a filename, offset, and length. -- Owen
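A rough sketch of the decoding described above, written in Java rather than C++ so the byte handling is easy to check. It assumes a FileSplit is serialized as a variable-length byte count, the UTF-8 path bytes, and then the start offset and length as big-endian 8-byte longs. That layout and the FileSplitDecoder name are assumptions for illustration, so verify them against the Hadoop release in use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not a Hadoop or Pipes class.
final class FileSplitDecoder {
    final String path;
    final long start;
    final long length;

    private FileSplitDecoder(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    // Assumed layout: vint byte count, UTF-8 path bytes, 8-byte start, 8-byte length.
    static FileSplitDecoder decode(byte[] raw) {
        ByteBuffer in = ByteBuffer.wrap(raw); // big-endian by default, like DataOutput
        int pathBytes = (int) readVLong(in);
        byte[] utf8 = new byte[pathBytes];
        in.get(utf8);
        String path = new String(utf8, StandardCharsets.UTF_8);
        long start = in.getLong();
        long length = in.getLong();
        return new FileSplitDecoder(path, start, length);
    }

    // Zero-compressed integer: values in [-112, 127] take one byte; otherwise the
    // first byte is a marker giving the sign and the number of payload bytes.
    static long readVLong(ByteBuffer in) {
        byte first = in.get();
        if (first >= -112) {
            return first;
        }
        boolean negative = first < -120;
        int payloadBytes = negative ? -120 - first : -112 - first;
        long value = 0;
        for (int i = 0; i < payloadBytes; i++) {
            value = (value << 8) | (in.get() & 0xFF);
        }
        return negative ? ~value : value;
    }
}
```

The same three reads (length-prefixed string, long, long) should carry over to a C++ class on the Pipes side, under the same layout assumption.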

Re: speedy Google Maps driving directions like implementation

2009-06-15 Thread Owen O'Malley
On Jun 10, 2009, at 2:43 AM, Lukáš Vlček wrote: I am wondering how Google implemented the driving directions function in the Maps. More specifically how did they do it that it is so fast. I asked on Google engineer about this and all he told me is just that there are bunch of MapReduce

Re: Hadoop benchmarking

2009-06-10 Thread Owen O'Malley
Take a look at Arun's slide deck on Hadoop performance: http://bit.ly/EDCg3 It is important to get io.sort.mb large enough, the io.sort.factor should be closer to 100 instead of 10. I'd also use large block sizes to reduce the number of maps. Please see the deck for other important

Re: Image indexing/searching with Hadoop and MPI

2009-06-07 Thread Owen O'Malley
Ok I can understand your point - but I am sure that some people have been trying to use map-reduce programming model to do CFD, or any other scientific computing. Any experience in this area from the list ? I know of one project that assumes it has an entire Hadoop cluster, and generates the

Re: Fastlz coming?

2009-06-04 Thread Owen O'Malley
On Jun 4, 2009, at 11:19 AM, Kris Jirapinyo wrote: Hopefully we can have a new page on the hadoop wiki on how to use custom compression so that people won't have to go search through the threads to find the answer in the future. Yes, it would be extremely useful if you could start a wiki

Re: InputFormat for fixed-width records?

2009-05-28 Thread Owen O'Malley
On May 28, 2009, at 5:15 AM, Stuart White wrote: I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length The update to the terasort example has an InputFormat that does exactly that. The key is 10 bytes and the
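For fixed-length records the main thing an InputFormat has to get right is that every split starts and ends on a record boundary. The sketch below shows only that arithmetic in plain Java. It is not the terasort code, and the FixedWidthSplits name and its parameters are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes {offset, length} pairs aligned to record boundaries.
final class FixedWidthSplits {
    static List<long[]> compute(long fileLength, int recordLength, long targetSplitSize) {
        if (recordLength <= 0 || fileLength % recordLength != 0) {
            throw new IllegalArgumentException("file must hold a whole number of records");
        }
        // Round the target size down to a whole number of records (at least one).
        long recordsPerSplit = Math.max(1, targetSplitSize / recordLength);
        long splitBytes = recordsPerSplit * recordLength;
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitBytes) {
            splits.add(new long[] {offset, Math.min(splitBytes, fileLength - offset)});
        }
        return splits;
    }
}
```

With 100-byte records, a 1000-byte file and a 350-byte target, this yields splits of 300, 300, 300 and 100 bytes, so a reader never has to scan for a record start.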

Re: Linking against Hive in Hadoop development tree

2009-05-20 Thread Owen O'Malley
On May 20, 2009, at 3:07 AM, Tom White wrote: Why does mapred depend on hdfs? MapReduce should only depend on the FileSystem interface, shouldn't it? Yes, I should have been consistent. In terms of compile-time dependences, mapred only depends on core. -- Owen

Re: Linking against Hive in Hadoop development tree

2009-05-20 Thread Owen O'Malley
On May 15, 2009, at 3:25 PM, Aaron Kimball wrote: Yikes. So part of sqoop would wind up in one source repository, and part in another? This makes my head hurt a bit. I'd say rather that Sqoop is in Mapred and the adapter to Hive is in Hive. I'm also not convinced how that helps.

Re: Beware sun's jvm version 1.6.0_05-b13 on linux

2009-05-18 Thread Owen O'Malley
On May 18, 2009, at 3:42 AM, Steve Loughran wrote: Presumably its one of those hard-to-reproduce race conditions that only surfaces under load on a big cluster so is hard to replicate in a unit test, right? Yes. It reliably happens on a 100TB or larger sort, but almost never happens on

Beware sun's jvm version 1.6.0_05-b13 on linux

2009-05-15 Thread Owen O'Malley
We have observed that the default jvm on RedHat 5 can cause significant data corruption in the map/reduce shuffle for those using Hadoop 0.20. In particular, the guilty jvm is: java version 1.6.0_05 Java(TM) SE Runtime Environment (build 1.6.0_05-b13) Java HotSpot(TM) Server VM (build

Re: Linking against Hive in Hadoop development tree

2009-05-15 Thread Owen O'Malley
On May 15, 2009, at 2:05 PM, Aaron Kimball wrote: In either case, there's a dependency there. You need to split it so that there are no cycles in the dependency tree. In the short term it looks like: avro: core: avro hdfs: core mapred: hdfs, core hive: mapred, core pig: mapred, core

Re: Access counters from within Reducer#configure() ?

2009-05-11 Thread Owen O'Malley
On May 11, 2009, at 8:48 AM, Stuart White wrote: I'd like to be able to access my job's counters from within my Reducer's configure() method (so I can know how many records were output from my mappers). Is this possible? Thanks! The counters aren't available and if they were the

Re: Is it possible to sort intermediate values and final values?

2009-05-07 Thread Owen O'Malley
On May 7, 2009, at 12:38 PM, Foss User wrote: Where can I find this example. I was not able to find it in the src/examples directory. It is in 0.20. http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java -- Owen

Re: What User Accounts Do People Use For Team Dev?

2009-05-05 Thread Owen O'Malley
Best is to use one user for map/reduce and another for hdfs. Neither of them should be root or real users. With the setuid patch (HADOOP-4490), it is possible to run the jobs as the submitted user. Note that if you do that, you no doubt want to block certain system uids (bin, mysql, etc.) -- Owen

Re: Implementing compareTo in user-written keys where one extends the other is error prone

2009-04-30 Thread Owen O'Malley
If you use custom key types, you really should be defining a RawComparator. It will perform much much better. -- Owen
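The speedup comes from comparing keys in their serialized form instead of deserializing two objects per comparison. Here is a minimal plain-Java sketch of that idea for a key that is a single big-endian int. The method mirrors the (bytes, start, length) argument shape of a raw comparator, but the RawIntCompare class itself is hypothetical and does not implement any Hadoop interface.

```java
// Hypothetical stand-in showing a byte-level comparison of serialized int keys.
final class RawIntCompare {
    // Compares two serialized keys in place, without building key objects.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Reads a big-endian 4-byte int, the order DataOutput.writeInt produces.
    static int readInt(byte[] bytes, int offset) {
        return ((bytes[offset] & 0xFF) << 24)
                | ((bytes[offset + 1] & 0xFF) << 16)
                | ((bytes[offset + 2] & 0xFF) << 8)
                | (bytes[offset + 3] & 0xFF);
    }
}
```

A real key with several fields would decode only as many bytes as it needs to decide the order, which is where most of the saving is.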

Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

2009-04-23 Thread Owen O'Malley
On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote: Nigel, When you have time, could you release 0.18.4 that contains some of the patches that make our clusters 'stable'? Is it just the patches that have already been applied to the 18 branch? Or are there more? -- Owen

Re: Is combiner and map in same JVM?

2009-04-14 Thread Owen O'Malley
On Apr 14, 2009, at 11:10 AM, Saptarshi Guha wrote: Thanks. I am using 0.19, and to confirm, the map and combiner (in the map jvm) are run in *different* threads at the same time? And the change was actually made in 0.18. So since then, the combiner is called 0, 1, or many times on each
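Because the combiner may run zero, one, or many times, it has to be written so the reduce result does not depend on how often it ran. The plain-Java sketch below simulates that with a sum, which is safe because it is associative and commutative and its output has the same type as its input. CombinerSketch and its chunking scheme are illustrative assumptions, not how the framework schedules combines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation: applies a sum combiner a variable number of times.
final class CombinerSketch {
    // The combine step; also used as the final reduce.
    static long combine(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs `passes` rounds of combining over chunks of the values, then reduces.
    static long reduceAfterCombining(List<Long> values, int chunk, int passes) {
        List<Long> current = new ArrayList<>(values);
        for (int p = 0; p < passes; p++) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += chunk) {
                next.add(combine(current.subList(i, Math.min(i + chunk, current.size()))));
            }
            current = next;
        }
        return combine(current);
    }
}
```

Something like an average would fail this test unless the combiner carries (sum, count) pairs instead of partial averages.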

Re: Multithreaded Reducer

2009-04-13 Thread Owen O'Malley
On Apr 10, 2009, at 11:12 AM, Sagar Naik wrote: Hi, I would like to implement a Multi-threaded reducer. As per my understanding , the system does not have one coz we expect the output to be sorted. However, in my case I dont need the output sorted. You'd probably want to make a blocking

Re: HDFS read/write speeds, and read optimization

2009-04-10 Thread Owen O'Malley
On Thu, Apr 9, 2009 at 9:30 PM, Brian Bockelman bbock...@cse.unl.edu wrote: On Apr 9, 2009, at 5:45 PM, Stas Oskin wrote: Hi. I have 2 questions about HDFS performance: 1) How fast are the read and write operations over network, in Mbps per second? Depends.  What hardware?  How much

Re: HDFS read/write speeds, and read optimization

2009-04-10 Thread Owen O'Malley
On Apr 10, 2009, at 9:07 AM, Stas Oskin wrote: From your experience, how RAM hungry is HDFS? Meaning, would an additional 4GB of RAM (to make it 8GB as in your case) really change anything? I don't think the 4 to 8GB would matter much for HDFS. For Map/Reduce, it is very important. -- Owen

Re: How many people is using Hadoop Streaming ?

2009-04-07 Thread Owen O'Malley
On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote: Owen, Is binary streaming actually readily available? https://issues.apache.org/jira/browse/HADOOP-1722

Re: problem running on a cluster of mixed hardware due to Incompatible buildVersion of JobTracker adn TaskTracker

2009-04-06 Thread Owen O'Malley
This was discussed over on: https://issues.apache.org/jira/browse/HADOOP-5203 Doug uploaded a patch, but no one seems to be working on it. -- Owen

Re: connecting two clusters

2009-04-06 Thread Owen O'Malley
On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote: Hey all I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions? You should use hadoop distcp. It is a map/reduce program that

Re: best practice: mapred.local vs dfs drives

2009-04-05 Thread Owen O'Malley
We always share the drives. -- Owen On Apr 5, 2009, at 0:52, zsongbo zson...@gmail.com wrote: I usually set mapred.local.dir to share the disk space with DFS, since some mapreduce job need big temp space. On Fri, Apr 3, 2009 at 8:36 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote:

Re: Users are not properly authenticated in Hadoop

2009-04-05 Thread Owen O'Malley
On Apr 5, 2009, at 3:30 AM, Foss User wrote: However, I see that anyone can easily impersonate fossist Yes, it is easy to work around the security in Hadoop. It is only intended to prevent accidents, such as the time a student accidentally wiped out the entire class' home directories.

Re: How many people is using Hadoop Streaming ?

2009-04-03 Thread Owen O'Malley
On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote: Has anyone benchmark the performance difference of using Hadoop ? 1) Java vs C++ 2) Java vs Streaming Yes, a while ago. When I tested it using sort, Java and C++ were roughly equal and streaming was 10-20% slower. Most of the cost with

Re: How many people is using Hadoop Streaming ?

2009-04-03 Thread Owen O'Malley
On Apr 3, 2009, at 10:35 AM, Ricky Ho wrote: I assume that the key is still sorted, right ? That mean I will get all the key1, valueX entries before getting any of the key2 valueY entries and key2 is always bigger than key1. Yes. -- Owen

Re: Hadoop/HDFS for scientific simulation output data analysis

2009-04-03 Thread Owen O'Malley
On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote: By the way, what is the largest size---in terms of total bytes, number of files, and number of nodes---in your applications? Thanks. The largest Hadoop application that has been documented is the Yahoo Webmap. 10,000 cores 500 TB shuffle 300

Re: datanode but not tasktracker

2009-04-01 Thread Owen O'Malley
On Apr 1, 2009, at 12:58 AM, Sandhya E wrote: Hi When the host is listed in slaves file both DataNode and TaskTracker are started on that host. Is there a way in which we can configure a node to be datanode and not tasktracker. If you use hadoop-daemons.sh, you can pass a host list. So do:

Re: Reduce doesn't start until map finishes

2009-03-24 Thread Owen O'Malley
What happened is that we added fast start (HADOOP-3136), which launches more than one task per heartbeat. Previously, if your maps didn't take very long, they finished before the heartbeat and the task tracker was assigned a new map task. A side effect was that no reduce tasks were launched until

ApacheCon EU 2009 this week

2009-03-23 Thread Owen O'Malley
includes the Hadoop track. The video of the keynote and lunch talks are free, but there is a charge for the Hadoop track. The Hadoop track this year includes: * Opening Keynote - Data Management in the Cloud - Raghu Ramakrishnan * Introduction to Hadoop - Owen O'Malley * Hadoop Map

Re: Sorting data numerically

2009-03-23 Thread Owen O'Malley
On Mar 23, 2009, at 9:15 AM, tim robertson wrote: If Akira was to write his/her own Mappers, using types like IntWritable would result in it being numerically sorted right? Yes. Or they can use the KeyFieldBasedComparator. I think if you put the following in your job conf, you'll get the

Re: Coordination between Mapper tasks

2009-03-19 Thread Owen O'Malley
On Mar 18, 2009, at 10:26 AM, Stuart White wrote: I'd like to implement some coordination between Mapper tasks running on the same node. I was thinking of using ZooKeeper to provide this coordination. This is a very bad idea in the general case. It can be made to work, but you need to

Re: Numbers of mappers and reducers

2009-03-17 Thread Owen O'Malley
On Mar 17, 2009, at 9:18 AM, Richa Khandelwal wrote: I was going through FAQs on Hadoop to optimize the performance of map/reduce. There is a suggestion to set the number of reducers to a prime number closest to the number of nodes and number of mappers a prime number closest to several

Re: Release batched-up output records at end-of-job?

2009-03-17 Thread Owen O'Malley
On Mar 17, 2009, at 5:53 AM, Stuart White wrote: Yeah, I thought of that, but I was concerned that, even if it did work, if it wasn't guaranteed behavior, that it might stop working in a future release. I'll go ahead and give that a try. No, it will continue to work. It is required for both

Re: Massive discrepancies in job's bytes written/read

2009-03-17 Thread Owen O'Malley
On Mar 17, 2009, at 7:44 PM, Bryan Duxbury wrote: There is no compression in the mix for us, so that's not the culprit. I'd be sort of willing to believe that spilling and sorting play a role in this, but, wow, over 10x read and write? That seems like a big problem. It happened recently

Re: Temporary files for mapppers and reducers

2009-03-15 Thread Owen O'Malley
Just use the current working directory. Each task gets a unique directory that is erased when the task finishes. -- Owen On Mar 15, 2009, at 16:08, Mark Kerzner markkerz...@gmail.com wrote: Hi, what would be the best place to put temporary files for a reducer? I believe that since

Re: null value output from map...

2009-03-13 Thread Owen O'Malley
On Mar 13, 2009, at 3:56 PM, Richa Khandelwal wrote: You can initialize IntWritable with an empty constructor. IntWritable i=new IntWritable(); NullWritable is better for that application than IntWritable. It doesn't consume any space when serialized. *smile* -- Owen

Re: Does hadoop-default.xml + hadoop-site.xml matter for whole cluster or each node?

2009-03-09 Thread Owen O'Malley
On Mar 7, 2009, at 10:56 PM, pavelkolo...@gmail.com wrote: Does hadoop-default.xml + hadoop-site.xml of master host matter for whole Job or they matter for each node independently? Please never modify hadoop-default. That is for the system defaults. Please use hadoop-site for your

Re: question about released version id

2009-03-09 Thread Owen O'Malley
On Mar 2, 2009, at 11:46 PM, 鞠適存 wrote: I wonder how to make the hadoop version number. Each 0.18, 0.19 and 0.20 have their own branch. The first release on each branch is 0.X.0, and then 0.X.1 and so on. New features are only put into trunk and only important bug fixes are put into the

Re: Reducer goes past 100% complete?

2009-03-09 Thread Owen O'Malley
On Mar 9, 2009, at 1:00 PM, james warren wrote: Speculative execution has existed far before 0.19.x, but AFAIK the 100% issue has appeared (at least with greater frequency) since 0.19.0 came out. Are you saying there are changes in how task progress is being tracked? In the past, it

Re: MapReduce jobs with expensive initialization

2009-03-02 Thread Owen O'Malley
On Mar 2, 2009, at 3:03 AM, Tom White wrote: I believe the static singleton approach outlined by Scott will work since the map classes are in a single classloader (but I haven't actually tried this). Even easier, you should just be able to do it with static initialization in the Mapper
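A minimal sketch of the static-initialization approach, assuming (as the thread does) that all map instances in a task JVM share one class loader. LookupMapper, its TABLE, and the LOADS counter are hypothetical names; the squares table stands in for whatever is expensive to build.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a Mapper with a costly shared lookup table.
final class LookupMapper {
    // Counts how often the expensive setup ran; here only to make the effect visible.
    static final AtomicInteger LOADS = new AtomicInteger();
    private static final int[] TABLE;

    static {
        // Runs once when the class is initialized, not once per instance.
        LOADS.incrementAndGet();
        TABLE = new int[1024];
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = i * i;
        }
    }

    int map(int key) {
        return TABLE[Math.floorMod(key, TABLE.length)];
    }
}
```

Creating any number of LookupMapper instances leaves LOADS at 1, which is the property the singleton approach was after, with less code.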

Re: Shuffle phase

2009-02-26 Thread Owen O'Malley
On Feb 26, 2009, at 2:03 PM, Nathan Marz wrote: Do the reducers batch copy map outputs from a machine? That is, if a machine M has 15 intermediate map outputs destined for machine R, will machine R copy the intermediate outputs one at a time or all at once? Currently, one at a time. In

Re: Batching key/value pairs to map

2009-02-23 Thread Owen O'Malley
On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan ji...@indeed.com wrote: part of my map/reduce process could be greatly sped up by mapping key/value pairs in batches instead of mapping them one by one. Can I safely hang onto my OutputCollector and Reporter from calls to map? Yes. You can even use

Re: Batching key/value pairs to map

2009-02-23 Thread Owen O'Malley
On Feb 23, 2009, at 2:19 PM, Jimmy Wan wrote: I'm not sure if this is possible, but it would certainly be nice to either: 1) pass the OutputCollector and Reporter to the close() method. 2) Provide accessors to the OutputCollector and the Reporter. If you look at the 0.20 branch, which

Re: Is there a way to tell whether you're in a map task or a reduce task?

2009-02-10 Thread Owen O'Malley
On Feb 10, 2009, at 5:20 PM, Matei Zaharia wrote: I'd like to write a combiner that shares a lot of code with a reducer, except that the reducer updates an external database at the end. The right way to do this is to either do the update in the output format or do something like: class

Re: only one reducer running in a hadoop cluster

2009-02-09 Thread Owen O'Malley
On Feb 7, 2009, at 11:52 PM, Nick Cen wrote: Hi, I hava a hadoop cluster with 4 pc. And I wanna to integrate hadoop and lucene together, so i copy some of the source code from nutch's Indexer class, but when i run my job, i found that there is only 1 reducer running on 1 pc, so the

Re: can't read the SequenceFile correctly

2009-02-09 Thread Owen O'Malley
On Feb 6, 2009, at 8:52 AM, Bhupesh Bansal wrote: Hey Tom, I also got burned by this. Why does BytesWritable.getBytes() return non-valid bytes? Or should we add a BytesWritable.getValidBytes() kind of function? It does it because continually resizing the array to the valid length
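The pitfall is easier to see with a small model. The GrowableBytes class below is a plain-Java stand-in, not BytesWritable itself: it keeps a backing array that only grows, so the array can be longer than the valid data and can hold stale bytes from an earlier, larger value. The growth factor is illustrative. The safe pattern is to pair the backing array with the valid length and copy only that prefix.

```java
import java.util.Arrays;

// Hypothetical model of a reusable byte container with spare capacity.
final class GrowableBytes {
    private byte[] buffer = new byte[0];
    private int length = 0;

    void set(byte[] src) {
        if (src.length > buffer.length) {
            buffer = new byte[src.length * 3 / 2]; // grow with slack to avoid frequent resizing
        }
        System.arraycopy(src, 0, buffer, 0, src.length);
        length = src.length;
    }

    // Returns the whole backing array, including bytes past the valid length.
    byte[] getBytes() {
        return buffer;
    }

    int getLength() {
        return length;
    }

    // Copies only the valid prefix; this is what callers usually want.
    byte[] copyValidBytes() {
        return Arrays.copyOf(buffer, length);
    }
}
```

After setting a 10-byte value and then a 4-byte value, getBytes() still returns a 15-byte array whose tail holds leftovers from the first value, while copyValidBytes() returns exactly 4 bytes.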

Re: TaskTrackers being double counted after restart job recovery

2009-02-09 Thread Owen O'Malley
There is a bug that when we restart the TaskTrackers they get counted twice. The problem is the name is generated from the hostname and port number. When TaskTrackers restart they get a new port number and get counted again. The problem goes away when the old TaskTrackers time out in 10 minutes or

Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Owen O'Malley
On Jan 22, 2009, at 7:25 AM, Brian MacKay wrote: Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in. The keys to reduce are *always* sorted. If the default order is not correct, you can change the compare function.

Re: Maven repo for Hadoop

2009-01-19 Thread Owen O'Malley
On Jan 17, 2009, at 5:53 PM, Chanwit Kaewkasi wrote: I would like to integrate Hadoop to my project using Ivy. Is there any maven repository containing Hadoop jars that I can point my configuration to? Not yet, but soon. We recently introduced ivy into Hadoop, so I believe we'll upload the

Re: Merging reducer outputs into a single part-00000 file

2009-01-14 Thread Owen O'Malley
On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote: Jim, As far as I know, there is no operation done after Reducer. Correct, other than output promotion, which moves the output file to the final filename. But if you are a little experienced, you already know these. Ordered list means one

Re: General questions about Map-Reduce

2009-01-11 Thread Owen O'Malley
On Jan 11, 2009, at 5:50 AM, tienduc_dinh wrote: - Does Map-Reduce support parallel writing/reading ? Yes. The maps all read in parallel with each other and the reduces all write in parallel with each other. - What happens after the Map-Reduce operation ? The OutputFormat usually

Re: General questions about Map-Reduce

2009-01-11 Thread Owen O'Malley
On Jan 11, 2009, at 7:05 PM, tienduc_dinh wrote: Is there any article which describes it ? Please read the map/reduce tutorial: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html -- Owen

Re: OutputCollector to print only keys

2009-01-08 Thread Owen O'Malley
On Jan 8, 2009, at 4:13 AM, Sandhya E wrote: Does Hadoop 0.18 have an outputcollector that can print out only the keys and supress the values. In our case we need only the keys, and hence we do output.collect(key, blank) and then reprocess the entire output file to remove trailing tabs. If

Re: correct pattern for using setOutputValueGroupingComparator?

2009-01-05 Thread Owen O'Malley
This is exactly what the setOutputValueGroupingComparator is for. Take a look at HADOOP-4545, for an example using the secondary sort. If you are using trunk or 0.20, look at src/examples/org/apache/hadoop/examples/SecondarySort.java. The checked in example uses the new map/reduce api that

Re: Map input records(on JobTracker website) increasing and decreasing

2009-01-05 Thread Owen O'Malley
Yeah, I've seen this too. I just filed HADOOP-4983. -- Owen

Re: Output.collect uses toString for custom key class. Is it possible to change this?

2008-12-17 Thread Owen O'Malley
On Dec 16, 2008, at 9:30 AM, David Coe wrote: Since the SequenceFileOutputFormat doesn't like nulls, how would I use NullWritable? Obviously output.collect(key, null) isn't working. If I change it to output.collect(key, new IntWritable()) I get the result I want (plus an int that I

Re: Output.collect uses toString for custom key class. Is it possible to change this?

2008-12-16 Thread Owen O'Malley
On Dec 16, 2008, at 8:28 AM, David Coe wrote: Is there a way to output the write() instead? Use SequenceFileOutputFormat. It writes binary files using the write. The reverse is SequenceFileInputFormat, which reads the sequence files using readFields. -- Owen

Re: [video] visualization of the hadoop code history

2008-12-16 Thread Owen O'Malley
On Dec 16, 2008, at 12:36 AM, Stefan Groschupf wrote: It is a neat way of visualizing who is behind the Hadoop source code and how the project code base grew over the years. It is interesting, but it would be more interesting to track the authors of the patch rather than the committer.

Re: Output.collect uses toString for custom key class. Is it possible to change this?

2008-12-16 Thread Owen O'Malley
On Dec 16, 2008, at 9:14 AM, David Coe wrote: Does the SequenceFileOutputFormat work with NullWritable as the value? Yes.

Re: concurrent modification of input files

2008-12-15 Thread Owen O'Malley
On Dec 15, 2008, at 8:08 AM, Sandhya E wrote: I have a scenario where, while a map/reduce is working on a file, the input file may get deleted and copied with a new version of the file. All my files are compressed and hence each file is worked on by a single node. I tried simulating the

Re: Simple data transformations in Hadoop?

2008-12-14 Thread Owen O'Malley
On Dec 13, 2008, at 6:32 PM, Stuart White wrote: First question: would Hadoop be an appropriate tool for something like this? Very What is the best way to model this type of work in Hadoop? As a map-only job with number of reduces = 0. I'm thinking my mappers will accept a Long key

Re: Suggestions of proper usage of key parameter ?

2008-12-14 Thread Owen O'Malley
On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote: Yes, I am referring to the key INPUT INTO the map() function and the key EMITTED FROM the reduce() function. Can someone explain why do we need a key in these cases and what is the proper use of it ? It was a design choice and could have

Re: Reset hadoop servers

2008-12-09 Thread Owen O'Malley
On Dec 9, 2008, at 2:22 AM, Devaraj Das wrote: I know that the tasktracker/jobtracker doesn't have any command for re-reading the configuration. There is built-in support for restart/shut-down but those are via external scripts that internally do a kill/start. Actually, HADOOP-4348 does

Re: End of block/file for Map

2008-12-09 Thread Owen O'Malley
On Dec 9, 2008, at 11:35 AM, Songting Chen wrote: Is there a way for the Map process to know it's the end of records? I need to flush some additional data at the end of the Map process, but wondering where I should put that code. The close() method is called at the end of the map. --

Re: End of block/file for Map

2008-12-09 Thread Owen O'Malley
On Dec 9, 2008, at 7:34 PM, Aaron Kimball wrote: That's true, but you should be aware that you no longer have an OutputCollector available in the close() method. True, but in practice you can keep a handle to it from the map method and it will work perfectly. This is required for both
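The pattern of keeping a handle to the collector and flushing in close() can be sketched without any Hadoop classes. In the sketch below, Collector is a hypothetical one-method stand-in for OutputCollector, and BatchingMapper and its batch size are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the framework's output collector.
interface Collector {
    void collect(String key, int value);
}

// Buffers records in map() and emits any leftovers from close().
final class BatchingMapper {
    private final List<String> pending = new ArrayList<>();
    private final int batchSize;
    private Collector out; // handle saved from the most recent map() call

    BatchingMapper(int batchSize) {
        this.batchSize = batchSize;
    }

    void map(String record, Collector collector) {
        out = collector;
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Called once after the last record; out is null if map() never ran.
    void close() {
        if (out != null && !pending.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        for (String record : pending) {
            out.collect(record, record.length());
        }
        pending.clear();
    }
}
```

The null check in close() matters for tasks that receive no input at all.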

Re: How are records with equal key sorted in hadoop-0.18?

2008-12-08 Thread Owen O'Malley
On Dec 8, 2008, at 8:02 AM, Christian Kunz wrote: Comparing hadoop-default.xml of hadoop-0.18 with hadoop-0.17, didn't map.sort.class change from org.apache.hadoop.mapred.MergeSorter to org.apache.hadoop.util.QuickSort? Yes, but the quick sort is only used in the mapper. The reducer already

Re: getting Configuration object in mapper

2008-12-05 Thread Owen O'Malley
On Dec 4, 2008, at 9:19 PM, abhinit wrote: I have set some variable using the JobConf object. jobConf.set(Operator, operator) etc. How can I get an instance of Configuration object/ JobConf object inside a map method so that I can retrieve these variables. In your Mapper class,

Re: Hadoop Internal Architecture writeup

2008-11-28 Thread Owen O'Malley
On Nov 28, 2008, at 9:45 AM, Ricky Ho wrote: [Ricky] What exactly does the job.split contains ? I assume it contains the specification for each split (but not its data), such as what is the corresponding file and the byte range within that file. Correct ? Yes [Ricky] I am curious

Re: How to let Reducer know on which partition it is working

2008-11-26 Thread Owen O'Malley
On Nov 26, 2008, at 4:35 AM, Jürgen Broß wrote: I'm not sure how to let a Reducer know in its configure() method which partition it will get from the Partitioner, From: http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Task+JVM+Reuse look for mapred.task.partition, which is

Re: Block placement in HDFS

2008-11-24 Thread Owen O'Malley
On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote: Hi Dennis, I don't think that is possible to do. No, it is not possible. The block placement is determined by HDFS internally (which is local, rack local and off rack). Actually, it was changed in 0.17 or so to be node-local, off-rack,

Re: A question about the combiner, reducer and the Output value class: can they be different?

2008-11-16 Thread Owen O'Malley
On Sun, Nov 16, 2008 at 2:18 PM, Saptarshi Guha [EMAIL PROTECTED] wrote: Hello, If my understanding is correct, the combiner will read in values for a given key, process it, output it and then **all** values for a key are given to the reducer. Not quite. The flow looks like

Re: Writing to multiple output channels

2008-11-13 Thread Owen O'Malley
On Nov 13, 2008, at 9:38 PM, Sunil Jagadish wrote: I have a mapper which needs to write output into two different kinds of files (output.collect()). For my purpose, I do not need any reducers. Set the number of reduces to 0. Open a sequence file in the mapper and write the second stream to

Re: reduce more than one way

2008-11-09 Thread Owen O'Malley
On Nov 7, 2008, at 12:35 PM, Elia Mazzawi wrote: I have 2 hadoop map/reduce programs that have the same map, but different reduce methods. Can I run them in a way so that the map only happens once? If the input to the reduces is the same, you can put the two reduces together and use

Re: Hadoop Design Question

2008-11-06 Thread Owen O'Malley
On Nov 6, 2008, at 11:29 AM, Ricky Ho wrote: Disk I/O overhead == - The output of a Map task is written to a local disk and then later on upload to the Reduce task. While this enable a simple recovery strategy when the map task failed, it incur additional disk I/O

Re: More Hadoop Design Question

2008-11-06 Thread Owen O'Malley
On Nov 6, 2008, at 2:30 PM, Ricky Ho wrote: Hmmm, sounds like the combiner is invoked after the map() process completed for the file split. No. The data path is complex, but the combiner is called when the map outputs are being spilled to disk. So roughly, the map will output key, value

Re: key,Values at Reducer are local or present in DFS

2008-11-05 Thread Owen O'Malley
On Nov 5, 2008, at 2:21 PM, Tarandeep Singh wrote: I want to know whether the key,values received by a particular reducer at a node are stored locally on that node or are stored on DFS (and hence replicated over cluster according to replication factor set by user) Map outputs (and reduce

Re: Any Way to Skip Mapping?

2008-11-02 Thread Owen O'Malley
If you don't need a sort, which is what it sounds like, Hadoop supports that by turning off the reduce. That is done by setting the number of reduces to 0. This typically is much faster than if you need the sort. It also sounds like you may need/want the library that does map-side joins.

ApacheCon US 2008

2008-10-31 Thread Owen O'Malley
of talks about Hadoop o Introduction to Hadoop by Owen O'Malley o A Tour of Apache Hadoop by Tom White o Programming Hadoop Map/Reduce by Arun Murthy o Hadoop at Yahoo! by Eric Baldeschwieler o Hadoop Futures Panel o Using Hadoop

Re: Mapper settings...

2008-10-31 Thread Owen O'Malley
On Oct 31, 2008, at 3:15 PM, Bhupesh Bansal wrote: Why do we need these setters in JobConf ?? jobConf.setMapOutputKeyClass(String.class); jobConf.setMapOutputValueClass(LongWritable.class); Just historical. The Mapper and Reducer interfaces didn't use to be generic. (Hadoop used to run

Re: Is there a unique ID associated with each task?

2008-10-30 Thread Owen O'Malley
On Oct 30, 2008, at 9:03 AM, Joel Welling wrote: I'm writing a Hadoop Pipes application, and I need to generate a bunch of integers that are unique across all map tasks. If each map task has a unique integer ID, I can make sure my integers are unique by including that integer ID. I
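One way to realize the idea in the question, assuming each task can obtain a small non-negative task number from the framework: put the task number in the high 32 bits and a task-local counter in the low 32 bits. TaskLocalIds is a hypothetical name, and this is only a sketch of the packing, not an answer taken from the thread.

```java
// Hypothetical generator: ids are unique across tasks as long as task numbers are.
final class TaskLocalIds {
    private final int taskId;
    private long next = 0;

    TaskLocalIds(int taskId) {
        if (taskId < 0) {
            throw new IllegalArgumentException("task id must be non-negative");
        }
        this.taskId = taskId;
    }

    // High 32 bits: task number. Low 32 bits: per-task sequence number.
    long nextId() {
        if (next > 0xFFFFFFFFL) {
            throw new IllegalStateException("per-task id space exhausted");
        }
        return ((long) taskId << 32) | next++;
    }
}
```

Note that a re-executed or speculative attempt of the same task would produce the same ids, which is usually what is wanted since only one attempt's output is kept.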

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-29 Thread Owen O'Malley
I uploaded a patch that does a secondary sort. Take a look at: https://issues.apache.org/jira/browse/HADOOP-4545 It reads input with two numbers per a line. Such as: -1 -4 -3 23 5 10 -1 -2 -1 300 -1 10 4 1 4 2 4 10 4 -1 4 -10 10 20 10 30 10 25 And produces output like (with 2 reduces):
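To make the idea concrete without the Hadoop API, here is a plain-Java sketch of what a secondary sort achieves: records are ordered by the full (first, second) pair, but grouped by the first number only, so each group sees its values already sorted. SecondarySortSketch is a hypothetical name and this is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of sort-by-pair, group-by-first.
final class SecondarySortSketch {
    static Map<Integer, List<Integer>> sortAndGroup(int[][] pairs) {
        List<int[]> sorted = new ArrayList<>();
        for (int[] pair : pairs) {
            sorted.add(pair);
        }
        // Plays the role of the sort comparator: orders by first, then second.
        sorted.sort(Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]));
        // Plays the role of the grouping comparator: only the first number decides the group.
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] pair : sorted) {
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }
}
```

In the real framework the same effect comes from a composite key, a partitioner on the first field, and the two comparators, with no in-memory list of all records.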

Re: Understanding file splits

2008-10-28 Thread Owen O'Malley
On Oct 28, 2008, at 6:29 AM, Malcolm Matalka wrote: I am trying to write an InputFormat and I am having some trouble understanding how my data is being broken up. My input is a previous hadoop job and I have added code to my record reader to print out the FileSplit's start and end position,

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread Owen O'Malley
On Oct 28, 2008, at 7:53 AM, David M. Coe wrote: My mapper is Mapper<LongWritable, Text, IntWritable, IntWritable> and my reducer is the identity. I configure the program using: conf.setOutputKeyClass(IntWritable.class); conf.setOutputValueClass(IntWritable.class);

Re: help: InputFormat problem ?

2008-10-27 Thread Owen O'Malley
If your application that you are drawing from is doing some sort of web crawl that connects to lots of random servers, you may want to use MultiThreadedMapRunner and do the remote connections in the map. If you are just connecting to a small set of servers for each map, you should put it

Re: Help: How to change number of mappers in Hadoop streaming?

2008-10-26 Thread Owen O'Malley
On Oct 26, 2008, at 8:38 AM, chaitanya krishna wrote: I forgot to mention that although the number of map tasks are set in the code as I mentioned before, the actual number of map tasks are not essentially the same number but is very close to this number. The number of reduces is precisely

Re: Any one successfully ran the c++ pipes example?

2008-10-17 Thread Owen O'Malley
On Oct 16, 2008, at 1:40 PM, Zhengguo 'Mike' SUN wrote: I was trying to write an application using the pipes api. But it seemed the serialization part is not working correctly. More specifically, I can't deserialize a string from an StringInStream constructed from context.getInputSplit().

Re: Distributed cache Design

2008-10-16 Thread Owen O'Malley
On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote: We at Linkedin are trying to run some Large Graph Analysis problems on Hadoop. The fastest way to run would be to keep a copy of whole Graph in RAM at all mappers. (Graph size is about 8G in RAM) we have cluster of 8- cores machine with 8G

Re: Distributed cache Design

2008-10-16 Thread Owen O'Malley
On Oct 16, 2008, at 3:09 PM, Bhupesh Bansal wrote: Lets say I want to implement a DFS in my graph. I am not able to picturise implementing it with doing graph in pieces without putting a depth bound to (3-4). Lets say we have 200M (4GB) edges to start with Start by watching the lecture

Re: Seeking examples beyond word count

2008-10-14 Thread Owen O'Malley
On Oct 14, 2008, at 8:37 PM, Bert Schmidt wrote: I'm trying to think of how I might use it yet all the examples I find are variations of word count. Look in the src/examples directory. PiEstimator - estimates the value of pi using distributed brute force Pentomino - solves Pentomino

Re: Sharing an object across mappers

2008-10-03 Thread Owen O'Malley
On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote: Briefly going through the DistributedCache information, it seems to be a way to distribute files to mappers/reducers. Sure, but it handles the distribution problem for you. One still needs to read the contents into each map/reduce task
