Using different file systems for Map Reduce job input and output
Hi, I wanted to know whether it is possible to use different file systems for Map Reduce job input and output, i.e. have an M/R job's input reside on one file system and its output be written to another (e.g. input on HDFS and output on KFS, input on HDFS and output on the local file system, or anything else). Is it possible to specify that through FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath(), or by any other mechanism? Thanks, Naama
Re: Using different file systems for Map Reduce job input and output
Hi Naama, Yes, it is possible to specify this using the APIs FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath(): you can give a fully qualified FileSystem URI for each path. Thanks, Amareshwari Naama Kraus wrote: Hi, I wanted to know whether it is possible to use different file systems for Map Reduce job input and output, i.e. have an M/R job's input reside on one file system and its output be written to another (e.g. input on HDFS and output on KFS, input on HDFS and output on the local file system, or anything else). Is it possible to specify that through FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath(), or by any other mechanism? Thanks, Naama
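As a rough sketch of what this looks like with the old org.apache.hadoop.mapred API used in this thread (the namenode address and paths below are placeholders, not taken from the messages):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CrossFileSystemJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CrossFileSystemJob.class);
    conf.setJobName("cross-filesystem-example");
    // The scheme and authority of each Path URI select the FileSystem,
    // so the input and the output can live on different file systems.
    FileInputFormat.setInputPaths(conf, new Path("hdfs://namenode:9000/user/naama/input"));
    FileOutputFormat.setOutputPath(conf, new Path("file:///tmp/job-output"));
    // ... set mapper/reducer classes, then submit with JobClient.runJob(conf) ...
  }
}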
A scalable gallery with hadoop?
Hi, I am a new user. I need to develop a huge media gallery. My requirements in a nutshell are high scalability in the number of users, reliability of users' data (photos, videos, docs, etc. uploaded by users) and an internal search engine. I've seen some posts about the applicability of Hadoop to web apps, mainly with negative responses (e.g.: http://www.nabble.com/Hadoop-also-applicable-in-a-web-app-environment--to18836915.html#a18836915 ). I know about Amazon AWS and I think that maybe S3+EC2 could be a solution (even if I still don't know how to integrate the search engine), but I would like to keep open the option of using my own hardware in the future. I've seen that Hadoop provides an API for using S3 as the file system, so I thought the Hadoop framework was the right layer on which to base my app (allowing me to store data locally or on S3 without changing the application layer). Now that I've configured a clustered environment with Hadoop, I'm still not so sure about that. I am a perfect newbie on this topic, so any suggestion is welcome! (Maybe the right framework could be HBase?) Thanks, Alberto.
Re: Using different file systems for Map Reduce job input and output
Thanks! Naama On Mon, Oct 6, 2008 at 10:27 AM, Amareshwari Sriramadasu [EMAIL PROTECTED] wrote: Hi Naama, Yes, it is possible to specify this using the APIs FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath(): you can give a fully qualified FileSystem URI for each path. Thanks, Amareshwari Naama Kraus wrote: Hi, I wanted to know whether it is possible to use different file systems for Map Reduce job input and output, i.e. have an M/R job's input reside on one file system and its output be written to another (e.g. input on HDFS and output on KFS, input on HDFS and output on the local file system, or anything else). Is it possible to specify that through FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath(), or by any other mechanism? Thanks, Naama
Re: Hadoop and security.
Dmitry Pushkarev wrote: Dear hadoop users, I'm lucky to work in an academic environment where information security is not the question. However, I'm sure that most hadoop users aren't. Here is the question: how secure is hadoop? (or let's say, how foolproof) Right now hadoop is about as secure as NFS. When deployed onto private datacentres with good physical security and well-set-up networks, you can control who gets at the data. Without that, you are sharing your state with anyone who can issue HTTP and hadoop IPC requests. Here is the answer: http://www.google.com/search?client=opera&rls=en&q=Hadoop+Map/Reduce+Administration&sourceid=opera&ie=utf-8&oe=utf-8 Not quite. See also http://www.google.com/search?q=axis+happiness+page ; pages that we add for the benefit of the ops team end up sneaking out onto the big net. What we're seeing here is an open hadoop cluster, where anyone capable of installing hadoop and changing his username to webcrawl can use their cluster and read their data, even though the firewall is perfectly installed and ports like ssh are filtered to outsiders. After you've played enough with the data, you can observe that you can submit jobs as well, and these jobs can execute shell commands. Which is very, very sad. In my view, this significantly limits distributed hadoop applications, where part of your cluster may reside on EC2 or another distant datacenter, since you always need to have certain ports open to an array of IP addresses (if your instances are dynamic), which isn't acceptable if anyone from that IP range can connect to your cluster. Well, maybe that's a fault of EC2's architecture, in which a deployment request doesn't include a declaration of the network configuration? Can we propose that the developers introduce some basic user management and access controls to help hadoop take one step further towards a production-quality system? Being an open source project, you can do more than propose: you can help build some basic user management and access controls. As to production quality: it is ready for production, albeit in locked-down datacentres, which is the primary deployment infrastructure of many of the active developers. As in most community-contributed open source projects, if you have specific needs beyond what the active developers need, you end up implementing them yourself. The big issue with security is that it is all or nothing. Right now it is blatantly insecure, so you should not be surprised that anyone has access to your files. To actually lock it down, you would need to authenticate and possibly encrypt all communications; this adds a lot of overhead, which is why it will be avoided in the big datacentres. You also need to go to a lot of effort to make sure it is secure across the board, with no JSP pages providing accidental privilege escalation and no API calls letting you see stuff you shouldn't. It's not like a normal feature defect where you can say don't do that; it's not so easy to validate using functional tests that exercise the expected uses of the code. This is why securing an application is such a hard thing to do. -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: Hadoop and security.
You bring up some valid points. This would be a great topic for a white paper. The first line of defense should be to apply inbound and outbound iptables rules. Only source IPs that have a direct need to interact with the cluster should be allowed to. The same is true with web access: only a range of source IPs should be allowed to access the web interfaces. You can do this through SSH tunneling. Preventing exec commands can be handled with the security manager and the sandbox. I was thinking of only allowing the execution of signed jars myself, but I never implemented it.
Re: architecture diagram
Can you explain "The location of these splits is semi-arbitrary"? What if the example was... AAA|BBB|CCC|DDD EEE|FFF|GGG|HHH Does this mean the split might fall within CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter? Terrence A. Pietrondi --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Sunday, October 5, 2008, 9:26 PM Let's say you have one very large input file of the form: A|B|C|D E|F|G|H ... |1|2|3|4 This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine. You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting. Does this clear up your confusion? Let me know if you'd like me to clarify more. Alex On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was... "The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row." Are you saying that my row segments might not actually be the entire row, so I will get a bad key index? If so, how would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row? Thanks. Terrence A. Pietrondi --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Friday, October 3, 2008, 7:14 PM The approach that you've described does not fit well into the MapReduce paradigm. You may want to consider randomizing your data in a different way. Unfortunately some things can't be solved well with MapReduce, and I think this is one of them. Can someone else say more? Alex On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Sorry for the confusion, I did make some typos. My example should have looked like... A|B|C D|E|G pivots to... D|A E|B G|C Then for each row, shuffle the contents around randomly... D|A B|E C|G Then pivot the data back... A|E|G D|B|C The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in the input. If you look at the initial input and compare to the final output, you'll see that during the shuffling, B and E are swapped, and G and C are swapped, while A and D were shuffled back into their originating positions in the column. Once again, sorry for the typos and confusion. Terrence A.
Pietrondi --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Friday, October 3, 2008, 11:01 AM Can you confirm that the example you've presented is accurate? I think you may have made some typos, because the letter G isn't in the final result; I also think your first pivot accidentally swapped C and G. I'm having a hard time understanding what you want to do, because it seems like your operations differ from your example. With that said, at first glance, this problem may not fit well in to the MapReduce paradigm. The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. There may be a way to do this by collecting, in your map step, key = column number (0, 1, 2, etc) and value = (A, B, C, etc), though you
Re: Hadoop and security.
On 10/6/08 6:39 AM, Steve Loughran [EMAIL PROTECTED] wrote: Edward Capriolo wrote: You bring up some valid points. This would be a great topic for a white paper. -a wiki page would be a start too I was thinking about doing "Deploying Hadoop Securely" for an ApacheCon EU talk, as by that time some of the basic Kerberos stuff should be in place... This whole conversation is starting to reinforce the idea
Re: Hadoop and security.
Edward Capriolo wrote: You bring up some valid points. This would be a great topic for a white paper. -a wiki page would be a start too The first line of defense should be to apply inbound and outbound iptables rules. Only source IPs that have a direct need to interact with the cluster should be allowed to. The same is true with the web access. Only a range of source IP's should be allowed to access the web interfaces. You can do this through SSH tunneling. Preventing exec commands can be handled with the security manager and the sandbox. I was thinking to only allow the execution of signed jars myself but I never implemented it. -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: Hadoop and security.
Allen Wittenauer wrote: On 10/6/08 6:39 AM, Steve Loughran [EMAIL PROTECTED] wrote: Edward Capriolo wrote: You bring up some valid points. This would be a great topic for a white paper. -a wiki page would be a start too I was thinking about doing "Deploying Hadoop Securely" for an ApacheCon EU talk, as by that time some of the basic Kerberos stuff should be in place... This whole conversation is starting to reinforce the idea -Start with an ApacheCon US fast feather talk on the current state of affairs: a hadoop cluster is as secure as a farm of machines running NFS. Just to let people know where things stand. For the EU one, I will probably put in for one on deploying/managing using our toolset. I'm also thinking of a talk, "datamining a city", that looks at what data sources a city is already instrumented with, if only you could get at them. The hard part is getting at them. I have my eye on our local speed camera/red light cameras, which track the speed and time of day of every vehicle passing; you could build up a map of traffic velocity, where the jams are, when, etc. Getting the machine-parsed number plate data would be even more interesting, but governments tend to restrict that data to state security, rather than useful things like analysing and predicting traffic flow.
Re: architecture diagram
As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below: http://wiki.apache.org/hadoop/HowManyMapsAndReduces http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html Alex On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Can you explain The location of these splits is semi-arbitrary? What if the example was... AAA|BBB|CCC|DDD EEE|FFF|GGG|HHH Does this mean the split might be between CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter? Terrence A. Pietrondi --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Sunday, October 5, 2008, 9:26 PM Let's say you have one very large input file of the form: A|B|C|D E|F|G|H ... |1|2|3|4 This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine. You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting. Does this clear up your confusion? Let me know if you'd like me to clarify more. Alex On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was... The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. Are you saying that my row segments might not actually be the entire row so I will get a bad key index? If so, would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row? Thanks. Terrence A. Pietrondi --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Friday, October 3, 2008, 7:14 PM The approach that you've described does not fit well in to the MapReduce paradigm. You may want to consider randomizing your data in a different way. Unfortunately some things can't be solved well with MapReduce, and I think this is one of them. Can someone else say more? Alex On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Sorry for the confusion, I did make some typos. My example should have looked like... A|B|C D|E|G pivots too... D|A E|B G|C Then for each row, shuffle the contents around randomly... D|A B|E C|G Then pivot the data back... A|E|G D|B|C The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in input. 
If you look at the initial input and compare to the final output, you'll see that during the shuffling, B and E are swapped, and G and C are swapped, while A and D were shuffled back into their originating positions in the column. Once again, sorry for the typos and confusion. Terrence A. Pietrondi --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Friday, October 3, 2008, 11:01 AM Can you confirm that the example you've presented is accurate? I think you may have made some typos, because the letter G isn't in the final result; I also think your first pivot accidentally swapped C and G. I'm having a hard time understanding what you want to do, because it seems like your operations differ from your example. With
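For what it's worth, here is a minimal, illustrative mapper along the lines Alex suggests (key = column index, value = cell contents), written against the old mapred API; it is only a sketch of the idea, not the PivotMapper linked later in this thread:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ColumnMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  // Each value is one complete line; TextInputFormat never hands a mapper
  // a partial line, so splitting on the delimiter here is safe.
  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\\|");
    for (int col = 0; col < fields.length; col++) {
      // key = column index, value = cell contents; the reducer then sees
      // every value of a given column and can shuffle and re-emit it.
      output.collect(new IntWritable(col), new Text(fields[col]));
    }
  }
}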
nagios to monitor hadoop datanodes!
Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. Some of you have experience here; do you have any approach or advice I could use? So far I've only been playing with the JSP pages that hadoop has integrated into it, so I'm not sure whether it would be a good idea for Nagios to request monitoring info from these JSPs? Thanks in advance! -- Gerardo
Searching Lucene Index built using Hadoop
I'm trying to index a large dataset using Hadoop+Lucene. I used the example under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way to search the index that was successfully built. I tried copying the indexes over to one machine and merging them using IndexWriter.addIndexesNoOptimize(). I would like to hear your input on the best way to index and search large datasets. Thanks, Saranath
Re: How to GET row name/column name in HBase using JAVA API
Please use the HBase mailing list for HBase-related questions: http://hadoop.apache.org/hbase/mailing_lists.html#Users Regarding your question, have you looked at http://wiki.apache.org/hadoop/Hbase/HbaseRest ? J-D On Mon, Oct 6, 2008 at 12:05 AM, Trinh Tuan Cuong [EMAIL PROTECTED] wrote: Hello guys, I'm trying to use Java to manipulate HBase using its API. At the moment I'm trying to do some simple CRUD activities with it, and from here a little problem arises. What I am trying to do is: given an existing table, display/get the existing row names (in order to update and manipulate data) and column names (as String? or bytes?). I could see that HStoreKey in the HBase API allows getting the row values, but it won't specify which table(s) it is working on. So if any of you could please show me how to get the rows and column family names of a given table, I would be thankful. Best Regards, Trịnh Tuấn Cường Luvina Software Company Website : www.luvina.net Address : 1001 Hoang Quoc Viet Street Email : [EMAIL PROTECTED],[EMAIL PROTECTED] Mobile : 097 4574 457
Re: Searching Lucene Index built using Hadoop
Hi, you might find http://katta.wiki.sourceforge.net/ interesting. If you have any katta-related questions please use the katta mailing list. Stefan ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 6, 2008, at 10:26 AM, Saranath wrote: I'm trying to index a large dataset using Hadoop+Lucene. I used the example under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way to search the index that was successfully built. I tried copying the indexes over to one machine and merging them using IndexWriter.addIndexesNoOptimize(). I would like to hear your input on the best way to index and search large datasets. Thanks, Saranath
Weird problem running wordcount example from within Eclipse
Hi all, I have a weird problem regarding running the wordcount example from Eclipse. I was able to run the wordcount example from the command line like: $ ...MyHadoop/bin/hadoop jar ../MyHadoop/hadoop-xx-examples.jar wordcount myinputdir myoutputdir However, if I try to run the wordcount program from Eclipse (supplying the same two args: myinputdir myoutputdir) I get the following error message: Exception in thread "main" java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: file at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304) at org.apache.hadoop.examples.WordCount.run(WordCount.java:149) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.WordCount.main(WordCount.java:161) Caused by: java.io.IOException: No FileSystem for scheme: file at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1277) at org.apache.hadoop.fs.FileSystem.access$1(FileSystem.java:1273) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352) ... 5 more It seems that from within Eclipse the program does not know how to interpret myinputdir as a hadoop path? Can someone please tell me how I can fix this? Thanks a lot!!!
Questions regarding adding resource via Configuration
Hi, I have a configuration file (similar to hadoop-site.xml) and I want to include this file as a resource while running Map-Reduce jobs. Similarly, I want to add a jar file that is required by Mappers and Reducers. ToolRunner.run(...) allows me to do this easily; my question is, can I add these files permanently? I am running a lot of different Map-Reduce jobs in a loop, so is there a way I can add these files once so that subsequent jobs need not add them? Also, if I don't implement Tool and don't use ToolRunner.run, but call Configuration.addResource(), will the parameters defined in my configuration file be available in mappers and reducers? Thanks, Taran
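For illustration only (the file name my-conf.xml and the path below are placeholders), this is roughly how a custom resource is usually attached to a job's configuration; properties set this way end up in the submitted JobConf and are therefore visible to mappers and reducers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ConfResourceExample {
  public static JobConf buildJobConf() {
    Configuration conf = new Configuration();
    conf.addResource("my-conf.xml");                       // looked up on the classpath
    conf.addResource(new Path("/etc/myapp/my-conf.xml"));  // or given as an explicit path
    // Later Hadoop releases also offer the static
    // Configuration.addDefaultResource("my-conf.xml"), which registers the
    // file once for every Configuration created afterwards in this JVM.
    return new JobConf(conf, ConfResourceExample.class);
  }
}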
Re: Turning off FileSystem statistics during MapReduce
We see this on Maps and only on incrementBytesRead (not on incrementBytesWritten). It is on HDFS where we are seeing the time spent. It seems that this is because incrementBytesRead is called every time a record is read, while incrementBytesWritten is only called when a buffer is spilled. We would benefit a lot from being able to turn this off. On Oct 3, 2008, at 6:19 PM, Arun C Murthy wrote: Nathan, On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote: Hello, We have been doing some profiling of our MapReduce jobs, and we are seeing about 20% of the time of our jobs is spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off? This is interesting... could you provide more details? Are you seeing this on Maps or Reduces? Which FileSystem exhibited this i.e. HDFS or LocalFS? Any details on about your application? To answer your original question - no, there isn't a way to disable this. However, if this turns out to be a systemic problem we definitely should consider having an option to allow users to switch it off. So any information you can provide helps - thanks! Arun Thanks, Nathan Marz Rapleaf
Add jar file via -libjars - giving errors
Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However when I tried the libjars option to add jar files - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar I got this error- java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder Can someone please tell me what needs to be fixed here ? Thanks, Taran
Re: Add jar file via -libjars - giving errors
Hi Tarandeep, the -libjars option does not add the jar on the client side. There is an open jira for that (I don't remember which one)... You have to add the jar to HADOOP_CLASSPATH on the client side so that it gets picked up on the client side as well. mahadev On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However, when I tried the -libjars option to add jar files - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar - I got this error: java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder Can someone please tell me what needs to be fixed here? Thanks, Taran
Re: architecture diagram
So looking at the following mapper... http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of: You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. Terrence A. Pietrondi --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Monday, October 6, 2008, 12:55 PM As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below: http://wiki.apache.org/hadoop/HowManyMapsAndReduces http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html Alex On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Can you explain The location of these splits is semi-arbitrary? What if the example was... AAA|BBB|CCC|DDD EEE|FFF|GGG|HHH Does this mean the split might be between CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter? Terrence A. Pietrondi --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Sunday, October 5, 2008, 9:26 PM Let's say you have one very large input file of the form: A|B|C|D E|F|G|H ... |1|2|3|4 This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine. You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting. Does this clear up your confusion? Let me know if you'd like me to clarify more. Alex On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was... The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. Are you saying that my row segments might not actually be the entire row so I will get a bad key index? If so, would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row? Thanks. Terrence A. 
Pietrondi --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Friday, October 3, 2008, 7:14 PM The approach that you've described does not fit well in to the MapReduce paradigm. You may want to consider randomizing your data in a different way. Unfortunately some things can't be solved well with MapReduce, and I think this is one of them. Can someone else say more? Alex On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Sorry for the confusion, I did make some typos. My example should have looked like... A|B|C D|E|G pivots too... D|A E|B G|C Then for each row, shuffle the contents around randomly... D|A B|E C|G Then pivot the data back... A|E|G D|B|C The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in input. If you look at the initial input and compare
Re: Add jar file via -libjars - giving errors
Thanks, Mahadev, for the reply. So that means I have to copy my jar file into the $HADOOP_HOME/lib folder on all slave machines like before. One more question - I am adding a conf file (just like hadoop-site.xml) via the -conf option and I am able to query parameters in mappers/reducers. But is there a way I can query the parameters in my job driver class - public class jobDriver extends Configured { someMethod( ) { ToolRunner.run( new MyJob( ), commandLineArgs); // I want to query parameters present in my conf file here } } public class MyJob extends Configured implements Tool { } Thanks, Taran On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED] wrote: Hi Tarandeep, the -libjars option does not add the jar on the client side. There is an open jira for that (I don't remember which one)... You have to add the jar to HADOOP_CLASSPATH on the client side so that it gets picked up on the client side as well. mahadev On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However, when I tried the -libjars option to add jar files - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar - I got this error: java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder Can someone please tell me what needs to be fixed here? Thanks, Taran
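As a hedged sketch (my own illustration, not from this thread, and my.custom.param is a made-up property name): when the job class itself extends Configured and implements Tool, the file passed with -conf is parsed by ToolRunner/GenericOptionsParser and handed to the Tool, so its parameters can be read through getConf() in driver code, before any task runs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // The Configuration handed in by ToolRunner already contains the
    // properties from the file passed with -conf.
    Configuration conf = getConf();
    String value = conf.get("my.custom.param");   // hypothetical property name
    // ... build the JobConf from conf and submit the job ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Passing an explicit Configuration lets the surrounding driver share
    // the same object and read the parameters after the run as well.
    Configuration conf = new Configuration();
    System.exit(ToolRunner.run(conf, new MyJob(), args));
  }
}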
Re: architecture diagram
This mapper does follow my original suggestion, though I'm not familiar with how the delimiter works in this example. Anyone else? Alex On Mon, Oct 6, 2008 at 2:55 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: So looking at the following mapper... http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of: You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. Terrence A. Pietrondi --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Monday, October 6, 2008, 12:55 PM As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below: http://wiki.apache.org/hadoop/HowManyMapsAndReduces http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html Alex On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Can you explain The location of these splits is semi-arbitrary? What if the example was... AAA|BBB|CCC|DDD EEE|FFF|GGG|HHH Does this mean the split might be between CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter? Terrence A. Pietrondi --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Sunday, October 5, 2008, 9:26 PM Let's say you have one very large input file of the form: A|B|C|D E|F|G|H ... |1|2|3|4 This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper. Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine. You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting. Does this clear up your confusion? Let me know if you'd like me to clarify more. Alex On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was... The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. Are you saying that my row segments might not actually be the entire row so I will get a bad key index? If so, would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row? Thanks. Terrence A. 
Pietrondi --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Friday, October 3, 2008, 7:14 PM The approach that you've described does not fit well in to the MapReduce paradigm. You may want to consider randomizing your data in a different way. Unfortunately some things can't be solved well with MapReduce, and I think this is one of them. Can someone else say more? Alex On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Sorry for the confusion, I did make some typos. My example should have looked like... A|B|C D|E|G pivots too... D|A E|B G|C Then for each row, shuffle the contents
Re: is 12 minutes ok for dfs chown -R on 45000 files ?
On 10/2/08 11:33 PM, Frank Singleton [EMAIL PROTECTED] wrote: Just to clarify, this is for when the chown will modify all files owner attributes eg: toggle all from frank:frank to hadoop:hadoop (see below) When we converted from 0.15 to 0.16, we chown'ed all of our files. The local dev team wrote the code in https://issues.apache.org/jira/browse/HADOOP-3052 , but it wasn't committed as a standard feature as they viewed this as a one off. :( Needless to say, running a large chown as a MR job should be significantly faster.
Why is super user privilege required for FS statistics?
Hey all, I noticed something really funny about fuse-dfs: because super-user privileges are required to run the getStats function in FSNamesystem.java, my file systems show up as having 16 exabytes total and 0 bytes free. If I mount fuse-dfs as root, then I get the correct results from df. Is this an oversight? Is there any good reason I shouldn't file a bug to make the getStats command (which only returns the used, free, and total space in the file system) not require superuser privilege? Brian
Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?
Dears, Sorry, I did not mean to cross post. But the previous article was accidentally posted to the HBase user list. I would like to bring it back to the Hadoop user list since it is confusing me a lot and it is mainly MapReduce related. Currently running version hadoop-0.18.1 on 25 nodes. Map and Reduce Task Capacity is 92. When I do this in my MapReduce program: = SAMPLE CODE = JobConf jconf = new JobConf(conf, TestTask.class); jconf.setJobName("my.test.TestTask"); jconf.setOutputKeyClass(Text.class); jconf.setOutputValueClass(Text.class); jconf.setOutputFormat(TextOutputFormat.class); jconf.setMapperClass(MyMapper.class); jconf.setCombinerClass(MyReducer.class); jconf.setReducerClass(MyReducer.class); jconf.setInputFormat(TextInputFormat.class); try { jconf.setNumMapTasks(5); jconf.setNumReduceTasks(3); JobClient.runJob(jconf); } catch (Exception e) { e.printStackTrace(); } = = = When I run the job, I always get 300 mappers and 1 reducer from the JobTracker webpage running on the default port 50030. No matter how I configure the numbers in the methods setNumMapTasks and setNumReduceTasks, I get the same result. Does anyone know why this is happening? Am I missing something or misunderstanding something in the picture? =( Here's a reference to the parameters we have overridden in hadoop-site.xml. === <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>4</value> </property> The other parameters are defaults from hadoop-default.xml. Any idea how this is happening? Any inputs are appreciated. Thanks, -Andy
Re: architecture diagram
I think the 'split' Alex talked about is the mapreduce system's action, while the 'split' you described is your mapper's action. I guess that your map/reduce application uses *TextInputFormat* to read your input file. Your input file will first be split into a few file splits; each split is essentially a (filename, offset, length) triple. What Alex said about 'The location of these splits is semi-arbitrary' means that a file split's offset within your input file is semi-arbitrary. Am I right, Alex? Then *TextInputFormat* translates these file splits into a sequence of lines, where the offset is treated as the key and the line is treated as the value. Because the file splits are cut by offset, some lines in your file may be cut across different file splits. The *LineRecordReader* used by *TextInputFormat* skips the partial line at the start of a split, to make sure that every mapper gets complete lines one by one. For example, a file like this: AAA BBB CCC DDD EEE FFF GGG HHH AAA BBB CCC DDD may be split into two file splits (assuming there are two mappers). split one: AAA BBB CCC split two: DDD EEE FFF GGG HHH AAA BBB CCC DDD Take split two as an example: TextInputFormat will use LineRecordReader to translate split two into a sequence of (offset, line) pairs, and it will skip the leading partial line DDD, so the sequence will be: offset1, EEE FFF GGG HHH offset2, AAA BBB CCC DDD Then what to do with the lines depends on your job. On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: So looking at the following mapper... http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of: You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. Terrence A. Pietrondi --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Monday, October 6, 2008, 12:55 PM As far as I know, splits will never be made within a line, only between rows. To answer your question about ways to control the splits, see below: http://wiki.apache.org/hadoop/HowManyMapsAndReduces http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html Alex On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: Can you explain "The location of these splits is semi-arbitrary"? What if the example was... AAA|BBB|CCC|DDD EEE|FFF|GGG|HHH Does this mean the split might fall within CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter? Terrence A. Pietrondi --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote: From: Alex Loddengaard [EMAIL PROTECTED] Subject: Re: architecture diagram To: core-user@hadoop.apache.org Date: Sunday, October 5, 2008, 9:26 PM Let's say you have one very large input file of the form: A|B|C|D E|F|G|H ... |1|2|3|4 This input file will be broken up into N pieces, where N is the number of mappers that run. The location of these splits is semi-arbitrary. This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper.
Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine. You may want to play with the following idea: collect key = column_number and value = column_contents in your map step. This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting. Does this clear up your confusion? Let me know if you'd like me to clarify more. Alex On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote: I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was... The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. Are you saying that my row segments might not actually be the entire row so I will get
Re: Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?
The number of mappers depends on your InputFormat. The default InputFormat tries to treat every file block of a file as an InputSplit, and you will get the same number of mappers as the number of your input splits. Try configuring mapred.min.split.size to reduce the number of mappers if you want to. And I don't know why your reducer count is just one. Anyone know? On Tue, Oct 7, 2008 at 9:06 AM, Andy Li [EMAIL PROTECTED] wrote: Dears, Sorry, I did not mean to cross post. But the previous article was accidentally posted to the HBase user list. I would like to bring it back to the Hadoop user list since it is confusing me a lot and it is mainly MapReduce related. Currently running version hadoop-0.18.1 on 25 nodes. Map and Reduce Task Capacity is 92. When I do this in my MapReduce program: = SAMPLE CODE = JobConf jconf = new JobConf(conf, TestTask.class); jconf.setJobName("my.test.TestTask"); jconf.setOutputKeyClass(Text.class); jconf.setOutputValueClass(Text.class); jconf.setOutputFormat(TextOutputFormat.class); jconf.setMapperClass(MyMapper.class); jconf.setCombinerClass(MyReducer.class); jconf.setReducerClass(MyReducer.class); jconf.setInputFormat(TextInputFormat.class); try { jconf.setNumMapTasks(5); jconf.setNumReduceTasks(3); JobClient.runJob(jconf); } catch (Exception e) { e.printStackTrace(); } = = = When I run the job, I always get 300 mappers and 1 reducer from the JobTracker webpage running on the default port 50030. No matter how I configure the numbers in the methods setNumMapTasks and setNumReduceTasks, I get the same result. Does anyone know why this is happening? Am I missing something or misunderstanding something in the picture? =( Here's a reference to the parameters we have overridden in hadoop-site.xml. === <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>4</value> </property> The other parameters are defaults from hadoop-default.xml. Any idea how this is happening? Any inputs are appreciated. Thanks, -Andy
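For reference, a small sketch of how these knobs normally behave (this is a general note and does not by itself explain the single reducer Andy is seeing; the 256 MB split size is an arbitrary example value):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountTuning {
  public static void tune(JobConf jconf) {
    // The reduce count is normally taken literally by the framework.
    jconf.setNumReduceTasks(3);
    // The map count is only a hint: with TextInputFormat the real number
    // of maps follows the number of input splits, roughly one per block.
    jconf.setNumMapTasks(5);
    // Raising the minimum split size makes each split cover more data,
    // which lowers the number of map tasks.
    jconf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
  }
}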
Re: Add jar file via -libjars - giving errors
Adding your jar files in the $HADOOP_HOME/lib folder works, but you would have to restart all your tasktrackers to have your jar files loaded. If you repackage your map-reduce jar file (e.g. hadoop-0.18.0-examples.jar) with your jar file and run your job with the newly repackaged jar file, it would work, too. On Tue, Oct 7, 2008 at 6:55 AM, Tarandeep Singh [EMAIL PROTECTED] wrote: thanks Mahadev for the reply. So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on all slave machines like before. One more question- I am adding a conf file (just like HADOOP_SITE.xml) via -conf option and I am able to query parameters in mapper/reducers. But is there a way I can query the parameters in my job driver class - public class jobDriver extends Configured { someMethod( ) { ToolRunner.run( new MyJob( ), commandLineArgs); // I want to query parameters present in my conf file here } } public class MyJob extends Configured implements Tool { } Thanks, Taran On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED] wrote: HI Tarandeep, the libjars options does not add the jar on the client side. Their is an open jira for that ( id ont remember which one)... Oyu have to add the jar to the HADOOP_CLASSPATH on the client side so that it gets picked up on the client side as well. mahadev On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However when I tried the libjars option to add jar files - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar I got this error- java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder Can someone please tell me what needs to be fixed here ? Thanks, Taran
Re: nagios to monitor hadoop datanodes!
The easiest approach I can think of is to write a simple Nagios plugin that checks if the datanode JVM process is alive. Or you may write a Nagios plugin that checks for error or warning messages in the datanode logs. (I am sure you can find quite a few log-checking Nagios plugins on nagiosplugin.org.) If you are unsure of how to write a Nagios plugin, I suggest you read the article Leverage Nagios with plug-ins you write http://www.ibm.com/developerworks/aix/library/au-nagios/ as it has good explanations and examples of how to write a Nagios plugin. Or if you've got time to burn, you might want to read the Nagios documentation, too. Let me know if you need help on this matter. /Taeho On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez [EMAIL PROTECTED] wrote: Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. Some of you have experience here; do you have any approach or advice I could use? So far I've only been playing with the JSP pages that hadoop has integrated into it, so I'm not sure whether it would be a good idea for Nagios to request monitoring info from these JSPs? Thanks in advance! -- Gerardo
Re: Add jar file via -libjars - giving errors
You can just add the jar to the env variable HADOOP_CLASSPATH. If using bash, just do this: export HADOOP_CLASSPATH=<path to your classpath> on the client, and then use the -libjars option. mahadev On 10/6/08 2:55 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Thanks, Mahadev, for the reply. So that means I have to copy my jar file into the $HADOOP_HOME/lib folder on all slave machines like before. One more question - I am adding a conf file (just like hadoop-site.xml) via the -conf option and I am able to query parameters in mappers/reducers. But is there a way I can query the parameters in my job driver class - public class jobDriver extends Configured { someMethod( ) { ToolRunner.run( new MyJob( ), commandLineArgs); // I want to query parameters present in my conf file here } } public class MyJob extends Configured implements Tool { } Thanks, Taran On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED] wrote: Hi Tarandeep, the -libjars option does not add the jar on the client side. There is an open jira for that (I don't remember which one)... You have to add the jar to HADOOP_CLASSPATH on the client side so that it gets picked up on the client side as well. mahadev On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However, when I tried the -libjars option to add jar files - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar - I got this error: java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder Can someone please tell me what needs to be fixed here? Thanks, Taran
Re: Add jar file via -libjars - giving errors
Hi, From 0.19 on, the jars added using -libjars are available on the client classpath as well; this was fixed by HADOOP-3570. Thanks, Amareshwari Mahadev Konar wrote: Hi Tarandeep, the -libjars option does not add the jar on the client side. There is an open jira for that (I don't remember which one)... You have to add the jar to HADOOP_CLASSPATH on the client side so that it gets picked up on the client side as well. mahadev On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However, when I tried the -libjars option to add jar files - $HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar - I got this error: java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder Can someone please tell me what needs to be fixed here? Thanks, Taran