FW: NullPointer during startup debugging DN
Cross-posting to common-user, since there's been little activity on hdfs-user these last days. Evert

Hi list, I'm having trouble starting up a DN (0.20.2) with Kerberos authentication and SSL enabled - I'm getting a NullPointerException during startup and the daemon exits. It's a bit hard to debug this problem; I have no idea how I'd do this from within Eclipse, for example. Can I do this with jdb? Some background (also see relevant snippets from debug output, hdfs-site, core-site and ssl-server attached):

12/02/21 11:24:11 DEBUG security.Krb5AndCertsSslSocketConnector: useKerb = false, useCerts = true
jetty.ssl.password :
jetty.ssl.keypassword :
12/02/21 11:24:11 INFO mortbay.log: jetty-6.1.26.cloudera.1
12/02/21 11:24:11 INFO mortbay.log: Started SelectChannelConnector@p-worker02.alley.sara.nl:1006
12/02/21 11:24:11 DEBUG security.Krb5AndCertsSslSocketConnector: Creating new KrbServerSocket for: p-worker02.alley.sara.nl
12/02/21 11:24:11 WARN mortbay.log: java.lang.NullPointerException
12/02/21 11:24:11 WARN mortbay.log: failed krb5andcertssslsocketconnec...@p-worker02.alley.sara.nl:50475: java.io.IOException: !JsseListener: java.lang.NullPointerException

I'm a bit surprised that useKerb is set to false in Krb5AndCertsSslSocketConnector, but looking at org.apache.hadoop.hdfs.server.datanode.DataNode I see that it calls this.infoServer.addSslListener(secInfoSocAddr, sslConf, needClientAuth), which sets needKrbAuth to false. I guess this is on purpose then, and the values in core-site are just ignored here. The NullPointerException seems to occur in org.apache.hadoop.security.Krb5AndCertsSslSocketConnector, in newServerSocket(). useCerts is true, and I see a call to (SSLServerSocket)super.newServerSocket(host, port, backlog). I think things might go wrong there. This is probably due to some missing configuration. I have not set dfs.https.need.client.auth; it defaults to false, so I have not included an ssl-client.xml configuration file or key- and truststores for clients. I wouldn't mind doing that, but I'm not sure why I would need a keystore for clients - I guess the framework checks for DN-to-user mappings; it shouldn't need user keys. Any help is much appreciated! Evert
Re: Should splittable Gzip be a core hadoop feature?
Let's play devil's advocate for a second? Why? Snappy exists. The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively. Next question is how large are the gzip files in the first place? I don't disagree, I just want to have a solid argument in favor of it...

Sent from a remote device. Please excuse any typos... Mike Segel

On Feb 28, 2012, at 9:50 AM, Niels Basjes ni...@basjes.nl wrote: Hi, Some time ago I had an idea and implemented it. Normally you can only run a single gzipped input file through a single mapper and thus only on a single CPU core. What I created makes it possible to process a Gzipped file in such a way that it can run on several mappers in parallel. I've put the javadoc I created on my homepage so you can read more about the details. http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec Now the question that was raised by one of the people reviewing this code was: Should this implementation be part of the core Hadoop feature set? The main reason that was given is that this needs a bit more understanding on what is happening and as such cannot be enabled by default. I would like to hear from the Hadoop Core/Map reduce users what you think. Should this be - a part of the default Hadoop feature set so that anyone can simply enable it by setting the right configuration? - a separate library? - a nice idea I had fun building but that no one needs? - ... ? -- Best regards / Met vriendelijke groeten, Niels Basjes
Hadoop fair scheduler doubt: allocate jobs to pool
How can I set the fair scheduler such that all jobs submitted from a particular user group go to a pool with the group name?

I have set up the fair scheduler and I have two users: A and B (both belonging to the user group hadoop). When these users submit hadoop jobs, the jobs from A go to a pool named A and the jobs from B go to a pool named B. I want them to go to a pool with their group name, so I tried adding the following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

But instead the jobs now go to the default pool. I want the jobs submitted by A and B to go to the pool named hadoop. How do I do that? Also, how can I explicitly set a job to any specified pool? I have set the allocation file (fair-scheduler.xml) like this:

<allocations>
  <pool name="hadoop">
    <minMaps>1</minMaps>
    <minReduces>1</minReduces>
    <maxMaps>3</maxMaps>
    <maxReduces>3</maxReduces>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>

Any help is greatly appreciated. Thanks, Austin
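[Editor's note on the explicit-pool question: with the 0.20-era fair scheduler, a job can usually name its pool directly through the mapred.fairscheduler.pool property, which bypasses the poolnameproperty lookup. A minimal sketch, assuming that property is supported by the scheduler build in use:

    import org.apache.hadoop.mapred.JobConf;

    public class PoolAssignment {
      public static void configure(JobConf conf) {
        // Route this job to the "hadoop" pool explicitly, regardless of
        // which job property mapred.fairscheduler.poolnameproperty names.
        conf.set("mapred.fairscheduler.pool", "hadoop");

        // If poolnameproperty is set to group.name instead, the scheduler
        // reads that property off the submitted JobConf; jobs fall into the
        // default pool when the property is absent, so it can be set per job:
        // conf.set("group.name", "hadoop");
      }
    }

The default-pool behavior described above is consistent with group.name simply not being present on the submitted job configuration.]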
TaskTracker without datanode
Hi All, I was wondering (network traffic considerations aside): is it possible to run a TaskTracker without a DataNode? I was hoping to test this method as a means of scaling processing power temporarily. Are there better approaches? I don't (currently) need the additional storage that a DataNode provides, and I would like to add additional processing power from time to time without worrying about data loss and decommissioning DataNodes. Thanks, Dan.
RE: TaskTracker without datanode
Forgot to mention that I am using Hadoop 0.20.2

From: Daniel Baptista
Sent: 29 February 2012 14:44
To: common-user@hadoop.apache.org
Subject: TaskTracker without datanode

Hi All, I was wondering (network traffic considerations aside): is it possible to run a TaskTracker without a DataNode? I was hoping to test this method as a means of scaling processing power temporarily. Are there better approaches? I don't (currently) need the additional storage that a DataNode provides, and I would like to add additional processing power from time to time without worrying about data loss and decommissioning DataNodes. Thanks, Dan.
Re: TaskTracker without datanode
Yes, this is fine to do. TTs are not dependent on co-located DNs, but only benefit if they are.

On Wed, Feb 29, 2012 at 8:14 PM, Daniel Baptista daniel.bapti...@performgroup.com wrote: Forgot to mention that I am using Hadoop 0.20.2

From: Daniel Baptista
Sent: 29 February 2012 14:44
To: common-user@hadoop.apache.org
Subject: TaskTracker without datanode

Hi All, I was wondering (network traffic considerations aside): is it possible to run a TaskTracker without a DataNode? I was hoping to test this method as a means of scaling processing power temporarily. Are there better approaches? I don't (currently) need the additional storage that a DataNode provides, and I would like to add additional processing power from time to time without worrying about data loss and decommissioning DataNodes. Thanks, Dan.

-- Harsh J
Re: Should splittable Gzip be a core hadoop feature?
Mike, Snappy is cool and all, but I was not overly impressed with it. GZ zips much better than Snappy. Last time I checked, for our log files gzip took them down from 100MB to 40MB, while snappy compressed them from 100MB to 55MB. That was only with sequence files, but still, that is pretty significant if you are considering long term storage. Also, since the delta in the file size was large, I could not actually make the argument that using sequence+snappy was faster than sequence+gz. Sure, the MB/s rate was probably faster, but since I had more MB I was not able to prove snappy a win. I use it for intermediate compression only. Actually the raw formats (gz vs sequence gz) are significantly smaller and faster than their sequence file counterparts. Believe it or not, I commonly use mapred.output.compress without sequence files. As long as I have a larger number of reducers, I do not have to worry about files being splittable, because N mappers process N files. Generally I am happy with, say, N mappers, because the input formats tend to create more mappers than I want, which makes more overhead and more shuffle. But being able to generate split info for them and processing them would be good as well. I remember that was a hot thing to do with lzo back in the day. The pain of once-overing the gz files to generate the split info is detracting, but it is nice to know it is there if you want it. Edward

On Wed, Feb 29, 2012 at 7:10 AM, Michel Segel michael_se...@hotmail.com wrote: Let's play devil's advocate for a second? Why? Snappy exists. The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively. Next question is how large are the gzip files in the first place? I don't disagree, I just want to have a solid argument in favor of it... Sent from a remote device. Please excuse any typos... Mike Segel

On Feb 28, 2012, at 9:50 AM, Niels Basjes ni...@basjes.nl wrote: Hi, Some time ago I had an idea and implemented it. Normally you can only run a single gzipped input file through a single mapper and thus only on a single CPU core. What I created makes it possible to process a Gzipped file in such a way that it can run on several mappers in parallel. I've put the javadoc I created on my homepage so you can read more about the details. http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec Now the question that was raised by one of the people reviewing this code was: Should this implementation be part of the core Hadoop feature set? The main reason that was given is that this needs a bit more understanding on what is happening and as such cannot be enabled by default. I would like to hear from the Hadoop Core/Map reduce users what you think. Should this be - a part of the default Hadoop feature set so that anyone can simply enable it by setting the right configuration? - a separate library? - a nice idea I had fun building but that no one needs? - ... ? -- Best regards / Met vriendelijke groeten, Niels Basjes
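[Editor's note: a sketch of the pattern Edward describes, on the old 0.20 mapred API - gzip-compress plain text job output and lean on the one-file-per-reducer layout so splittability never matters; the class name and reducer count are illustrative:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class GzipTextOutput {
      public static void configure(JobConf conf) {
        conf.setOutputFormat(TextOutputFormat.class);

        // Equivalent to mapred.output.compress=true plus
        // mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // N reducers produce N gzipped part files; a downstream job then gets
        // one mapper per file, so no single .gz file ever needs to be split.
        conf.setNumReduceTasks(20);
      }
    }
]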
Re: Should splittable Gzip be a core hadoop feature?
Hi,

On Wed, Feb 29, 2012 at 13:10, Michel Segel michael_se...@hotmail.com wrote: Let's play devil's advocate for a second?

I always like that :)

Why?

Because then datafiles from other systems (like the Apache HTTP webserver) can be processed more efficiently without preprocessing.

Snappy exists.

Compared to gzip: Snappy is faster, compresses a bit less, and is unfortunately not splittable.

The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively.

Yes, that and the fact that the files are smaller. Note that I've described some of these considerations in the javadoc.

Next question is how large are the gzip files in the first place?

I work for the biggest webshop in the Netherlands and I'm facing a set of logfiles that are very often 1 GB each and are gzipped. The first thing we do with them is parse and dissect each line in the very first mapper. Then we store the result in (snappy compressed) Avro files.

I don't disagree, I just want to have a solid argument in favor of it...

:)

-- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Should splittable Gzip be a core hadoop feature?
Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.com wrote: ... But being able to generate split info for them and processing them would be good as well. I remember that was a hot thing to do with lzo back in the day. The pain of once-overing the gz files to generate the split info is detracting, but it is nice to know it is there if you want it.

Note that the solution I created (HADOOP-7076) does not require any preprocessing. It can split ANY gzipped file as-is. The downside is that this effectively costs some additional performance, because the task has to decompress the first part of the file that is to be discarded. The other two ways of splitting gzipped files either require:
- creating some kind of compression index before actually using the file (HADOOP-6153)
- creating a file in a format that is generated in such a way that it is really a set of concatenated gzipped files (HADOOP-7909)

-- Best regards / Met vriendelijke groeten, Niels Basjes
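[Editor's note: for context, wiring any such codec into a job is plain configuration - Hadoop resolves codecs from io.compression.codecs by file extension. A hedged sketch; the splittable codec's class name below is a placeholder, not necessarily what HADOOP-7076 ships under:

    import org.apache.hadoop.conf.Configuration;

    public class EnableSplittableGzip {
      public static void configure(Configuration conf) {
        // Register the splittable codec alongside the defaults. Which codec
        // CompressionCodecFactory resolves for ".gz" when two codecs share
        // the extension depends on the Hadoop version, so verify before
        // relying on it.
        conf.set("io.compression.codecs",
            "org.apache.hadoop.io.compress.DefaultCodec,"
            + "org.apache.hadoop.io.compress.GzipCodec,"
            + "nl.example.SplittableGzipCodec"); // hypothetical class name
      }
    }
]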
Re: Should splittable Gzip be a core hadoop feature?
I can see a use for it, but I have two concerns about it. My biggest concern is maintainability. We have had lots of things get thrown into contrib in the past; very few people use them, and inevitably they start to suffer from bit rot. I am not saying that it will happen with this, but if you have to ask if people will use it and there has been no overwhelming yes, it makes me nervous about it. My second concern is with knowing when to use this. Anything that adds this in would have to come with plenty of documentation about how it works, how it is different from the normal gzip format, explanations about what type of a load it might put on data nodes that hold the start of the file, etc. From both of these I would prefer to see this as a github project for a while first, and once it shows that it has a significant following, or a community with it, then we can pull it in. But if others disagree I am not going to block it. I am a -0 on pulling this in now. --Bobby

On 2/29/12 10:00 AM, Niels Basjes ni...@basjes.nl wrote: Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.com wrote: ... But being able to generate split info for them and processing them would be good as well. I remember that was a hot thing to do with lzo back in the day. The pain of once-overing the gz files to generate the split info is detracting, but it is nice to know it is there if you want it.

Note that the solution I created (HADOOP-7076) does not require any preprocessing. It can split ANY gzipped file as-is. The downside is that this effectively costs some additional performance, because the task has to decompress the first part of the file that is to be discarded. The other two ways of splitting gzipped files either require:
- creating some kind of compression index before actually using the file (HADOOP-6153)
- creating a file in a format that is generated in such a way that it is really a set of concatenated gzipped files (HADOOP-7909)

-- Best regards / Met vriendelijke groeten, Niels Basjes
Re: 100x slower mapreduce compared to pig
I am going to try a few things today. I have a JAXBContext object that marshals the xml; this is a static instance, but my guess at this point is that, since this is in a separate jar than the one where the job runs and I used DistributedCache.addFileToClassPath, this context is being created on every call for some reason. I don't know why that would be. I am going to create this instance as static in the mapper class itself and see if that helps. I will also add debug logging. Will post the results after trying it out.

On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com wrote: It would be great if we can take a look at what you are doing in the UDF vs the Mapper. 100x slower does not make sense for the same job/logic; it's either the Mapper code, or maybe the cluster was busy at the time you scheduled the MapReduce job? Thanks, Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am comparing runtime of similar logic. The entire logic is exactly the same, but surprisingly the map reduce job that I submit is 100x slower. For pig I use a udf, and for hadoop I use a mapper only, with the same logic as the pig udf. Even the splits on the admin page are the same. Not sure why it's so slow. I am submitting the job like:

java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/

How should I go about finding the root cause of why it's so slow? Any suggestions would be really appreciated. One of the things I noticed is that on the admin page of the map task list I see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728, but for pig the status is blank.
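[Editor's note: a sketch of the "static in the mapper class" variant being proposed. JAXBContext.newInstance() is expensive and the context is thread-safe, so building it once per task JVM avoids paying that cost per record. FormMLType comes from the poster's own jar, so this will not compile standalone:

    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.JAXBException;
    import org.apache.hadoop.mapred.MapReduceBase;

    public abstract class FormMLMapperBase extends MapReduceBase {
      // Created once when the task JVM loads the class; creating a
      // JAXBContext per map() call can easily dominate a light mapper's
      // runtime.
      protected static final JAXB_CONTEXT_TYPE: // see below
      protected static final JAXBContext JAXB_CONTEXT;
      static {
        try {
          JAXB_CONTEXT = JAXBContext.newInstance(FormMLType.class);
        } catch (JAXBException e) {
          throw new ExceptionInInitializerError(e);
        }
      }
    }
]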
Re: Browse the filesystem weblink broken after upgrade to 1.0.0: HTTP 404 Problem accessing /browseDirectory.jsp
I can perform HDFS operations from the command line, like hadoop fs -ls /. Doesn't that mean that the datanode is up?
Re: Should splittable Gzip be a core hadoop feature?
If many people are going to use it then by all means put it in. If there is only one person, or a very small handful of people, that are going to use it, then I personally would prefer to see it as a separate project. However, Edward, you have convinced me that I am trying to make a logical judgment based only on a gut feeling and the response rate to an email chain. Thanks for that. What I really want to know is how well this new CompressionCodec performs in comparison to the regular gzip codec under various conditions, and what type of impact it has on network traffic and datanode load. My gut feeling is that the speedup is going to be relatively small except when there is a lot of computation happening in the mapper, and that the added load and network traffic outweighs the speedup in most cases; but like all performance on a complex system, gut feelings are almost worthless and hard numbers are what is needed to make a judgment call. Niels, I assume you have tested this on your cluster(s). Can you share with us some of the numbers? --Bobby Evans

On 2/29/12 11:06 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Too bad we can not up the replication on the first few blocks of the file, or distributed-cache it. The contrib statement is arguable; I could make a case that the majority of stuff should not be in hadoop-core. NLineInputFormat, for example: nice to have, but it took a long time to get ported to the new map reduce format. DBInputFormat/DataDrivenDBInputFormat: sexy for sure, but does not need to be part of core. I could see hadoop as just coming with TextInputFormat and SequenceFileInputFormat, and everything else is after-market from github.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans ev...@yahoo-inc.com wrote: I can see a use for it, but I have two concerns about it. My biggest concern is maintainability. We have had lots of things get thrown into contrib in the past; very few people use them, and inevitably they start to suffer from bit rot. I am not saying that it will happen with this, but if you have to ask if people will use it and there has been no overwhelming yes, it makes me nervous about it. My second concern is with knowing when to use this. Anything that adds this in would have to come with plenty of documentation about how it works, how it is different from the normal gzip format, explanations about what type of a load it might put on data nodes that hold the start of the file, etc. From both of these I would prefer to see this as a github project for a while first, and once it shows that it has a significant following, or a community with it, then we can pull it in. But if others disagree I am not going to block it. I am a -0 on pulling this in now. --Bobby

On 2/29/12 10:00 AM, Niels Basjes ni...@basjes.nl wrote: Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.com wrote: ... But being able to generate split info for them and processing them would be good as well. I remember that was a hot thing to do with lzo back in the day. The pain of once-overing the gz files to generate the split info is detracting, but it is nice to know it is there if you want it.

Note that the solution I created (HADOOP-7076) does not require any preprocessing. It can split ANY gzipped file as-is. The downside is that this effectively costs some additional performance, because the task has to decompress the first part of the file that is to be discarded.
The other two ways of splitting gzipped files either require:
- creating some kind of compression index before actually using the file (HADOOP-6153)
- creating a file in a format that is generated in such a way that it is really a set of concatenated gzipped files (HADOOP-7909)

-- Best regards / Met vriendelijke groeten, Niels Basjes
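[Editor's note on the replication remark above: HDFS does not expose per-block replication, but raising the replication factor of a whole hot file is a one-liner, which is one blunt way to spread the load of many mappers all reading the start of the same file. A sketch; the path is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BumpReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Replicate every block of the file 10x; the first block, which the
        // task for every split must read, is spread across more datanodes too.
        fs.setReplication(new Path("/logs/big.gz"), (short) 10);
      }
    }
]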
Streaming Hadoop using C
Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually? Thanks, Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in the MapReduce platform? Have you done profiling of the application? C

On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually? Thanks, Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
I've used hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for the child processes when running them remotely. Linux commands (top, /proc) showed me that the virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid. So basically: is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool? Thank you, Mark

On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in the MapReduce platform? Have you done profiling of the application? C

On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually? Thanks, Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
The documentation on Starfish http://www.cs.duke.edu/starfish/index.html looks promising; I have not used it. I wonder if others on the list have found it more useful than setting mapred.task.profile. C

On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for the child processes when running them remotely. Linux commands (top, /proc) showed me that the virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid. So basically: is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool? Thank you, Mark

On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in the MapReduce platform? Have you done profiling of the application? C

On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually? Thanks, Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
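[Editor's note: for the mapred.task.profile route, a sketch of turning it on through the old-API JobConf; the task range and hprof options shown are just one reasonable choice:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskProfiling {
      public static void configure(JobConf conf) {
        conf.setProfileEnabled(true);           // sets mapred.task.profile
        conf.setProfileTaskRange(true, "0-1");  // profile map tasks 0 and 1
        conf.setProfileTaskRange(false, "0-1"); // and reduce tasks 0 and 1

        // hprof agent options; %s is substituted with the profile file name,
        // and the resulting .profile files are fetched back to the directory
        // the job was submitted from.
        conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,"
            + "depth=6,force=n,thread=y,verbose=n,file=%s");
      }
    }
]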
Re: Streaming Hadoop using C
I assume you have also just tried running locally and using the JDK performance tools (e.g. jmap) to gain insight, by configuring hadoop to run the absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant?

On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for the child processes when running them remotely. Linux commands (top, /proc) showed me that the virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid. So basically: is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool? Thank you, Mark

On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in the MapReduce platform? Have you done profiling of the application? C

On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually? Thanks, Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
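[Editor's note: concretely, the minimal local setup being suggested might look like this. It is a sketch; with mapred.job.tracker=local everything runs inside one LocalJobRunner JVM, so jmap/jstat have a single, easy-to-find pid to attach to:

    import org.apache.hadoop.mapred.JobConf;

    public class LocalDebugRun {
      public static void configure(JobConf conf) {
        conf.set("mapred.job.tracker", "local"); // LocalJobRunner, single JVM
        conf.set("fs.default.name", "file:///"); // read input from local disk

        // Keep task counts minimal so memory behavior is attributable.
        conf.setNumMapTasks(1);
        conf.setNumReduceTasks(1);
      }
    }
]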
Re: 100x slower mapreduce compared to pig
I think I've found the problem. There was one line of code that caused this issue :) and that was output.collect(key, value); I had to add more logging to the code to get to it. For some reason kill -QUIT didn't send the stacktrace to userLogs/job/attempt/syslog; I searched all the logs and couldn't find one. Does anyone know where stacktraces are generally sent?

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I can't seem to find what's causing this slowness. Nothing in the logs. It's just painfully slow. However, the pig job that has the same logic is awesome in performance. Here is the mapper code and the pig code:

public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String line = value.toString();
    // log.info("output key: " + key + " value " + value + " line " + line);
    FormMLType f;
    try {
      f = FormMLUtils.convertToRows(line);
      FormMLStack fm = new FormMLStack(f, key.toString());
      fm.parseFormML();
      for (String row : fm.getFormattedRecords(false)) {
        output.collect(key, value);
      }
    } catch (JAXBException e) {
      log.error("Error processing record " + key, e);
    }
  }
}

And here is the pig udf:

public DataBag exec(Tuple input) throws IOException {
  try {
    DataBag output = mBagFactory.newDefaultBag();
    Object o = input.get(1);
    if (!(o instanceof String)) {
      throw new IOException("Expected document input to be chararray, but got " + o.getClass().getName());
    }
    Object o1 = input.get(0);
    if (!(o1 instanceof String)) {
      throw new IOException("Expected input to be chararray, but got " + o.getClass().getName());
    }
    String document = (String) o;
    String filename = (String) o1;
    FormMLType f = FormMLUtils.convertToRows(document);
    FormMLStack fm = new FormMLStack(f, filename);
    fm.parseFormML();
    for (String row : fm.getFormattedRecords(false)) {
      output.add(mTupleFactory.newTuple(row));
    }
    return output;
  } catch (ExecException ee) {
    log.error("Failed to Process ", ee);
    throw ee;
  } catch (JAXBException e) {
    // TODO Auto-generated catch block
    log.error("Invalid xml", e);
    throw new IllegalArgumentException("invalid xml " + e.getCause().getMessage());
  }
}

On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going to try a few things today. I have a JAXBContext object that marshals the xml; this is a static instance, but my guess at this point is that, since this is in a separate jar than the one where the job runs and I used DistributedCache.addFileToClassPath, this context is being created on every call for some reason. I don't know why that would be. I am going to create this instance as static in the mapper class itself and see if that helps. I will also add debug logging. Will post the results after trying it out.

On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com wrote: It would be great if we can take a look at what you are doing in the UDF vs the Mapper. 100x slower does not make sense for the same job/logic; it's either the Mapper code, or maybe the cluster was busy at the time you scheduled the MapReduce job? Thanks, Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am comparing runtime of similar logic. The entire logic is exactly the same, but surprisingly the map reduce job that I submit is 100x slower. For pig I use a udf, and for hadoop I use a mapper only, with the same logic as the pig udf. Even the splits on the admin page are the same. Not sure why it's so slow.
I am submitting the job like:

java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/

How should I go about finding the root cause of why it's so slow? Any suggestions would be really appreciated. One of the things I noticed is that on the admin page of the map task list I see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728, but for pig the status is blank.
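[Editor's note: the one-line fix implied at the top of this message is presumably to emit the formatted row instead of the whole input value on every iteration, mirroring what the Pig UDF does with output.add(mTupleFactory.newTuple(row)):

    // Inside the mapper's loop shown above; collecting `value` once per
    // formatted row re-emits the entire record each time, which multiplies
    // the output volume and would explain a ~100x gap versus the Pig UDF.
    for (String row : fm.getFormattedRecords(false)) {
      output.collect(key, new Text(row));
    }
]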
Re: Invocation exception
Thanks for the example. I did look at the logs and also at the admin page, and all I see is the exception that I posted initially. I am not sure why adding an extra jar to the classpath in DistributedCache causes that exception. I tried to look at the Configuration code in the hadoop.util package, but it doesn't tell much. It looks like it's throwing on this line: configureMethod.invoke(theObject, conf); in the code below.

private static void setJobConf(Object theObject, Configuration conf) {
  // If JobConf and JobConfigurable are in classpath, AND
  // theObject is of type JobConfigurable AND
  // conf is of type JobConf then
  // invoke configure on theObject
  try {
    Class<?> jobConfClass = conf.getClassByName("org.apache.hadoop.mapred.JobConf");
    Class<?> jobConfigurableClass = conf.getClassByName("org.apache.hadoop.mapred.JobConfigurable");
    if (jobConfClass.isAssignableFrom(conf.getClass()) &&
        jobConfigurableClass.isAssignableFrom(theObject.getClass())) {
      Method configureMethod = jobConfigurableClass.getMethod("configure", jobConfClass);
      configureMethod.invoke(theObject, conf);
    }
  } catch (ClassNotFoundException e) {
    // JobConf/JobConfigurable not in classpath. no need to configure
  } catch (Exception e) {
    throw new RuntimeException("Error in configuring object", e);
  }
}

On Tue, Feb 28, 2012 at 9:25 PM, Harsh J ha...@cloudera.com wrote: Mohit, If you visit the failed task attempt on the JT Web UI, you can see the complete, informative stack trace on it. It would point to the exact line the trouble came up in and what the real error during the configure-phase of task initialization was. A simple attempts page goes like the following (replace job ID and task ID of course):

http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964&tipid=task_201202041249_3964_m_00

Once there, find and open the All logs link to see stdout, stderr, and syslog of the specific failed task attempt. You'll have more info sifting through this to debug your issue. This is also explained in Tom's book under the title Debugging a Job (p154, Hadoop: The Definitive Guide, 2nd ed.).

On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like adding this line causes the invocation exception. I looked in hdfs and I see that file in that path:

DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);

I have similar code for another jar:

DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);

but this works just fine.

On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I commented out the reducer and combiner both, and still I see the same exception. Could it be because I have 2 jars being added?
On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote:

On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting an invocation exception, and I don't see any more details other than this exception. My job is configured as:

JobConf conf = new JobConf(FormMLProcessor.class);
conf.addResource("hdfs-site.xml");
conf.addResource("core-site.xml");
conf.addResource("mapred-site.xml");
conf.set("mapred.reduce.tasks", "0");
conf.setJobName("mlprocessor");
DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);
DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(IdentityReducer.class);

Why would you set the Reducer when the number of reducers is set to zero? Not sure if this is the real cause.

conf.setInputFormat(SequenceFileAsTextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);

-

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused
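[Editor's note on the ReflectionUtils snippet quoted earlier in this thread: configureMethod.invoke() wraps anything thrown inside configure() in an InvocationTargetException, which is then rewrapped as RuntimeException("Error in configuring object", ...). The informative exception therefore sits at the bottom of the cause chain. A small hedged helper for digging it out when reading such traces:

    // Walks the cause chain of the "Error in configuring object" wrapper;
    // the root cause is frequently a ClassNotFoundException or
    // NoClassDefFoundError from a jar missing on the task classpath.
    public static Throwable rootCause(Throwable t) {
      Throwable cur = t;
      while (cur.getCause() != null) {
        cur = cur.getCause(); // also unwraps InvocationTargetException
      }
      return cur;
    }
]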
Re: Invocation exception
Mohit, I'm positive the real exception lies a few scrolls below that message on the attempt page. Possibly a class-not-found issue. The message you see on top is emitted when something throws an exception while being configure()-ed. It is most likely a job config or setup-time issue, from your code or from the library code.

On Thu, Mar 1, 2012 at 5:19 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks for the example. I did look at the logs and also at the admin page, and all I see is the exception that I posted initially. I am not sure why adding an extra jar to the classpath in DistributedCache causes that exception. I tried to look at the Configuration code in the hadoop.util package, but it doesn't tell much. It looks like it's throwing on this line: configureMethod.invoke(theObject, conf); in the code below.

private static void setJobConf(Object theObject, Configuration conf) {
  // If JobConf and JobConfigurable are in classpath, AND
  // theObject is of type JobConfigurable AND
  // conf is of type JobConf then
  // invoke configure on theObject
  try {
    Class<?> jobConfClass = conf.getClassByName("org.apache.hadoop.mapred.JobConf");
    Class<?> jobConfigurableClass = conf.getClassByName("org.apache.hadoop.mapred.JobConfigurable");
    if (jobConfClass.isAssignableFrom(conf.getClass()) &&
        jobConfigurableClass.isAssignableFrom(theObject.getClass())) {
      Method configureMethod = jobConfigurableClass.getMethod("configure", jobConfClass);
      configureMethod.invoke(theObject, conf);
    }
  } catch (ClassNotFoundException e) {
    // JobConf/JobConfigurable not in classpath. no need to configure
  } catch (Exception e) {
    throw new RuntimeException("Error in configuring object", e);
  }
}

On Tue, Feb 28, 2012 at 9:25 PM, Harsh J ha...@cloudera.com wrote: Mohit, If you visit the failed task attempt on the JT Web UI, you can see the complete, informative stack trace on it. It would point to the exact line the trouble came up in and what the real error during the configure-phase of task initialization was. A simple attempts page goes like the following (replace job ID and task ID of course):

http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964&tipid=task_201202041249_3964_m_00

Once there, find and open the All logs link to see stdout, stderr, and syslog of the specific failed task attempt. You'll have more info sifting through this to debug your issue. This is also explained in Tom's book under the title Debugging a Job (p154, Hadoop: The Definitive Guide, 2nd ed.).

On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like adding this line causes the invocation exception. I looked in hdfs and I see that file in that path:

DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);

I have similar code for another jar:

DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);

but this works just fine.

On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I commented out the reducer and combiner both, and still I see the same exception. Could it be because I have 2 jars being added?
On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote:

On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting an invocation exception, and I don't see any more details other than this exception. My job is configured as:

JobConf conf = new JobConf(FormMLProcessor.class);
conf.addResource("hdfs-site.xml");
conf.addResource("core-site.xml");
conf.addResource("mapred-site.xml");
conf.set("mapred.reduce.tasks", "0");
conf.setJobName("mlprocessor");
DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);
DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(IdentityReducer.class);

Why would you set the Reducer when the number of reducers is set to zero? Not sure if this is the real cause.

conf.setInputFormat(SequenceFileAsTextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);

-

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Varun, sorry for my late response. Today I have deployed a new version and I can confirm that the patches you provided work well. I've been running some jobs on a 5-node cluster for an hour without a core dump under full load, so now things work as expected. Thank you again! I have used just your first option.

On 15 February 2012 19:53, mete efk...@gmail.com wrote: Well, rebuilding ganglia seemed easier, and Merto was testing the other, so I thought that I should give that one a chance :) Anyway, I will send you gdb details or patch hadoop and try it at my earliest convenience. Cheers

On Wed, Feb 15, 2012 at 6:59 PM, Varun Kapoor rez...@hortonworks.com wrote: The warnings about underflow are totally expected (they come from strtod(), and they will no longer occur with Hadoop-1.0.1, which applies my patch from HADOOP-8052), so that's not worrisome. As for the buffer overflow, do you think you could show me a backtrace of this core? If you can't find the core file on disk, just start gmetad under gdb, like so:

$ sudo gdb <path to gmetad>
(gdb) r --conf=<path to your gmetad.conf>
... ::Wait for crash::
(gdb) bt
(gdb) info locals

If you're familiar with gdb, then I'd appreciate any additional diagnosis you could perform (for example, to figure out which metric's value caused this buffer overflow) - if you're not, I'll try and send you some gdb scripts to narrow things down once I see the output from this round of debugging. Also, out of curiosity, is patching Hadoop not an option for you? Or is it just that rebuilding (and redeploying) ganglia is the lesser of the 2 evils? :) Varun

On Tue, Feb 14, 2012 at 11:43 PM, mete efk...@gmail.com wrote: Hello Varun, I have patched and recompiled ganglia from source but it still cores after the patch. Here are some logs:

Feb 15 09:39:14 master gmetad[16487]: RRD_update (/var/lib/ganglia/rrds/hadoop/slave4/metricssystem.MetricsSystem.publish_max_time.rrd): /var/lib/ganglia/rrds/hadoop/slave4/metricssystem.MetricsSystem.publish_max_time.rrd: converting '4.9E-324' to float: Numerical result out of range
Feb 15 09:39:14 master gmetad[16487]: RRD_update (/var/lib/ganglia/rrds/hadoop/master/metricssystem.MetricsSystem.publish_imax_time.rrd): /var/lib/ganglia/rrds/hadoop/master/metricssystem.MetricsSystem.publish_imax_time.rrd: converting '4.9E-324' to float: Numerical result out of range
Feb 15 09:39:14 master gmetad[16487]: RRD_update (/var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_imax_time.rrd): /var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_imax_time.rrd: converting '4.9E-324' to float: Numerical result out of range
Feb 15 09:39:14 master gmetad[16487]: RRD_update (/var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.snapshot_imax_time.rrd): /var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.snapshot_imax_time.rrd: converting '4.9E-324' to float: Numerical result out of range
Feb 15 09:39:14 master gmetad[16487]: RRD_update (/var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_max_time.rrd): /var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_max_time.rrd: converting '4.9E-324' to float: Numerical result out of range
Feb 15 09:39:14 master gmetad[16487]: *** buffer overflow detected ***: gmetad terminated

I am using hadoop 1.0.0 and the ganglia 3.2.0 tarball. Cheers, Mete

On Sat, Feb 11, 2012 at 2:19 AM, Merto Mertek masmer...@gmail.com wrote: Varun, unfortunately I have had some problems with deploying a new version on the cluster..
Hadoop is not picking up the new build in the lib folder despite the classpath being set to it. The new build is picked up just if I put it in $HD_HOME/share/hadoop/, which is very strange.. I've done this on all nodes and can access the web UI, but all tasktrackers are being stopped because of an error:

INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)

Probably the error is the consequence of an inadequate deploy of a jar.. I will ask the dev list how they do it, or do you maybe have any other idea?

On 10 February 2012 17:10, Varun Kapoor rez...@hortonworks.com wrote: Hey Merto, Any luck getting the patch running on your cluster? In case you're interested, there's now a JIRA for this: https://issues.apache.org/jira/browse/HADOOP-8052. Varun

On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor
Re: Streaming Hadoop using C
Thank you for your time and suggestions. I've already tried Starfish, but not jmap. I'll check it out. Thanks again, Mark

On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com wrote: I assume you have also just tried running locally and using the JDK performance tools (e.g. jmap) to gain insight, by configuring hadoop to run the absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant?

On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for the child processes when running them remotely. Linux commands (top, /proc) showed me that the virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid. So basically: is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool? Thank you, Mark

On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in the MapReduce platform? Have you done profiling of the application? C

On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory even though it's running on a VM eventually? Thanks, Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc(), sizeof()? My guess is no, since this all will eventually be turned into bytecode, but I need more control over memory, which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark