Re: Best practices on splitting an input line?
Hi Andy, Your problem seems to be a general Java problem rather than a Hadoop one; you may get better help in a Java forum. String.split uses regular expressions, which you definitely don't need. I would write my own split function, without regular expressions. This link may help to better understand the underlying operations: http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split Also, there is a constructor of StringTokenizer that returns the delimiters as well: StringTokenizer(String str, String delim, boolean returnDelims); (I would write my own, though.) Rasit 2009/2/10 Andy Sautins andy.saut...@returnpath.net: I have a question. I've dabbled with different ways of tokenizing an input file line for processing. I've noticed in my somewhat limited tests that there seem to be some pretty reasonable performance differences between tokenizing methods. For example, roughly speaking, when splitting a line into tokens (tab-delimited in my case), Scanner is the slowest, followed by String.split, with StringTokenizer being the fastest. StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens (i.e., parsing a,b,c,,d would return a,b,c,d instead of a,b,c,,d). The WordCount example uses StringTokenizer, which makes sense to me, except I'm currently getting hung up on it not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html), which seems possible to use, but it sure seems like someone else has solved this problem better than I have. So, my question is: is there a best practice for splitting an input line, especially when NULL tokens are expected (i.e., two consecutive delimiter characters)? Any thoughts would be appreciated. Thanks Andy -- M. Raşit ÖZDAŞ
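For illustration, a hand-rolled split on a single-character delimiter that keeps empty tokens (a rough sketch, not code from this thread) could look like the following; it avoids both the regex machinery behind String.split and StringTokenizer's habit of skipping empty fields:

import java.util.ArrayList;
import java.util.List;

public class SimpleSplit {
    // Split 'line' on a single-character delimiter, keeping empty tokens,
    // e.g. splitting "a\tb\tc\t\td" on '\t' yields [a, b, c, , d].
    public static List<String> split(String line, char delim) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int idx;
        while ((idx = line.indexOf(delim, start)) != -1) {
            tokens.add(line.substring(start, idx));
            start = idx + 1;
        }
        tokens.add(line.substring(start)); // trailing token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a\tb\tc\t\td", '\t')); // prints [a, b, c, , d]
    }
}

Note also that String.split only drops trailing empty fields; passing a negative limit, e.g. line.split("\t", -1), keeps those too, at the cost of the regex overhead discussed above.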
HDFS on non-identical nodes
Hi, We're running a Hadoop cluster on 4 nodes; our primary purpose is to provide a distributed storage solution for internal applications here at TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB, another two with 3TB, and one more with 60GB). While copying data onto HDFS we noticed that the node with 60GB of storage ran out of disk space, and even the balancer couldn't balance because the cluster was stopped. Now my questions are: 1. Is Hadoop suitable for non-identical cluster nodes? 2. Is there any way to balance the nodes automatically? 3. Why does the Hadoop cluster stop when one node runs out of disk? Any further input is appreciated! Cheers, Deepak TellyTopia Inc.
Re: stable version
I am working on Hadoop SVN version 0.21.0-dev. I am having some problems running its examples/files from Eclipse. It gives this error: Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone resolve this or give some idea about it? Thanks. 2009/2/12 Raghu Angadi rang...@yahoo-inc.com Vadim Zaliva wrote: The particular problem I am having is this one: https://issues.apache.org/jira/browse/HADOOP-2669 I am observing it in version 19. Could anybody confirm that it has been fixed in 18, as Jira claims? I am wondering why the bug fix for this problem might have been committed to the 18 branch but not 19. If it was committed to both, then perhaps the problem was not completely solved and downgrading to 18 will not help me. If you read through the comments, you will see that the root cause was never found. The patch just fixes one of the suspects. If you are still seeing this, please file another jira and link it to HADOOP-2669. How easy is it for you to reproduce this? I guess one of the reasons for the incomplete diagnosis is that it is not simple to reproduce. Raghu. Vadim On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote: Yes, version 18.3 is the most stable one. It has additional patches, without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for a production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Re: Backing up HDFS?
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote: The key here is to prioritize your data. Impossible-to-replicate data gets backed up using whatever means necessary; hard-to-regenerate data is the next priority. Easy-to-regenerate and OK-to-nuke data doesn't get backed up. I think that's good advice to start with when creating a backup strategy. E.g. what we do at the moment is analyze huge volumes of access logs: we import those logs into HDFS, create aggregates for several metrics, and finally store the results in sequence files using block-level compression. It's kind of an intermediate format that can be used for further analysis. Those files end up being pretty small and are exported daily to storage and backed up. In case HDFS goes to hell we can restore some raw log data from the servers and only lose historical logs, which should not be a big deal. I must also add that I really enjoy the great deal of optimization opportunities that Hadoop gives you by letting you implement the serialization strategies directly. You really get control over every bit and byte that gets recorded. Same with compression. So you can make the best trade-offs possible and finally store only the data you really need.
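As a rough illustration of that kind of intermediate format (a sketch only, not the poster's actual code; the output path and key/value types are invented), writing a block-compressed SequenceFile looks roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteAggregates {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/aggregates/2009-02-12.seq"); // hypothetical output path
        // BLOCK compression packs many records into each compressed block,
        // which suits small, repetitive aggregate records well.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, LongWritable.class,
            SequenceFile.CompressionType.BLOCK);
        try {
            writer.append(new Text("hits:/index.html"), new LongWritable(12345));
        } finally {
            writer.close();
        }
    }
}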
Re: stable version
Anum Ali wrote: Iam working on Hadoop SVN version 0.21.0-dev. Having some problems , regarding running its examples/file from eclipse. It gives error for Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone reslove or give some idea about it. You are using Java6, correct?
Re: Best practices on splitting an input line?
Stefan Podkowinski wrote: I'm currently using OpenCSV, which can be found at http://opencsv.sourceforge.net/, but haven't done any performance tests on it yet. In my case simply splitting strings would not work anyway, since I need to handle quotes and separators within quoted values, e.g. "a,a",b,c. I've used it in the past; found it pretty reliable. Again, no perf tests, just reading in CSV files exported from spreadsheets.
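For what it's worth, a minimal OpenCSV usage sketch (assuming the classic au.com.bytecode.opencsv package; not code from this thread) for a field that contains the separator inside quotes:

import java.io.StringReader;
import au.com.bytecode.opencsv.CSVReader;

public class QuotedCsvExample {
    public static void main(String[] args) throws Exception {
        // Three fields, the first of which contains a comma protected by quotes.
        CSVReader reader = new CSVReader(new StringReader("\"a,a\",b,c"));
        String[] fields = reader.readNext(); // ["a,a", "b", "c"]
        for (String field : fields) {
            System.out.println(field);
        }
        reader.close();
    }
}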
how to use FileSystem Object API
I'm trying to run the following code: public class localtohdfs { public static void main() { Configuration config = new Configuration(); FileSystem hdfs = FileSystem.get(config); Path srcPath = new Path("/root/testfile"); Path dstPath = new Path("testfile_hadoop"); hdfs.copyFromLocalFile(srcPath, dstPath); } } When I run javac I see: [r...@nlb-2 hadoop-0.17.2.1]# javac localtohdfs.java localtohdfs.java:3: package org.apache.hadoop does not exist import org.apache.hadoop.*; ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:8: cannot find symbol symbol : class FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:8: cannot find symbol symbol : variable FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path("/root/testfile"); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path("/root/testfile"); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path("testfile_hadoop"); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path("testfile_hadoop"); ^ 9 errors CLASS_PATH=/root/newhadoop/hadoop-0.17.2.1/src/java:/root/newhadoop/hadoop-0.17.2.1/conf:/usr/java/jdk1.6.0/lib PATH=/usr/java/jdk1.6.0/bin:/usr/local/apache-ant-1.7.1/bin:/root/newhadoop/hadoop-0.17.2.1/bin:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin HADOOP_HOME=/root/newhadoop/hadoop-0.17.2.1 JAVA_HOME=/usr/java/jdk1.6.0/ Where is the problem? Thanks and regards!
Re: how to use FileSystem Object API
you have to specify a hadoop-xxx-core.jar file. for example: javac -cp hadoop-xxx-core.jar localtohdfs.java On Thu, Feb 12, 2009 at 7:58 PM, Ved Prakash meramailingl...@gmail.comwrote: I m trying to run following code public class localtohdfs { public static void main() { Configuration config = new Configuration(); FileSystem hdfs = FileSystem.get(config); Path srcPath = new Path(/root/testfile); Path dstPath = new Path(testfile_hadoop); hdfs.copyFromLocalFile(srcPath, dstPath); } } when I do javac i see [r...@nlb-2 hadoop-0.17.2.1]# javac localtohdfs.java localtohdfs.java:3: package org.apache.hadoop does not exist import org.apache.hadoop.*; ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:8: cannot find symbol symbol : class FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:8: cannot find symbol symbol : variable FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path(/root/testfile); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path(/root/testfile); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path(testfile_hadoop); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path(testfile_hadoop); ^ 9 errors CLASS_PATH=/root/newhadoop/hadoop-0.17.2.1/src/java:/root/newhadoop/hadoop-0.17.2.1/conf:/usr/java/jdk1.6.0/lib PATH=/usr/java/jdk1.6.0/bin:/usr/local/apache-ant-1.7.1/bin:/root/newhadoop/hadoop-0.17.2.1/bin:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin HADOOP_HOME=/root/newhadoop/hadoop-0.17.2.1 JAVA_HOME=/usr/java/jdk1.6.0/ Where is the problem? thanks and regards!
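For reference, here is the same program with the specific imports spelled out and the standard main(String[] args) signature (an untested sketch; the paths are just the poster's examples). It compiles with something like javac -cp $HADOOP_HOME/hadoop-0.17.2.1-core.jar localtohdfs.java, and running it needs the same jars plus the conf directory on the classpath, which the bin/hadoop script sets up for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class localtohdfs {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();   // reads hadoop-site.xml from the classpath
        FileSystem hdfs = FileSystem.get(config);     // the filesystem named by fs.default.name
        Path srcPath = new Path("/root/testfile");    // local source file
        Path dstPath = new Path("testfile_hadoop");   // relative to your HDFS home directory
        hdfs.copyFromLocalFile(srcPath, dstPath);
    }
}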
Re: stable version
yes On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote: Anum Ali wrote: Iam working on Hadoop SVN version 0.21.0-dev. Having some problems , regarding running its examples/file from eclipse. It gives error for Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone reslove or give some idea about it. You are using Java6, correct?
Re: Finding small subset in very large dataset
Thanks, I didn't think about the bloom filter variant. That's the solution I was looking for :-) Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Finding small subset in very large dataset
Bloom Filters are one of the greatest things ever, so it is nice to see another application. Remember that your filter may make mistakes - it can report items as being in the set that are not (false positives). Also, instead of setting a single bit per item (in the A set), set k distinct bits. You can analytically work out the best k for a given number of items and for some amount of memory. In practice, this usually boils down to k being 3 or so for a reasonable error rate. Happy hunting Miles 2009/2/12 Thibaut_ tbr...@blue.lu: Thanks, I didn't think about the bloom filter variant. That's the solution I was looking for :-) Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
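To make the k-bits idea concrete, here is a minimal membership-test sketch (not from this thread; the two-hash trick for deriving k probe positions is just one common choice). The analytically best k for n items in m bits comes out to roughly (m/n) * ln 2:

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash probes per item

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.m = numBits;
        this.k = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th probe position from two base hashes of the item.
    private int probe(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // a second, cheap hash
        return Math.abs((h1 + i * h2) % m);
    }

    public void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(probe(item, i));
        }
    }

    // May report items that were never added (false positives),
    // but never misses an item that was added.
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(item, i))) {
                return false;
            }
        }
        return true;
    }
}

(If I remember correctly, later Hadoop releases also ship a BloomFilter under org.apache.hadoop.util.bloom, so it is worth checking before rolling your own.)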
Re: Re: Re: Re: Re: Re: Regarding Hadoop multi cluster set-up
I changed the value... It is still not working! Shefali On Tue, 10 Feb 2009 22:23:10 +0530 wrote: in hadoop-site.xml change master:54311 to hdfs://master:54311 --nitesh On Tue, Feb 10, 2009 at 9:50 PM, shefali pawar wrote: I tried that, but it is not working either! Shefali On Sun, 08 Feb 2009 05:27:54 +0530 wrote: I ran into this trouble again. This time, formatting the namenode didn't help. So, I changed the directories where the metadata and the data were being stored. That made it work. You might want to check this at your end too. Amandeep PS: I don't have an explanation for how and why this made it work. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Feb 7, 2009 at 9:06 AM, jason hadoop wrote: On your master machine, use the netstat command to determine what ports and addresses the namenode process is listening on. On the datanode machines, examine the log files to verify that the datanode has attempted to connect to the namenode IP address on one of those ports, and was successful. The common ports used for the datanode-namenode rendezvous are 50010, 54320 and 8020, depending on your Hadoop version. If the datanodes have been started, and the connection to the namenode failed, there will be a log message with a socket error, indicating what host and port the datanode used to attempt to communicate with the namenode. Verify that that IP address is correct for your namenode and reachable from the datanode host (for multi-homed machines this can be an issue), and that the port listed is one of the TCP ports that the namenode process is listening on. For Linux, you can use the command *netstat -a -t -n -p | grep java | grep LISTEN* to determine the IP addresses, ports, and PIDs of the java processes that are listening for TCP socket connections, and the jps command from the bin directory of your Java installation to determine the PID of the namenode. On Sat, Feb 7, 2009 at 6:27 AM, shefali pawar wrote: Hi, No, not yet. We are still struggling! If you find the solution please let me know. Shefali On Sat, 07 Feb 2009 02:56:15 +0530 wrote: I had to change the master on my running cluster and ended up with the same problem. Were you able to fix it at your end? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Feb 5, 2009 at 8:46 AM, shefali pawar wrote: Hi, I do not think that the firewall is blocking the port because it has been turned off on both the computers! And also since it is a random port number I do not think it should create a problem. I do not understand what is going wrong! Shefali On Wed, 04 Feb 2009 23:23:04 +0530 wrote: I'm not certain that the firewall is your problem, but if that port is blocked on your master you should open it to let communication through. Here is one website that might be relevant: http://stackoverflow.com/questions/255077/open-ports-under-fedora-core-8-for-vmware-server but again, this may not be your problem. John On Wed, Feb 4, 2009 at 12:46 PM, shefali pawar wrote: Hi, I will have to check. I can do that tomorrow in college. But if that is the case what should I do? Should I change the port number and try again? Shefali On Wed, 04 Feb 2009 S D wrote: Shefali, Is your firewall blocking port 54310 on the master? John On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar wrote: Hi, I am trying to set up a two-node cluster using Hadoop 0.19.0, with 1 master (which should also work as a slave) and 1 slave node. 
But while running bin/start-dfs.sh the datanode is not starting on the slave. I had read the previous mails on the list, but nothing seems to be working in this case. I am getting the following error in the hadoop-root-datanode-slave log file while running the command bin/start-dfs.sh = 2009-02-03 13:00:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = slave/172.16.0.32 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.19.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19-r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 / 2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
Re: HDFS on non-identical nodes
On Feb 12, 2009, at 2:54 AM, Deepak wrote: Hi, We're running Hadoop cluster on 4 nodes, our primary purpose of running is to provide distributed storage solution for internal applications here in TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB another two with 3 TB and one more with 60GB) while copying data on HDFS we noticed that node with 60GB storage ran out of disk-space and even balancer couldn't balance because cluster was stopped. Now my questions are 1. Is Hadoop is suitable for non-identical cluster nodes? Yes. Our cluster has between 60GB and 40TB on our nodes. The majority have around 3TB. 2. Is there any way to automatically balancing of nodes? We have a cron script which automatically starts the Balancer. It's dirty, but it works. 3. Why Hadoop cluster stops when one node ran our of disk? That's not normal. Trust me, if that was always true, we'd be perpetually screwed :) There might be some other underlying error you're missing... Brian Any futher inputs are appericiapted! Cheers, Deepak TellyTopia Inc.
Re: what's going on :( ?
I see it is picking up other parameters from the config, so my hypothesis is that in 0.19 the file system is listening on 8020. I went back to 18.3 and also did not change this HDFS port this time, preempting the question, so I am fine for now. Mark On Thu, Feb 12, 2009 at 1:40 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Hi, Mark Try to add an extra property to that file, and examine whether hadoop recognizes it. This way you can find out if hadoop uses your configuration file. 2009/2/10 Jeff Hammerbacher ham...@cloudera.com: Hey Mark, In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020. From my understanding of the code, your fs.default.name setting should have overridden this port to be 9000. It appears your Hadoop installation has not picked up the configuration settings appropriately. You might want to see if you have any Hadoop processes running and terminate them (bin/stop-all.sh should help) and then restart your cluster with the new configuration to see if that helps. Later, Jeff On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote: Mark Kerzner wrote: Hi, why is hadoop suddenly telling me Retrying connect to server: localhost/127.0.0.1:8020 with this configuration: <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> Shouldn't this be <value>hdfs://localhost:9001</value>? Amar </property> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> and both the http://localhost:50070/dfshealth.jsp and http://localhost:50030/jobtracker.jsp links work fine? Thank you, Mark -- M. Raşit ÖZDAŞ
Re: HDFS on non-identical nodes
I think you should confirm your balancer is still running. Did you change the threshold of the HDFS balancer? Maybe it is too large? The balancer will stop working when it meets one of 5 conditions: 1. The datanodes are balanced (obviously yours are not this case); 2. No more blocks can be moved (all blocks on the unbalanced nodes are busy or recently used); 3. No block has been moved in 20 minutes and 5 consecutive attempts; 4. Another balancer is working; 5. I/O exception. The default threshold is 10% for each datanode: for 1TB it is 100GB, for 3TB it is 300GB, and for 60GB it is 6GB. Hope this helps. On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman bbock...@cse.unl.edu wrote: On Feb 12, 2009, at 2:54 AM, Deepak wrote: Hi, We're running a Hadoop cluster on 4 nodes, our primary purpose is to provide a distributed storage solution for internal applications here at TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB, another two with 3TB, and one more with 60GB); while copying data onto HDFS we noticed that the node with 60GB of storage ran out of disk space and even the balancer couldn't balance because the cluster was stopped. Now my questions are: 1. Is Hadoop suitable for non-identical cluster nodes? Yes. Our cluster has between 60GB and 40TB on our nodes. The majority have around 3TB. 2. Is there any way to balance the nodes automatically? We have a cron script which automatically starts the Balancer. It's dirty, but it works. 3. Why does the Hadoop cluster stop when one node runs out of disk? That's not normal. Trust me, if that was always true, we'd be perpetually screwed :) There might be some other underlying error you're missing... Brian Any further input is appreciated! Cheers, Deepak TellyTopia Inc. -- Chen He RCF CSE Dept. University of Nebraska-Lincoln US
Re: stable version
Anum Ali wrote: yes On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote: Anum Ali wrote: Iam working on Hadoop SVN version 0.21.0-dev. Having some problems , regarding running its examples/file from eclipse. It gives error for Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone reslove or give some idea about it. You are using Java6, correct? well, in that case something being passed down to setXIncludeAware may be picked up as invalid. More of a stack trace may help. Otherwise, now is your chance to learn your way around the hadoop codebase, and ensure that when the next version ships, your most pressing bugs have been fixed
RE: How to use DBInputFormat?
Amandeep, I spoke w/ one of our Oracle DBAs and he suggested changing the query statement as follows: MySQL stmt: select * from TABLE limit splitlength offset splitstart --- Oracle stmt: select * from (select a.*, rownum rno from (your_query_here, which must contain an order by) a where rownum <= splitstart + splitlength) where rno > splitstart; This can be put into a function, but would require a type as well. - If you edit org.apache.hadoop.mapred.lib.db.DBInputFormat, getSelectQuery, it should work in Oracle: protected String getSelectQuery() { ... edit to include a check for the driver and create the Oracle stmt ... return query.toString(); } Brian == On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only a few datasets. That's due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) and 2) creating splits by issuing an individual query for each split with a limit and offset parameter appended to the input SQL query. Effectively your input query select * from orders would become select * from orders limit splitlength offset splitstart and be executed until the count has been reached. I guess this is not valid SQL syntax for Oracle. Stefan 2009/2/4 Amandeep Khurana ama...@gmail.com: Adding a semicolon gives me the error ORA-00911: Invalid character Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Amandeep, SQL command not properly ended - I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try. Rasit 2009/2/4 Amandeep Khurana ama...@gmail.com: The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote: Thanks Kevin. I couldn't get it to work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! 
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Exception closing file /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at
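To make the MySQL-vs-Oracle rewrite above concrete, here is a small standalone sketch of the query construction (hypothetical class and column names, not the actual DBInputFormat code; the Oracle branch follows the DBA's nested-ROWNUM pattern and assumes the inner query carries an ORDER BY):

public class SplitQueryBuilder {
    // Builds the per-split SELECT for MySQL-style versus Oracle-style paging.
    public static String buildQuery(String baseQueryWithOrderBy, long splitStart,
                                    long splitLength, boolean oracle) {
        if (oracle) {
            // Oracle has no LIMIT/OFFSET, so page with nested ROWNUM instead.
            return "SELECT * FROM (SELECT a.*, ROWNUM rno FROM ("
                 + baseQueryWithOrderBy + ") a WHERE ROWNUM <= "
                 + (splitStart + splitLength) + ") WHERE rno > " + splitStart;
        }
        // MySQL-style paging, as described for the stock 0.19 implementation above.
        return baseQueryWithOrderBy + " LIMIT " + splitLength + " OFFSET " + splitStart;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("SELECT * FROM orders ORDER BY order_id", 1000, 500, true));
        System.out.println(buildQuery("SELECT * FROM orders ORDER BY order_id", 1000, 500, false));
    }
}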
Too many open files in 0.18.3
Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Re: Too many open files in 0.18.3
I once had too many open files when I was opening too many sockets and not closing them... On Thu, Feb 12, 2009 at 1:56 PM, Sean Knapp s...@ooyala.com wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Eclipse plugin
Hi, I am using the VM image hadoop-appliance-0.18.0.vmx and an Eclipse plug-in of Hadoop. I have followed all the steps in this tutorial: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My problem is that I am not able to browse the HDFS. It only shows an entry Error:null. Upload files to DFS and Create new directory fail. Any suggestions? I have tried to change all the directories in the hadoop location advanced parameters to /tmp/hadoop-user, but it did not work. Also, the tutorial mentioned a parameter hadoop.job.ugi that needs to be changed, but I could not find it in the list of parameters. Thanks Iman
Measuring IO time in map/reduce jobs?
Hey all, Does anyone have any experience trying to measure IO time spent in their map/reduce jobs? I know how to profile a sample of map and reduce tasks, but that appears to exclude IO time. Just subtracting the total cpu time from the total run time of a task seems like too coarse an approach. -Bryan
Re: Eclipse plugin
Are you running Eclipse on Windows? If so, be aware that you need to spawn Eclipse from within Cygwin in order to access HDFS. It seems that the plugin uses whoami to get info about the active user. This thread has some more info: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e Norbert On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote: Hi, I am using the VM image hadoop-appliance-0.18.0.vmx and an Eclipse plug-in of Hadoop. I have followed all the steps in this tutorial: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My problem is that I am not able to browse the HDFS. It only shows an entry Error:null. Upload files to DFS and Create new directory fail. Any suggestions? I have tried to change all the directories in the hadoop location advanced parameters to /tmp/hadoop-user, but it did not work. Also, the tutorial mentioned a parameter hadoop.job.ugi that needs to be changed, but I could not find it in the list of parameters. Thanks Iman
Hadoop User Group Meeting (Bay Area) 2/18
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday, February 18th at Yahoo! 2811 Mission College Blvd, Santa Clara, Building 2, Training Rooms 5 & 6, from 6:00-7:30 pm. Agenda: Fair Scheduler for Hadoop - Matei Zaharia; Interfacing with MySQL - Aaron Kimball. Registration: http://upcoming.yahoo.com/event/1776616/ As always, suggestions for topics for future meetings are welcome. Please send them to me directly at aan...@yahoo-inc.com Look forward to seeing you there! Ajay
Re: Eclipse plugin
Thank you so much, Norbert. It worked. Iman Norbert Burger wrote: Are running Eclipse on Windows? If so, be aware that you need to spawn Eclipse from within Cygwin in order to access HDFS. It seems that the plugin uses whoami to get info about the active user. This thread has some more info: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e Norbert On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote: Hi, I am using VM image hadoop-appliance-0.18.0.vmx and an eclipse plug-in of hadoop. I have followed all the steps in this tutorial: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My problem is that I am not able to browse the HDFS. It only shows an entry Error:null. Upload files to DFS, and Create new directory fail. Any suggestions? I have tried to chang all the directories in the hadoop location advanced parameters to /tmp/hadoop-user, but it did not work. Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to be changed, but I could not find it in the list of parameters. Thanks Iman
Re: Too many open files in 0.18.3
You are most likely hit by https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back ported. There is a 0.18 patch posted there. btw, does 16k help in your case? Ideally 1k should be enough (with small number of clients). Please try the above patch with 1k limit. Raghu. Sean Knapp wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Running RowCounter as Standalone
I am trying to run HBase as a Data Source for my Map/Reduce. I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output test... from the command line. But I would like to run it from my eclipse as well, so I can learn and adapt it to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the code of RowCounter. But I get the following error message: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at hbase.RowCounter.main(RowCounter.java:139) Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.ParseException at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 2 more Any comments on what I can try to do? Best, Philipp -- Jetzt 1 Monat kostenlos! GMX FreeDSL - Telefonanschluss + DSL für nur 17,95 Euro/mtl.!* http://dsl.gmx.de/?ac=OM.AD.PD003K11308T4569a
Re: Running RowCounter as Standalone
Philipp, For HBase-related questions, please post to hbase-u...@hadoop.apache.org Try importing commons-cli-2.0-SNAPSHOT.jar as well as any other jar in the lib folder just to be sure you won't get any other missing class def error. J-D On Thu, Feb 12, 2009 at 6:32 PM, Philipp Dobrigkeit pdobrigk...@gmx.dewrote: I am trying to run HBase as a Data Source for my Map/Reduce. I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output test... from the command line. But I would like to run it from my eclipse as well, so I can learn and adapt it to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the code of RowCounter. But I get the following error message: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at hbase.RowCounter.main(RowCounter.java:139) Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.ParseException at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 2 more Any comments on what I can try to do? Best, Philipp -- Jetzt 1 Monat kostenlos! GMX FreeDSL - Telefonanschluss + DSL für nur 17,95 Euro/mtl.!* http://dsl.gmx.de/?ac=OM.AD.PD003K11308T4569a
Re: Running RowCounter as Standalone [solved]
Thank you and sorry for the mixup with the lists, I was convinced since I call ./hadoop to run it, that it might be more related to this list. But I now imported all the .jars in the lib folders of both Hadoop and Hbase and now it works. Best, Philipp Original-Nachricht Datum: Thu, 12 Feb 2009 18:54:00 -0500 Von: Jean-Daniel Cryans jdcry...@apache.org An: core-user@hadoop.apache.org, hbase-u...@hadoop.apache.org hbase-u...@hadoop.apache.org Betreff: Re: Running RowCounter as Standalone Philipp, For HBase-related questions, please post to hbase-u...@hadoop.apache.org Try importing commons-cli-2.0-SNAPSHOT.jar as well as any other jar in the lib folder just to be sure you won't get any other missing class def error. J-D On Thu, Feb 12, 2009 at 6:32 PM, Philipp Dobrigkeit pdobrigk...@gmx.dewrote: I am trying to run HBase as a Data Source for my Map/Reduce. I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output test... from the command line. But I would like to run it from my eclipse as well, so I can learn and adapt it to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the code of RowCounter. But I get the following error message: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at hbase.RowCounter.main(RowCounter.java:139) Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.ParseException at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 2 more Any comments on what I can try to do? Best, Philipp -- Jetzt 1 Monat kostenlos! GMX FreeDSL - Telefonanschluss + DSL für nur 17,95 Euro/mtl.!* http://dsl.gmx.de/?ac=OM.AD.PD003K11308T4569a
Very large file copied to cluster, and the copy fails. All blocks bad
hello, I have a 42 GB file on the local fs (call the machine A) which I need to copy to HDFS (replication 1); according to the HDFS webtracker it has 208GB across 7 machines. Note, machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option I'm missing that forces the file to be split across the machines (when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Too many open files in 0.18.3
Raghu, Thanks for the quick response. I've been beating up on the cluster for a while now and so far so good. I'm still at 8k... what should I expect to find with 16k versus 1k? The 8k didn't appear to be affecting things to begin with. Regards, Sean On Thu, Feb 12, 2009 at 2:07 PM, Raghu Angadi rang...@yahoo-inc.com wrote: You are most likely hit by https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back ported. There is a 0.18 patch posted there. btw, does 16k help in your case? Ideally 1k should be enough (with small number of clients). Please try the above patch with 1k limit. Raghu. Sean Knapp wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Re: Very large file copied to cluster, and the copy fails. All blocks bad
Did you run the copy command from machine A? I believe that if you do the copy from an hdfs client that is on the same machine as a data node, then for each block the primary copy always goes to that data node, and only the replicas get distributed among other data nodes. I ran into this issue once -- I had to have the client doing the copy either on the master or on an off-cluster node. -TCK --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote: From: Saptarshi Guha saptarshi.g...@gmail.com Subject: Very large file copied to cluster, and the copy fails. All blocks bad To: core-user@hadoop.apache.org core-user@hadoop.apache.org Date: Thursday, February 12, 2009, 9:50 PM hello, I have a 42 GB file on the local fs(call the machine A) which i need to copy to a HDFS (replicattion 1), according the HDFS webtracker it has 208GB across 7 machines. Note, the machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option i'm missing that forces the file to be split across the machines(when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Very large file copied to cluster, and the copy fails. All blocks bad
Did you run the copy command from machine A? Yes, exactly. I had to have the client doing the copy either on the master or on an off-cluster Thanks! I uploaded it from an off cluster (i.e not participating in the hdfs) and it worked splendidly. Regards Saptarshi On Thu, Feb 12, 2009 at 11:03 PM, TCK moonwatcher32...@yahoo.com wrote: I believe that if you do the copy from an hdfs client that is on the same machine as a data node, then for each block the primary copy always goes to that data node, and only the replicas get distributed among other data nodes. I ran into this issue once -- I had to have the client doing the copy either on the master or on an off-cluster node. -TCK --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote: From: Saptarshi Guha saptarshi.g...@gmail.com Subject: Very large file copied to cluster, and the copy fails. All blocks bad To: core-user@hadoop.apache.org core-user@hadoop.apache.org Date: Thursday, February 12, 2009, 9:50 PM hello, I have a 42 GB file on the local fs(call the machine A) which i need to copy to a HDFS (replicattion 1), according the HDFS webtracker it has 208GB across 7 machines. Note, the machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option i'm missing that forces the file to be split across the machines(when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com