Re: Processing small xml files
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill wrote:
> I've used the Mahout XMLInputFormat. It is the right tool if you have an
> XML file with one type of section repeated over and over again and want to
> turn that into a SequenceFile where each repeated section is a value. I've
> found it helpful as a preprocessing step for converting raw XML input into
> something that can be handled by Hadoop jobs.

Thanks for the input. Do you first convert the XML into a flat format and then run another Hadoop job, or do you just read the XML SequenceFile and perform the reduce on that? Is there an advantage to first converting it into a flat file format?

> If you're worried about having lots of small files--specifically, about
> overwhelming your namenode because you have too many small
> files--the XMLInputFormat won't help with that. However, it may be possible
> to concatenate the small files into larger files, then have a Hadoop job
> that uses XMLInputFormat transform the large files into sequence files.

How many files are too many for the namenode? We have around 100M files, and we add another 100M every year.
Re: Processing small xml files
I've used the Mahout XMLInputFormat. It is the right tool if you have an XML file with one type of section repeated over and over again and want to turn that into a SequenceFile where each repeated section is a value. I've found it helpful as a preprocessing step for converting raw XML input into something that can be handled by Hadoop jobs.

If you're worried about having lots of small files--specifically, about overwhelming your namenode because you have too many small files--the XMLInputFormat won't help with that. However, it may be possible to concatenate the small files into larger files, then have a Hadoop job that uses XMLInputFormat transform the large files into sequence files.
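For anyone unfamiliar with it: the XMLInputFormat works by scanning for a configured start tag and end tag and emitting everything in between as one record. The Python below is only an illustration of that splitting logic, not the actual Java implementation (the real class takes the tags from job configuration):

```python
def split_xml_records(text, start_tag, end_tag):
    """Yield every substring of `text` delimited by start_tag..end_tag,
    inclusive -- roughly what an XML-aware input format does per split."""
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return
        end += len(end_tag)
        yield text[start:end]
        pos = end

doc = "<root><rec id='1'>a</rec><rec id='2'>b</rec></root>"
# Each yielded record would become one value in the output SequenceFile.
records = list(split_xml_records(doc, "<rec", "</rec>"))
```

Note that tag matching by plain string search (rather than a full XML parse) is what makes this approach fast, but it assumes the repeated sections are never nested inside one another.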
Re: Error in Formatting NameNode
Manish,

If you read the error message, it says "connection refused". Big clue :-) You probably have a firewall configured.

Raj

Sent from my iPad. Please excuse the typos.

On Feb 12, 2012, at 1:41 AM, Manish Maheshwari wrote:
> Thanks,
>
> I tried with hadoop-1.0.0 and JRE6 and things are looking good. I was able
> to format the namenode and bring up the NameNode 'calvin-PC:47110' and
> Hadoop Map/Reduce Administration web pages.
>
> Further I tried the TestDFSIO example but get the below error of
> connection refused.
> [...]
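"Connection refused" usually just means nothing is listening at that address: the daemon was never started, it is bound to a different port, or a firewall is actively rejecting the connection. A quick, Hadoop-independent way to check from the client machine is a plain TCP connect; a sketch (the host and port are whatever your jobtracker address is, e.g. calvin-PC:8021 here):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unresolvable hosts.
        return False
```

If `port_open("calvin-PC", 8021)` returns False while the jobtracker process is running, look at which interface/port it actually bound to and at firewall rules; if it returns True, the problem is elsewhere.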
Re: Error in Formatting NameNode
Thanks,

I tried with hadoop-1.0.0 and JRE6 and things are looking good. I was able to format the namenode and bring up the NameNode 'calvin-PC:47110' and Hadoop Map/Reduce Administration web pages.

Further I tried the TestDFSIO example but get the below error of connection refused.

-bash-4.1$ cd share/hadoop
-bash-4.1$ ../../bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -write -nrFiles 1 -filesize 10
Warning: $HADOOP_HOME is deprecated.

TestDFSIO.0.0.4
12/02/12 15:05:08 INFO fs.TestDFSIO: nrFiles = 1
12/02/12 15:05:08 INFO fs.TestDFSIO: fileSize (MB) = 1
12/02/12 15:05:08 INFO fs.TestDFSIO: bufferSize = 100
12/02/12 15:05:08 INFO fs.TestDFSIO: creating control file: 1 mega bytes, 1 files
12/02/12 15:05:08 INFO fs.TestDFSIO: created control files for: 1 files
12/02/12 15:05:11 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 0 time(s).
12/02/12 15:05:13 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 1 time(s).
12/02/12 15:05:15 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 2 time(s).
12/02/12 15:05:17 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 3 time(s).
12/02/12 15:05:19 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 4 time(s).
12/02/12 15:05:21 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 5 time(s).
12/02/12 15:05:23 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 6 time(s).
12/02/12 15:05:25 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 7 time(s).
12/02/12 15:05:27 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 8 time(s).
12/02/12 15:05:29 INFO ipc.Client: Retrying connect to server: calvin-PC/127.0.0.1:8021. Already tried 9 time(s).
java.net.ConnectException: Call to calvin-PC/127.0.0.1:8021 failed on connection exception: java.net.ConnectException: Connection refused: no further information
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
    at org.apache.hadoop.ipc.Client.call(Client.java:1071)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
    at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1260)
    at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:257)
    at org.apache.hadoop.fs.TestDFSIO.readTest(TestDFSIO.java:295)
    at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:459)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:317)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
    at org.apache.hadoop.ipc.Client.call(Client.java:1046)
    ... 26 more
-bash-4.1$

Is this a problem with ssh? The ssh daemon is still running.
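One thing worth checking: the client took the address 127.0.0.1:8021 from the mapred.job.tracker property, so it helps to confirm that mapred-site.xml actually points where you expect (mapred.job.tracker is the real Hadoop 1.x key; the configuration contents below are only illustrative). A small sketch that extracts a property from Hadoop-style configuration XML:

```python
import xml.etree.ElementTree as ET

def get_property(conf_xml, name):
    """Extract a named property value from Hadoop-style configuration XML,
    or None if the property is absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Illustrative configuration snippet, not taken from the poster's cluster:
conf = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>"""

print(get_property(conf, "mapred.job.tracker"))  # localhost:8021
```

If the configured address is right but the connection is still refused, the jobtracker process itself (not ssh) is the thing to investigate, since ssh is only used to launch daemons, not for RPC.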
Re: The NCDC Weather Data for Hadoop the Definitive Guide
Andy,

Since there is a lot of data under the site's free-data links, I cannot figure out which dataset is the one discussed in the book. Any format differences might cause the source code to throw exceptions. Some data is even in PDF format!

Thanks so much!

Bing

On Sun, Feb 12, 2012 at 4:35 PM, Andy Doddington wrote:
> According to Page 15 of the book, this data is available from the US
> National Climatic Data Center, at http://www.ncdc.noaa.gov. Once you get
> to this site, there is a menu of links on the left-hand side of the page,
> listed under the heading ‘Data & Products’. I suspect that the entry
> labelled ‘Free Data’ is the most likely area you need to investigate :-)
>
> Good Luck
>
> Andy D
>
> On 12 Feb 2012, at 07:14, Bing Li wrote:
> > Dear all,
> >
> > I am following the book, Hadoop: The Definitive Guide. However, I got
> > stuck because I could not get the NCDC weather data that is used by the
> > source code in the book. Appendix C told me I could follow some
> > instructions at www.hadoopbook.com, but I didn't find the instructions
> > there. Could you give me a hand?
> >
> > Thanks so much!
> >
> > Best regards,
> > Bing
Re: The NCDC Weather Data for Hadoop the Definitive Guide
According to Page 15 of the book, this data is available from the US National Climatic Data Center, at http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of links on the left-hand side of the page, listed under the heading ‘Data & Products’. I suspect that the entry labelled ‘Free Data’ is the most likely area you need to investigate :-)

Good Luck

Andy D

On 12 Feb 2012, at 07:14, Bing Li wrote:
> Dear all,
>
> I am following the book, Hadoop: The Definitive Guide. However, I got
> stuck because I could not get the NCDC weather data that is used by the
> source code in the book. Appendix C told me I could follow some
> instructions at www.hadoopbook.com, but I didn't find the instructions
> there. Could you give me a hand?
>
> Thanks so much!
>
> Best regards,
> Bing
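For telling the datasets apart: the book's MaxTemperature examples expect fixed-width NCDC surface records, one observation per line. A minimal parsing sketch -- the column offsets (year at characters 15-19, signed temperature in tenths of a degree at 87-92, quality flag at 92) are my recollection of what the book uses, so verify them against the NCDC format documentation before relying on them:

```python
def parse_ncdc(line):
    """Pull the year, air temperature (tenths of deg C), and quality flag
    out of one fixed-width NCDC record, using assumed column offsets."""
    year = line[15:19]
    temp = int(line[87:92])   # signed field, e.g. "-0011" or "+0022"
    quality = line[92]
    return year, temp, quality

# A synthetic record padded to the assumed offsets (not real NCDC data):
record = "0" * 15 + "1950" + "0" * 68 + "-0011" + "1"
```

If a candidate dataset's lines don't parse cleanly at these offsets (or arrive as PDF), it is not the format the book's sample code expects.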