Re: Processing small xml files

2012-02-12 Thread Mohit Anchlia
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill wrote:
> I've used the Mahout XMLInputFormat. It is the right tool if you have an
> XML file with one type of section repeated over and over again and want to
> turn that into a sequence file where each repeated section is a value. I've
> found it helpful as a preprocessing step for converting raw XML input into
> something that can be handled by Hadoop jobs.

Thanks for the input.

Do you first convert it into a flat format and then run another Hadoop
job, or do you just read the XML sequence file and then perform the reduce
on that? Is there an advantage to first converting it into a flat file
format?
>
> If you're worried about having lots of small files--specifically, about
> overwhelming your namenode because you have too many small
> files--the XMLInputFormat won't help with that. However, it may be possible
> to concatenate the small files into larger files, then have a Hadoop job
> that uses XMLInputFormat transform the large files into sequence files.

How many are too many for the namenode? We have around 100M files today and
add another 100M files every year.


Re: Processing small xml files

2012-02-12 Thread W.P. McNeill
I've used the Mahout XMLInputFormat. It is the right tool if you have an
XML file with one type of section repeated over and over again and want to
turn that into a sequence file where each repeated section is a value. I've
found it helpful as a preprocessing step for converting raw XML input into
something that can be handled by Hadoop jobs.
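
As a rough illustration, here is a minimal sketch of such a preprocessing job,
assuming Mahout's XmlInputFormat and its "xmlinput.start"/"xmlinput.end"
properties (the package name varies between Mahout releases) and a placeholder
<record> element; adjust the tags, imports, and paths to your data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Adjust this import to wherever XmlInputFormat lives in your Mahout version.
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlToSequenceFile {

  // Identity mapper: the byte offset is the key, the whole <record>...</record>
  // fragment produced by XmlInputFormat is the value.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tags that bracket each repeated section; <record> is a placeholder name.
    conf.set("xmlinput.start", "<record>");
    conf.set("xmlinput.end", "</record>");

    Job job = new Job(conf, "xml-to-seqfile");
    job.setJarByClass(XmlToSequenceFile.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                 // map-only preprocessing step
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}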

If you're worried about having lots of small files--specifically, about
overwhelming your namenode because you have too many small
files--the XMLInputFormat won't help with that. However, it may be possible
to concatenate the small files into larger files, then have a Hadoop job
that uses XMLInputFormat transform the large files into sequence files.
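
Along the same lines, here is a minimal sketch (not from the thread) of the
concatenation step, using the plain FileSystem API to merge every small XML
file under one HDFS directory into a single larger file. Since XmlInputFormat
only scans for the configured start/end tags, the merged file does not need to
be a single well-formed document; paths below are placeholders:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallXmlFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory full of small XML files
    Path merged = new Path(args[1]);     // single large output file

    FSDataOutputStream out = fs.create(merged);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) {              // isDir() is the Hadoop 1.x name
          continue;
        }
        InputStream in = fs.open(stat.getPath());
        try {
          IOUtils.copyBytes(in, out, conf, false);  // false: keep 'out' open
          out.write('\n');               // keep element boundaries apart
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }
}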


Re: Error in Formatting NameNode

2012-02-12 Thread Raj Vishwanathan
Manish

If you read the error message, it says "connection refused". Big clue :-) 

You probably have a firewall configured.

Raj

Sent from my iPad
Please excuse the typos. 
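
As a quick, hedged way to check that diagnosis, the small standalone probe
below (not from this thread; host and port default to the values in the log)
helps separate a port with nothing listening, which usually fails immediately
with "connection refused", from a firewall that silently drops packets, which
usually shows up as a timeout instead:

import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PortProbe {
  public static void main(String[] args) throws Exception {
    // Defaults match the log below; pass a different host/port as arguments.
    String host = args.length > 0 ? args[0] : "127.0.0.1";
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 8021;
    Socket s = new Socket();
    try {
      s.connect(new InetSocketAddress(host, port), 3000);  // 3 second timeout
      System.out.println("Connected: something is listening on " + host + ":" + port);
    } catch (ConnectException e) {
      System.out.println("Refused: no daemon listening (or a firewall is actively rejecting)");
    } catch (SocketTimeoutException e) {
      System.out.println("Timed out: a firewall may be dropping the connection");
    } finally {
      s.close();
    }
  }
}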

On Feb 12, 2012, at 1:41 AM, Manish Maheshwari wrote:

> Thanks,
> 
> I tried with hadoop-1.0.0 and JRE6 and things are looking good. I was able
> to format the namenode and bring up the NameNode 'calvin-PC:47110' and
> Hadoop Map/Reduce Administration webpages.
> 
> Further, I tried the TestDFSIO example but get the connection refused
> error below.
> 
> -bash-4.1$ cd share/hadoop
> -bash-4.1$ ../../bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -write
> -nrFiles 1 -filesize 10
> Warning: $HADOOP_HOME is deprecated.
> 
> TestDFSIO.0.0.4
> 12/02/12 15:05:08 INFO fs.TestDFSIO: nrFiles = 1
> 12/02/12 15:05:08 INFO fs.TestDFSIO: fileSize (MB) = 1
> 12/02/12 15:05:08 INFO fs.TestDFSIO: bufferSize = 100
> 12/02/12 15:05:08 INFO fs.TestDFSIO: creating control file: 1 mega bytes, 1
> files
> 12/02/12 15:05:08 INFO fs.TestDFSIO: created control files for: 1 files
> 12/02/12 15:05:11 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 0 time(s).
> 12/02/12 15:05:13 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 1 time(s).
> 12/02/12 15:05:15 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 2 time(s).
> 12/02/12 15:05:17 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 3 time(s).
> 12/02/12 15:05:19 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 4 time(s).
> 12/02/12 15:05:21 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 5 time(s).
> 12/02/12 15:05:23 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 6 time(s).
> 12/02/12 15:05:25 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 7 time(s).
> 12/02/12 15:05:27 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 8 time(s).
> 12/02/12 15:05:29 INFO ipc.Client: Retrying connect to server: calvin-PC/
> 127.0.0.1:8021. Already tried 9 time(s).
> java.net.ConnectException: Call to calvin-PC/127.0.0.1:8021 failed on
> connection exception: java.net.ConnectException: Connection refused: no
> further information
>at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
>at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown
> Source)
>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>at
> org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
>at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
>at org.apache.hadoop.mapred.JobClient.(JobClient.java:457)
>at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1260)
>at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:257)
>at org.apache.hadoop.fs.TestDFSIO.readTest(TestDFSIO.java:295)
>at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:459)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:317)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.net.ConnectException: Connection refused: no further
> information
>at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434

Re: Error in Formatting NameNode

2012-02-12 Thread Manish Maheshwari
Thanks,

I tried with hadoop-1.0.0 and JRE6 and things are looking good. I was able
to format the namenode and bring up the NameNode 'calvin-PC:47110' and
Hadoop Map/Reduce Administration webpages.

Further, I tried the TestDFSIO example but get the connection refused
error below.

-bash-4.1$ cd share/hadoop
-bash-4.1$ ../../bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -write
-nrFiles 1 -filesize 10
Warning: $HADOOP_HOME is deprecated.

TestDFSIO.0.0.4
12/02/12 15:05:08 INFO fs.TestDFSIO: nrFiles = 1
12/02/12 15:05:08 INFO fs.TestDFSIO: fileSize (MB) = 1
12/02/12 15:05:08 INFO fs.TestDFSIO: bufferSize = 100
12/02/12 15:05:08 INFO fs.TestDFSIO: creating control file: 1 mega bytes, 1
files
12/02/12 15:05:08 INFO fs.TestDFSIO: created control files for: 1 files
12/02/12 15:05:11 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 0 time(s).
12/02/12 15:05:13 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 1 time(s).
12/02/12 15:05:15 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 2 time(s).
12/02/12 15:05:17 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 3 time(s).
12/02/12 15:05:19 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 4 time(s).
12/02/12 15:05:21 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 5 time(s).
12/02/12 15:05:23 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 6 time(s).
12/02/12 15:05:25 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 7 time(s).
12/02/12 15:05:27 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 8 time(s).
12/02/12 15:05:29 INFO ipc.Client: Retrying connect to server: calvin-PC/
127.0.0.1:8021. Already tried 9 time(s).
java.net.ConnectException: Call to calvin-PC/127.0.0.1:8021 failed on
connection exception: java.net.ConnectException: Connection refused: no
further information
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
at org.apache.hadoop.ipc.Client.call(Client.java:1071)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown
Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at
org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
at org.apache.hadoop.mapred.JobClient.(JobClient.java:457)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1260)
at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:257)
at org.apache.hadoop.fs.TestDFSIO.readTest(TestDFSIO.java:295)
at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:459)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:317)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.ConnectException: Connection refused: no further
information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at
org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
at org.apache.hadoop.ipc.Client.call(Client.java:1046)
... 26 more
-bash-4.1$

Is this a problem with ssh? The ssh daemon is stil

Re: The NCDC Weather Data for Hadoop the Definitive Guide

2012-02-12 Thread Bing Li
Andy,

Since there is a lot of data in the Free Data section of the site, I cannot
figure out which dataset is the one discussed in the book. Any format
differences might cause the source code to throw exceptions. Some of the data
is even in PDF format!
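
For what it's worth, that sensitivity to format comes from the fixed-width
parsing the book's example code does. Below is a hedged sketch (not the book's
code, and the column offsets are my assumption about the NCDC record layout it
describes) showing the kind of substring and quality checks that break when a
file is in a different format:

public class NcdcLineSketch {
  private static final int MISSING = 9999;   // NCDC's "no reading" marker

  // Prints "year <tab> temperature (tenths of a degree C)" for one record line.
  public static void main(String[] args) {
    String line = args[0];
    if (line.length() < 93) {
      System.err.println("Unexpected record length: " + line.length());
      return;
    }
    String year = line.substring(15, 19);              // assumed year columns
    int airTemperature = Integer.parseInt(line.substring(88, 92));
    if (line.charAt(87) == '-') {                      // assumed sign column
      airTemperature = -airTemperature;
    }
    String quality = line.substring(92, 93);           // assumed quality flag
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      System.out.println(year + "\t" + airTemperature);
    } else {
      System.out.println(year + "\tmissing or suspect reading");
    }
  }
}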

Thanks so much!
Bing

On Sun, Feb 12, 2012 at 4:35 PM, Andy Doddington wrote:

> According to Page 15 of the book, this data is available from the US
> National Climatic Data Center, at
> http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of
> links on the left-hand side of the
> page, listed under the heading ‘Data & Products’. I suspect that the entry
> labelled ‘Free Data’ is the most
> likely area you need to investigate :-)
>
> Good Luck
>
> Andy D
>
> 
>
> On 12 Feb 2012, at 07:14, Bing Li wrote:
>
> > Dear all,
> >
> > I am following the book, Hadoop: The Definitive Guide. However, I got
> > stuck because I could not get the NCDC weather data that is used by the
> > source code in the book. Appendix C told me I could follow the
> > instructions at www.hadoopbook.com, but I couldn't find the instructions
> > there. Could you give me a hand?
> >
> > Thanks so much!
> >
> > Best regards,
> > Bing
>
>


Re: The NCDC Weather Data for Hadoop the Definitive Guide

2012-02-12 Thread Andy Doddington
According to Page 15 of the book, this data is available from the US National 
Climatic Data Center, at
http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of links 
on the left-hand side of the
page, listed under the heading ‘Data & Products’. I suspect that the entry 
labelled ‘Free Data’ is the most
likely area you need to investigate :-)

Good Luck

Andy D



On 12 Feb 2012, at 07:14, Bing Li wrote:

> Dear all,
> 
> I am following the book, Hadoop: The Definitive Guide. However, I got stuck
> because I could not get the NCDC weather data that is used by the source
> code in the book. Appendix C told me I could follow the instructions at
> www.hadoopbook.com, but I couldn't find the instructions there. Could you
> give me a hand?
> 
> Thanks so much!
> 
> Best regards,
> Bing