namenode null pointer
All, sometimes when I start up my Hadoop cluster I get the following error:

12/02/17 10:29:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = iMac.local/192.168.0.191
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.203.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
12/02/17 10:29:56 WARN impl.MetricsSystemImpl: Metrics system not started: Cannot locate configuration: tried hadoop-metrics2-namenode.properties, hadoop-metrics2.properties
2012-02-17 10:29:56.994 java[4065:1903] Unable to load realm info from SCDynamicStore
12/02/17 10:29:57 INFO util.GSet: VM type       = 64-bit
12/02/17 10:29:57 INFO util.GSet: 2% max memory = 17.77875 MB
12/02/17 10:29:57 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/02/17 10:29:57 INFO util.GSet: recommended=2097152, actual=2097152
12/02/17 10:29:57 INFO namenode.FSNamesystem: fsOwner=scottsue
12/02/17 10:29:57 INFO namenode.FSNamesystem: supergroup=supergroup
12/02/17 10:29:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/02/17 10:29:57 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/02/17 10:29:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/02/17 10:29:57 INFO namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
12/02/17 10:29:57 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/02/17 10:29:57 INFO common.Storage: Number of files = 190
12/02/17 10:29:57 INFO common.Storage: Number of files under construction = 0
12/02/17 10:29:57 INFO common.Storage: Image file of size 26377 loaded in 0 seconds.
12/02/17 10:29:57 ERROR namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
12/02/17 10:29:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at iMac.local/192.168.0.191
************************************************************/
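The NullPointerException is thrown inside FSEditLog.loadFSEdits, i.e. while the NameNode is replaying the edits file on startup, which usually points to a truncated or corrupt edits log rather than a configuration problem. One common recovery path, shown here only as a rough sketch, assumes a recent SecondaryNameNode checkpoint exists; the /path/to/dfs.name.dir paths are placeholders for whatever dfs.name.dir is set to in your hdfs-site.xml.

# stop the cluster and preserve the current (possibly corrupt) name directory
bin/stop-all.sh
mv /path/to/dfs.name.dir /path/to/dfs.name.dir.bak
mkdir /path/to/dfs.name.dir

# -importCheckpoint loads the latest checkpoint from fs.checkpoint.dir;
# it refuses to run if dfs.name.dir already holds an image, hence the move above
bin/hadoop namenode -importCheckpoint

Once the NameNode comes up cleanly you can stop it and restart the cluster the normal way with bin/start-all.sh; keep the .bak directory around until you have verified the namespace.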
better partitioning strategy in hive
Hello All,

We have a Hive table partitioned by date and hour (330 columns). We have 5 years' worth of data in the table. Each hourly partition is around 800 MB, so there are 43,800 partitions in total, with one file per partition. When we run select count(*) on the table, Hive takes forever just to submit the job; I waited 20 minutes and killed it. If I run it over a single month it still takes a little while to submit the job, but at least Hive is able to get the work done.

Questions:
1) First of all, why is Hive not even able to submit the job? Is it taking forever to query the list of partitions from the metastore? Fetching 43K records should not be a big deal at all.
2) So, to improve my situation, what are my options? I can think of changing the partitioning strategy to daily partitions instead of hourly. What would the ideal partitioning strategy be?
3) If we have one partition per day with 24 files under it (i.e. fewer partitions but the same number of files), will that improve anything, or will I have the same issue?
4) Are there any special input formats or tricks to handle this?
5) When I tried to insert into a different table by selecting a whole day's data, Hive generated 164 mappers in a map-only job, hence creating many output files. How can I force Hive to create one output file instead of many? Setting mapred.reduce.tasks=1 does not even generate reduce tasks. What can I do to achieve this? (See the sketch after this message.)

-RK
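On question 5 specifically, two knobs worth trying are Hive's small-file merge settings and adding a DISTRIBUTE BY clause so that a shuffle (and therefore mapred.reduce.tasks=1) actually takes effect. The sketch below is illustrative only: the property names should be checked against your Hive release, and daily_table, hourly_table, ds, dt, col1 and col2 are placeholder names, not yours.

hive -e "
  -- hedged sketch: table, column, and partition names below are placeholders
  -- merge the many small files that map-only inserts produce
  SET hive.merge.mapfiles=true;
  SET hive.merge.mapredfiles=true;
  SET hive.merge.size.per.task=256000000;

  -- DISTRIBUTE BY forces a shuffle, so mapred.reduce.tasks=1 yields a single
  -- reducer and therefore a single output file
  SET mapred.reduce.tasks=1;
  INSERT OVERWRITE TABLE daily_table PARTITION (ds='2012-02-17')
  SELECT col1, col2
  FROM hourly_table
  WHERE dt='2012-02-17'
  DISTRIBUTE BY col1;
"

For questions 1-3, moving to daily partitions cuts the metastore partition count and the number of splits the planner has to enumerate by a factor of 24, which is where most of the job-submission time tends to go with tens of thousands of partitions.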
Where is source code for the Hadoop Eclipse plugin?
Subject says it all really :-)
Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)
Hi Edward,

In the 1/2 trillion record use case that I referred to, data is streaming in from a realtime feed and needs to be online immediately, so bulk loading is not an option. In the 41 billion and 167 billion record insert tests, HBase consistently failed. We tried everything we could think of, but nothing helped. We even reduced the number of load processes on each client machine from 12 down to 1 so that the data was just trickling in, and HBase still failed with *Concurrent Mode Failure*. It appeared that eventually too much maintenance activity built up in the region servers, causing them to fail.

- Doug

On Fri, Feb 17, 2012 at 5:58 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

As your numbers show:

Dataset Size   Hypertable Queries/s   HBase Queries/s   Hypertable Latency (ms)   HBase Latency (ms)
0.5 TB         3256.42                2969.52           157.221                   172.351
5 TB           2450.01                2066.52           208.972                   247.680

Raw data goes up, read performance goes down, latency goes up. You mentioned you loaded 1/2 trillion records of historical financial data. The operative word is historical. You're not doing 1/2 trillion writes every day. Most of the systems that use structured log formats can write very fast (I am guessing that is what Hypertable uses, btw). dd writes very fast as well, but if you want acceptable read latency you are going to need a good RAM/disk ratio. Even at 0.5 TB, 157.221 ms is not a great read latency, so your ability to write fast has already outstripped your ability to read at a rate that could support, say, a web application. (I come from a world of 1-5 ms latency, BTW.) What application can you support with numbers like that? An email compliance system where you want to store a ton of data, but only plan on doing one search a day to make an auditor happy? :) This is why I say you're going to end up needing about the same number of nodes, because when it comes time to read this data, having a machine with 4 TB of data and 24 GB of RAM is not going to cut it.

You are right on a couple of fronts: 1) being able to load data fast is good (can't argue with that); 2) if HBase can't load X entries, that is bad. I really can't imagine that HBase blows up and just stops accepting inserts at some point. You seem to say it's happening and I don't have time to verify. But if you are at the point where you are getting 175 ms random and 85 ms Zipfian latency, what are you proving? That is already more data than a server can handle.

http://en.wikipedia.org/wiki/Network_performance
"Users browsing the Internet feel that responses are instant when delays are less than 100 ms from click to response[11]. Latency and throughput together affect the perceived speed of a connection. However, the perceived performance of a connection can still vary widely, depending in part on the type of information transmitted and how it is used."

On Fri, Feb 17, 2012 at 7:25 PM, Doug Judd d...@hypertable.com wrote:

Hi Edward,

The problem is that even if the workload is 5% write and 95% read, if you can't load the data, you need more machines. In the 167 billion insert test, HBase failed with *Concurrent Mode Failure* after 20% of the data was loaded. One of our customers has loaded 1/2 trillion records of historical financial market data on 16 machines. If you do the back-of-the-envelope calculation, it would take about 180 machines for HBase to load 1/2 trillion cells. That makes HBase 10X more expensive in terms of hardware, power consumption, and data center real estate.

- Doug

On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

I would almost agree with that perspective. But there is a problem with the 'Java is slow' theory. The reason is that in a 100 percent write workload GC might be a factor, but in the real world people have to read data, and reads become disk bound as your data gets larger than memory. Unless C++ can make your disk spin faster than Java, it is a wash. Making the claim that you're going to need more servers for Java/HBase is bogus. To put it in perspective, if the workload is 5% write and 95% read, you are probably going to need just about the same amount of hardware. You might get some win on the read side because your custom caching could be more efficient in terms of object size in memory and other GC issues, but it is not 2 or 3 to one.

If a million writes fall into a Hypertable forest but it takes a billion years to read them back, did the writes ever sync? :)

On Monday, February 13, 2012, Doug Judd d...@hypertable.com wrote:

Hey Todd,

Bulk loading isn't always an option when data is streaming in from a live application. Many big data use cases involve massive amounts of smaller items in the size range of 10-100 bytes, for example URLs, sensor readings, genome sequence reads, network traffic logs, etc. If HBase requires 2-3 times the
Re: Processing small xml files
On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani vas...@gmail.com wrote:

Hi Mohit, you can use Pig for processing XML files. PiggyBank has a built-in load function for loading XML files. You can also set pig.maxCombinedSplitSize and pig.splitCombination for efficient processing.

I can't seem to find examples of how to do XML processing in Pig. Can you please send me some pointers? Basically I need to convert my XML to a more structured format using Hadoop and write it to a database.

On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:

I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this:

<myfile>
  <section>
    <value>value1</value>
  </section>
  <section>
    <value>value2</value>
  </section>
</myfile>

input.xml is a plain text file, not a sequence file. If I read it with the XMLInputFormat, my mapper gets called with (key, value) pairs that look like this:

(…, <section><value>value1</value></section>)
(…, <section><value>value2</value></section>)

where the keys are numerical offsets into the file. I then use this information to write a sequence file with these (key, value) pairs. So my Hadoop job that uses XMLInputFormat takes a text file as input and produces a sequence file as output.

I don't know a rule of thumb for how many small files is too many. Maybe someone else on the list can chime in. I just know that when your throughput gets slow, that's one possible cause to investigate.

I need to install Hadoop. Does this XML input format come as part of the install? Can you please give me some pointers that would help me install Hadoop and XMLInputFormat if necessary?

--
Srinivas
srini...@cloudwick.com
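As a concrete starting point for the PiggyBank route Srinivas mentions, here is a minimal sketch using the input.xml layout shown above. The piggybank.jar path, the input and output locations, and the regex are assumptions you would adapt; it loads each <section> element as one record and extracts the <value> text.

# hedged sketch: adjust the piggybank.jar path and the input/output locations
cat > parse_sections.pig <<'EOF'
REGISTER /path/to/piggybank.jar;

-- XMLLoader('section') hands each <section>...</section> element to the script as one chararray
sections = LOAD 'input.xml'
           USING org.apache.pig.piggybank.storage.XMLLoader('section')
           AS (doc:chararray);

-- pull out the text between <value> and </value>
vals = FOREACH sections GENERATE
       REGEX_EXTRACT(doc, '<value>(.*)</value>', 1) AS value;

STORE vals INTO 'sections_out';
EOF

pig -x local parse_sections.pig

From there the extracted tuples can be stored wherever you need them, for example with a database storage function, instead of the 'sections_out' directory used here.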
Re: Hadoop install
I always use the Cloudera packages, CDH3 I think it's called... but it isn't the latest by any shot; it's still 0.20. I think Hadoop is nearly to 0.23, although I'm not proficient on those kinds of details. I mentioned Cloudera's distribution because it falls into place pretty smoothly. For example, a few weeks ago I downloaded and installed it on a Mac in a few hours (and ran an example), and then installed it on a Tier 3 Linux VM and had it running examples there too.

On Feb 18, 2012, at 06:24, Mohit Anchlia wrote:

What's the best way or guide to install the latest Hadoop? Is the latest Hadoop still 0.20, which is what comes up in a Google search? Could someone point me to the latest Hadoop distribution? I also need Pig and the Mahout XmlInputFormat.

Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow."
-- Keith Wiley
Re: Hadoop install
Thanks. Do I have to do something special to get the Mahout XmlInputFormat and Pig with the new release of Hadoop?

On Sat, Feb 18, 2012 at 6:42 AM, Tom Deutsch tdeut...@us.ibm.com wrote:

Mohit - one place to start is here: http://hadoop.apache.org/common/releases.html#Download
The release notes, as always, are well worth reading.

Tom Deutsch
Program Director, Information Management, Big Data Technologies
IBM
3565 Harbor Blvd, Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com

Mohit Anchlia mohitanch...@gmail.com
02/18/2012 06:24 AM
Please respond to common-user@hadoop.apache.org
To: common-user@hadoop.apache.org
Subject: Hadoop install

What's the best way or guide to install the latest Hadoop? Is the latest Hadoop still 0.20, which is what comes up in a Google search? Could someone point me to the latest Hadoop distribution? I also need Pig and the Mahout XmlInputFormat.
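If you go with the plain Apache tarballs rather than a packaged distribution like CDH, the rough shape of an install looks like the sketch below. The version numbers and mirror paths are illustrative for early 2012; check the releases page Tom linked for current ones. Pig and Mahout (whose examples module carries an XmlInputFormat) are separate downloads, not part of the Hadoop tarball.

# hedged sketch: versions and mirror paths are illustrative, check the release pages
wget http://archive.apache.org/dist/hadoop/common/hadoop-1.0.0/hadoop-1.0.0.tar.gz
tar xzf hadoop-1.0.0.tar.gz
cd hadoop-1.0.0

# pseudo-distributed setup: edit conf/core-site.xml, conf/hdfs-site.xml and
# conf/mapred-site.xml as described in the docs, then format HDFS and start the daemons
bin/hadoop namenode -format
bin/start-all.sh

# Pig and Mahout ship as their own tarballs
wget http://archive.apache.org/dist/pig/pig-0.9.2/pig-0.9.2.tar.gz
wget http://archive.apache.org/dist/mahout/0.6/mahout-distribution-0.6.tar.gz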
Hadoop Opportunity
Dear Group Members,

We have openings in Hadoop/Big Data for 5 - 19 years of experience, from Senior Developer up to heading the Hadoop practice across the globe, with a top IT company in India. The work location may be anywhere in India or globally.

Thanks & Regards,
Maheswaran A | Executive Search & Recruitment
Ikya Human Capital Solutions Pvt. Ltd.
New No 808/22, Old No 407/23, 2nd Floor, G.R Complex Main Building, Anna Salai, Nandanam, Chennai 600035, India
Direct: +91 44 6662 3008 | Mobile: +91 984 034 8330 | www.ikyaglobal.com
Re: Hadoop Opportunity
Hi:

We are looking for someone to help install and support Hadoop clusters. We are in Southern California.

Thanks,
Larry Lesser
PSSC Labs
(949) 380-7288 Tel.
la...@pssclabs.com
20432 North Sea Circle, Lake Forest, CA 92630
Re: Hadoop Opportunity
Could we actually create a separate mailing list for Hadoop-related jobs?

On Sun, Feb 19, 2012 at 11:40 AM, larry la...@pssclabs.com wrote:

Hi:

We are looking for someone to help install and support Hadoop clusters. We are in Southern California.

Thanks,
Larry Lesser
PSSC Labs
(949) 380-7288 Tel.
la...@pssclabs.com
20432 North Sea Circle, Lake Forest, CA 92630

--
Regards,
R.V.