namenode null pointer

2012-02-18 Thread Ben Cuthbert
All, sometimes when I start up Hadoop I get the following error:

12/02/17 10:29:56 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = iMac.local/192.168.0.191
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.203.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
12/02/17 10:29:56 WARN impl.MetricsSystemImpl: Metrics system not started: 
Cannot locate configuration: tried hadoop-metrics2-namenode.properties, 
hadoop-metrics2.properties
2012-02-17 10:29:56.994 java[4065:1903] Unable to load realm info from 
SCDynamicStore
12/02/17 10:29:57 INFO util.GSet: VM type = 64-bit
12/02/17 10:29:57 INFO util.GSet: 2% max memory = 17.77875 MB
12/02/17 10:29:57 INFO util.GSet: capacity = 2^21 = 2097152 entries
12/02/17 10:29:57 INFO util.GSet: recommended=2097152, actual=2097152
12/02/17 10:29:57 INFO namenode.FSNamesystem: fsOwner=scottsue
12/02/17 10:29:57 INFO namenode.FSNamesystem: supergroup=supergroup
12/02/17 10:29:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/02/17 10:29:57 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/02/17 10:29:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false 
accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/02/17 10:29:57 INFO namenode.FSNamesystem: Registered FSNamesystemStateMBean 
and NameNodeMXBean
12/02/17 10:29:57 INFO namenode.NameNode: Caching file names occuring more than 
10 times 
12/02/17 10:29:57 INFO common.Storage: Number of files = 190
12/02/17 10:29:57 INFO common.Storage: Number of files under construction = 0
12/02/17 10:29:57 INFO common.Storage: Image file of size 26377 loaded in 0 
seconds.
12/02/17 10:29:57 ERROR namenode.NameNode: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:353)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:434)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)

12/02/17 10:29:57 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at iMac.local/192.168.0.191
************************************************************/

better partitioning strategy in hive

2012-02-18 Thread rk vishu
Hello All,

We have a Hive table partitioned by date and hour (330 columns). We have 5
years' worth of data in the table. Each hourly partition is around 800 MB, so
there are 43,800 partitions in total, with one file per partition.

When we run select count(*) on the table, Hive takes forever just to submit
the job; I waited 20 minutes and then killed it. If I run it over a single
month it still takes a little while to submit, but at least Hive is able to
get the work done.

Questions:
1) First of all, why is Hive not able even to submit the job? Is it taking
forever to fetch the list of partitions from the metastore? Getting 43K
records should not be a big deal at all.
2) To improve the situation, what are my options? I can think of changing
the partitioning strategy to daily partitions instead of hourly. What would
be the ideal partitioning strategy?
3) If we have one partition per day with 24 files under it (i.e. fewer
partitions but the same number of files), will that improve anything, or will
I have the same issue?
4) Are there any special input formats or tricks to handle this?
5) When I tried to insert into a different table by selecting a whole day's
data, Hive generated 164 mappers in a map-only job, hence creating many
output files. How can I force Hive to create one output file instead of many?
Setting mapred.reduce.tasks=1 does not even generate reduce tasks. What can I
do to achieve this?
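
A rough HiveQL sketch of the daily-partition layout from question 2 and of one
way to force a single output file for question 5; the table and column names
here are hypothetical:

    -- Hypothetical daily-partitioned table (dt like '2012-02-18').
    CREATE TABLE events_daily (
      event_id STRING,
      payload  STRING
      -- ... remaining columns ...
    )
    PARTITIONED BY (dt STRING);

    -- A map-only INSERT ... SELECT ignores mapred.reduce.tasks, which is why
    -- setting it to 1 had no effect. Adding DISTRIBUTE BY (or ORDER BY) forces
    -- a reduce stage, and with a single reducer the whole day lands in one file.
    SET mapred.reduce.tasks=1;
    INSERT OVERWRITE TABLE events_daily PARTITION (dt='2012-02-18')
    SELECT event_id, payload
    FROM events_hourly
    WHERE dt='2012-02-18'
    DISTRIBUTE BY dt;

Hive's merge settings (hive.merge.mapfiles, hive.merge.size.per.task) are
another knob for the many-small-files problem, since they fold small map
outputs together without needing a reduce stage.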


-RK


Where is source code for the Hadoop Eclipse plugin?

2012-02-18 Thread Andy Doddington
Subject says it all really :-)


Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-18 Thread Doug Judd
Hi Edward,

In the 1/2 trillion record use case that I referred to, data is streaming
in from a realtime feed and needs to be online immediately, so bulk loading
is not an option.  In the 41 billion and 167 billion record insert tests,
HBase consistently failed.  We tried everything we could think of, but
nothing helped.  We even reduced the number of load processes on each client
machine from 12 down to 1, so that the data was just trickling in, and HBase
still failed with *Concurrent Mode Failure*.  It appeared that eventually too
much maintenance activity built up in the RegionServers, causing them to fail.

- Doug

On Fri, Feb 17, 2012 at 5:58 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 As your numbers show:

 Dataset Size   Hypertable Queries/s   HBase Queries/s   Hypertable Latency (ms)   HBase Latency (ms)
 0.5 TB         3256.42                2969.52           157.221                   172.351
 5 TB           2450.01                2066.52           208.972                   247.680

 Raw data goes up. Read performance goes down. Latency goes up.

 You mentioned you loaded 1/2 trillion records of historical financial
 data. The operative word is historical. You're not doing 1/2 trillion
 writes every day.

 Most of the systems that use structured log formats can write very fast
 (I am guessing that is what Hypertable uses, btw). DD writes very fast
 as well, but if you want acceptable read latency you are going to need
 a good RAM/disk ratio.

 Even at 0.5 TB, 157.221 ms is not a great read latency, so your ability
 to write fast has already outstripped your ability to read at a rate
 that could support, say, a web application (I come from a world of 1-5 ms
 latency, BTW).

 What application can you support with numbers like that? An email
 compliance system where you want to store a ton of data, but only plan
 on doing one search a day to make an auditor happy? :) This is why I say
 you're going to end up needing about the same number of nodes, because when
 it comes time to read this data, having a machine with 4 TB of data and 24
 GB of RAM is not going to cut it.

 You are right on a couple of fronts:
 1) being able to load data fast is good (can't argue with that)
 2) if HBase can't load X entries, that is bad

 I really can't imagine that HBase blows up and just stops accepting
 inserts at some point. You seem to say it's happening and I don't have
 time to verify. But if you are at the point where you are getting
 175 ms random and 85 ms Zipfian latency, what are you proving? That is
 already more data than a server can handle.

 http://en.wikipedia.org/wiki/Network_performance

 Users browsing the Internet feel that responses are instant when
 delays are less than 100 ms from click to response[11]. Latency and
 throughput together affect the perceived speed of a connection.
 However, the perceived performance of a connection can still vary
 widely, depending in part on the type of information transmitted and
 how it is used.



 On Fri, Feb 17, 2012 at 7:25 PM, Doug Judd d...@hypertable.com wrote:
  Hi Edward,
 
   The problem is that even if the workload is 5% write and 95% read, if you
   can't load the data, you need more machines.  In the 167 billion insert
   test, HBase failed with *Concurrent mode failure* after 20% of the data was
   loaded.  One of our customers has loaded 1/2 trillion records of historical
   financial market data on 16 machines.  If you do the back-of-the-envelope
   calculation, it would take about 180 machines for HBase to load 1/2
   trillion cells.  That makes HBase 10X more expensive in terms of hardware,
   power consumption, and data center real estate.
 
  - Doug
 
  On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
  I would almost agree with that perspective. But there is a problem with the
  'Java is slow' theory. The reason is that in a 100 percent write workload GC
  might be a factor.

  But in the real world people have to read data, and reads become disk bound
  as your data gets larger than memory.

  Unless C++ can make your disk spin faster than Java, it is a wash. Making a
  claim that you're going to need more servers for Java/HBase is bogus. To put
  it in perspective, if the workload is 5% write and 95% read you are
  probably going to need just about the same amount of hardware.

  You might get some win on the read side because your custom caching could
  be more efficient in terms of object size in memory and other GC issues but
  it is not 2 or 3 to one.
 
  If a million writes fall into a Hypertable forest but it takes a billion
  years to read them back, did the writes ever sync? :)
 
 
  On Monday, February 13, 2012, Doug Judd d...@hypertable.com wrote:
   Hey Todd,
  
   Bulk loading isn't always an option when data is streaming in from a live
   application.  Many big data use cases involve massive amounts of smaller
   items in the size range of 10-100 bytes, for example URLs, sensor readings,
   genome sequence reads, network traffic logs, etc.  If HBase requires 2-3
   times the 

Re: Processing small xml files

2012-02-18 Thread Mohit Anchlia
On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani vas...@gmail.com wrote:

 Hi Mohit,

 You can use Pig for processing XML files. PiggyBank has a built-in load
 function for loading XML files.
 Also you can specify pig.maxCombinedSplitSize and
 pig.splitCombination for efficient processing.


I can't seem to find examples of how to do XML processing in Pig. Can you
please send me some pointers? Basically I need to convert my XML to a more
structured format using Hadoop and write it to a database.
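
A rough Pig sketch of the XMLLoader approach Srinivas describes; the paths, tag
names, and regex below are hypothetical and assume the piggybank jar that ships
with the Pig distribution:

    -- Hypothetical path to the piggybank jar.
    REGISTER /path/to/piggybank.jar;

    -- Combine many small XML files into fewer splits (where the loader allows it).
    SET pig.splitCombination true;
    SET pig.maxCombinedSplitSize 134217728;  -- 128 MB; tune as needed

    -- XMLLoader('section') emits one <section>...</section> chunk per record.
    docs = LOAD '/data/xml/*.xml'
           USING org.apache.pig.piggybank.storage.XMLLoader('section')
           AS (doc:chararray);

    -- Pull the field out of each chunk; adjust the regex to the real schema.
    records = FOREACH docs GENERATE
              REGEX_EXTRACT(doc, '<value>(.*)</value>', 1) AS value;

    STORE records INTO '/data/structured' USING PigStorage('\t');

The tab-delimited output can then be bulk-loaded into the database with
whatever loader the database provides.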


 On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com
 wrote:
 
  I'm not sure what you mean by flat format here.
 
  In my scenario, I have a file input.xml that looks like this:

  <myfile>
    <section>
      <value>1</value>
    </section>
    <section>
      <value>2</value>
    </section>
  </myfile>
 
  input.xml is a plain text file, not a sequence file. If I read it with the
  XMLInputFormat, my mapper gets called with (key, value) pairs that look like
  this:

  (key1, <section><value>1</value></section>)
  (key2, <section><value>2</value></section>)
 
  Where the keys are numerical offsets into the file. I then use this
  information to write a sequence file with these (key, value) pairs. So
 my
  Hadoop job that uses XMLInputFormat takes a text file as input and
 produces
  a sequence file as output.
 
  I don't know a rule of thumb for how many small files is too many. Maybe
  someone else on the list can chime in. I just know that when your
  throughput gets slow that's one possible cause to investigate.
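
A bare-bones driver along the lines of the job described above: a map-only job
that reads XML chunks and writes them straight to a sequence file. This is only
a sketch; it assumes Mahout's XmlInputFormat, whose package path and
record-delimiter configuration keys have varied between Mahout versions.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    // Assumed location of Mahout's XmlInputFormat; the package differs across versions.
    import org.apache.mahout.classifier.bayes.XmlInputFormat;

    public class XmlToSequenceFile {

      // Pass-through mapper: each (offset, xml-chunk) record from the input
      // format is written out unchanged, so the job's output is a sequence file
      // of the same (key, value) pairs.
      public static class PassThroughMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tags that delimit one record (keys assumed from Mahout's XmlInputFormat).
        conf.set("xmlinput.start", "<section>");
        conf.set("xmlinput.end", "</section>");

        Job job = new Job(conf, "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setInputFormatClass(XmlInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }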
 
 
  I need to install Hadoop. Does this XML input format come as part of the
  install? Can you please give me some pointers that would help me install
  Hadoop, and XMLInputFormat if necessary?



 --
 -- Srinivas
 srini...@cloudwick.com



Re: Hadoop install

2012-02-18 Thread Keith Wiley
I always use the Cloudera packages, CDH3 I think it's called... but it isn't the
latest by a long shot.  It's still 0.20.  I think Hadoop is nearly to 0.23,
although I'm not proficient on those kinds of details.  I mentioned Cloudera's
distribution because it falls into place pretty smoothly.  For example, a few
weeks ago I downloaded and installed it on a Mac in a few hours (and ran an
example) and then installed it on a Tier 3 Linux VM and had it running examples
there too.

On Feb 18, 2012, at 06:24 , Mohit Anchlia wrote:

 What's the best way or guide to install the latest Hadoop? Is the latest Hadoop
 still 0.20, which is what comes up in a Google search? Could someone point me to
 the latest Hadoop distribution. I also need Pig and the Mahout XmlInputFormat.



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow.
   --  Keith Wiley




Re: Hadoop install

2012-02-18 Thread Mohit Anchlia
Thanks. Do I have to do something special to get the Mahout XmlInputFormat and
Pig with the new release of Hadoop?

On Sat, Feb 18, 2012 at 6:42 AM, Tom Deutsch tdeut...@us.ibm.com wrote:

 Mohit - one place to start is here:

 http://hadoop.apache.org/common/releases.html#Download

 The release notes, as always, are well worth reading.

 
 Tom Deutsch
 Program Director
 Information Management
 Big Data Technologies
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com




 Mohit Anchlia mohitanch...@gmail.com
 02/18/2012 06:24 AM
 Please respond to
 common-user@hadoop.apache.org


 To
 common-user@hadoop.apache.org
 cc

 Subject
 Hadoop install

 What's the best way or guide to install the latest Hadoop? Is the latest Hadoop
 still 0.20, which is what comes up in a Google search? Could someone point me to
 the latest Hadoop distribution. I also need Pig and the Mahout XmlInputFormat.




Hadoop Opportunity

2012-02-18 Thread maheswaran
Dear Group Members,

 

We have openings in Hadoop/Big Data for 5-19 years of experience, from
Senior Developer up to heading the Hadoop Practice across the globe, with a
top IT company in India.

 

Work location may be anywhere in India or globally.

 

Thanks & Regards,

Maheswaran A | Executive Search & Recruitment


Ikya Human Capital Solutions Pvt. Ltd.

New No 808/22, Old No 407/23, 2nd Floor, 

G.R Complex Main Building, 

Anna Salai, Nandanam,

Chennai 600035, India

Direct: +91 44 6662 3008 | Mobile: +91 984 034 8330 | www.ikyaglobal.com

 



Re: Hadoop Opportunity

2012-02-18 Thread larry

Hi:

We are looking for someone to help install and support hadoop 
clusters.  We are in Southern California.


Thanks,

Larry Lesser
PSSC Labs
(949) 380-7288 Tel.
la...@pssclabs.com
20432 North Sea Circle
Lake Forest, CA 92630



Re: Hadoop Opportunity

2012-02-18 Thread real great..
Could we actually create a separate mailing list for Hadoop-related jobs?

On Sun, Feb 19, 2012 at 11:40 AM, larry la...@pssclabs.com wrote:

 Hi:

 We are looking for someone to help install and support hadoop clusters.
  We are in Southern California.

 Thanks,

 Larry Lesser
 PSSC Labs
 (949) 380-7288 Tel.
 la...@pssclabs.com
 20432 North Sea Circle
 Lake Forest, CA 92630




-- 
Regards,
R.V.