mrbench fails when run on a pseudo-distributed cluster

2012-02-17 Thread Rock Ice
Hi guys,

Executed command: bin/hadoop jar build/hadoop-0.20.3-dev-test.jar mrbench -numRuns 5

Error message:

[hadoop@localhost hadoop]$ bin/hadoop jar build/hadoop-0.20.3-dev-test.jar mrbench -numRuns 5
MRBenchmark.0.0.2
12/02/17 16:45:11 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
12/02/17 16:45:11 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-650014558.txt
java.lang.NullPointerException
    at java.util.Hashtable.put(Hashtable.java:394)
    at java.util.Properties.setProperty(Properties.java:143)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:404)
    at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:243)
    at org.apache.hadoop.mapred.MRBench.runJobInSequence(MRBench.java:177)
    at org.apache.hadoop.mapred.MRBench.main(MRBench.java:280)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:81)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
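
For reference, the trace itself points at the cause: JobConf.setJar() passes the jar path to Configuration.set(), which stores values in a java.util.Properties table, and Properties/Hashtable reject null values. The sketch below (my own illustration, not the MRBench source) reproduces the same failure, which suggests MRBench ended up calling setJar() with a null jar path, for example because the job jar could not be located.

// Minimal illustration (not the MRBench source) of the failure mode above:
// Configuration.set() stores properties in a java.util.Properties table,
// which throws NullPointerException for null values, so setJar(null) fails
// exactly like the trace shows.
import org.apache.hadoop.mapred.JobConf;

public class SetJarNpe {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setJar(null); // NullPointerException from Hashtable.put, as in the trace above
    }
}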



Hadoop Example in java

2012-02-17 Thread vikas jain

Hi All,

I am looking for examples in Java for Hadoop. I have done a lot of searching, but
I have only found word count. Are there any other examples?
-- 
View this message in context: 
http://old.nabble.com/Hadoop-Example-in-java-tp33341353p33341353.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Hadoop Example in java

2012-02-17 Thread Harsh J
For more framework-provided examples, also take a look at the
src/examples directory of your downloaded distribution.

I also suggest getting 'Hadoop: The Definitive Guide' by Tom White
(O'Reilly) to get started with; it carries examples and all the other
information useful for using, deploying, and developing with Apache Hadoop.

On Fri, Feb 17, 2012 at 2:30 PM, vikas jain mailvkj...@gmail.com wrote:

 Hi All,

 I am looking for example in java for hadoop. I have done lots of search but
 I have only found word count. Are there any other exapmple for the same.
 --
 View this message in context: 
 http://old.nabble.com/Hadoop-Example-in-java-tp33341353p33341353.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Building Hadoop UI

2012-02-17 Thread fabio . pitzolu
Hello everyone,

In order to provide our clients with a custom UI for their MapReduce jobs and HDFS
files, what is the best solution for creating a web-based UI for Hadoop?
We are not going to use Cloudera HUE; we need something more user-friendly and
shaped to our clients' needs.

Thanks,

Fabio Pitzolu

Re: Building Hadoop UI

2012-02-17 Thread Jamack, Peter
You could use something like Spring, but you'll need to figure out ways to
connect and integrate it, and it'll be a homegrown solution.

Peter Jamack

On 2/17/12 5:52 AM, fabio.pitz...@gmail.com fabio.pitz...@gmail.com
wrote:

Hello everyone,

in order to provide our clients a custom UI for their MapReduce jobs and
HDFS files, what is the best solution to create a web-based UI for Hadoop?
We are not going to use Cloudera HUE, we need something more
user-friendly and shaped for our clients needs.

Thanks,

Fabio Pitzolu
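
Whichever web framework is chosen, the data such a UI would display can be pulled through Hadoop's own client APIs. Below is a rough sketch (mine, not something posted in this thread) that uses the JobClient API shipped with Hadoop 0.20.x to list jobs and their progress; the Spring or servlet wiring around it is left out.

// Hedged sketch: list MapReduce jobs and their progress via JobClient.
// A custom web UI backend could expose this data through its own controllers.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class JobListing {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();           // picks up mapred-site.xml for the JobTracker address
        JobClient client = new JobClient(conf);
        for (JobStatus status : client.getAllJobs()) {
            System.out.printf("%s  map %.0f%%  reduce %.0f%%%n",
                    status.getJobID(),
                    status.mapProgress() * 100,
                    status.reduceProgress() * 100);
        }
    }
}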



Re: Building Hadoop UI

2012-02-17 Thread fabio . pitzolu
Thanks Peter,

I'll give Spring a try, also because there is a .NET version of this
framework, which is basically what my company is looking for.


Fabio Pitzolu
fabio.pitz...@gmail.com

Il giorno 17/feb/2012, alle ore 15:31, Jamack, Peter ha scritto:

 You could use something like Spring, but you'll need to figure ways to
 connect and integrate and it'll be a homegrown solution.
 
 Peter Jamack
 
 On 2/17/12 5:52 AM, fabio.pitz...@gmail.com fabio.pitz...@gmail.com
 wrote:
 
 Hello everyone,
 
 in order to provide our clients a custom UI for their MapReduce jobs and
 HDFS files, what is the best solution to create a web-based UI for Hadoop?
 We are not going to use Cloudera HUE, we need something more
 user-friendly and shaped for our clients needs.
 
 Thanks,
 
 Fabio Pitzolu
 



Re: Building Hadoop UI

2012-02-17 Thread neil . harwani
At our company we have integrated Liferay, an open source Java portal, with
HBase using its Java API, by creating an extension to Liferay's document library
portlet to show content. We index content separately on a Solr index server,
where you can search, and that content (images, docs, etc.) is pulled from
HBase on Hadoop.

Regards,
Neil


Sent on my BlackBerry® from Vodafone

-Original Message-
From: fabio.pitz...@gmail.com
Date: Fri, 17 Feb 2012 17:02:57 
To: common-user@hadoop.apache.org
Reply-To: common-user@hadoop.apache.org
Subject: Re: Building Hadoop UI

Thanks Peter,

I'll give Spring a try, also because there is also a .NET version of this 
framework, which is basically what my company is looking for.


Fabio Pitzolu
fabio.pitz...@gmail.com

Il giorno 17/feb/2012, alle ore 15:31, Jamack, Peter ha scritto:

 You could use something like Spring, but you'll need to figure ways to
 connect and integrate and it'll be a homegrown solution.
 
 Peter Jamack
 
 On 2/17/12 5:52 AM, fabio.pitz...@gmail.com fabio.pitz...@gmail.com
 wrote:
 
 Hello everyone,
 
 in order to provide our clients a custom UI for their MapReduce jobs and
 HDFS files, what is the best solution to create a web-based UI for Hadoop?
 We are not going to use Cloudera HUE, we need something more
 user-friendly and shaped for our clients needs.
 
 Thanks,
 
 Fabio Pitzolu
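
For anyone curious what the HBase side of an integration like Neil's looks like, here is a hedged sketch using the standard HBase client API. The table name, column family, and qualifier are made up for illustration and are not taken from Neil's setup.

// Hedged sketch: fetch a stored document from HBase by row key.
// "documents", "content", and "data" are illustrative names only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DocumentFetcher {
    public static byte[] fetch(String rowKey) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "documents");
        try {
            Get get = new Get(Bytes.toBytes(rowKey));
            Result result = table.get(get);
            return result.getValue(Bytes.toBytes("content"), Bytes.toBytes("data"));
        } finally {
            table.close();
        }
    }
}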
 




Re: Building Hadoop UI

2012-02-17 Thread Fabio Pitzolu
Thanks Neil, I'll get some documentation for that.
Can we maybe get in touch so you can explain that integration to me in a
little more detail?

Fabio

2012/2/17 neil.harw...@gmail.com

 We in our company have integrated Liferay which is an open source java
 portal with hbase using its java api by creating an extension to its
 document library portlet to show content. We are indexing content
 separately on a solr index server where you can search and that content
 (image, docs, etc.) Is pulled from hbase hadoop.

 Regards,
 Neil


 Sent on my BlackBerry® from Vodafone

 -Original Message-
 From: fabio.pitz...@gmail.com
 Date: Fri, 17 Feb 2012 17:02:57
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: Building Hadoop UI

 Thanks Peter,

 I'll give Spring a try, also because there is also a .NET version of this
 framework, which is basically what my company is looking for.


 Fabio Pitzolu
 fabio.pitz...@gmail.com

 Il giorno 17/feb/2012, alle ore 15:31, Jamack, Peter ha scritto:

  You could use something like Spring, but you'll need to figure ways to
  connect and integrate and it'll be a homegrown solution.
 
  Peter Jamack
 
  On 2/17/12 5:52 AM, fabio.pitz...@gmail.com fabio.pitz...@gmail.com
  wrote:
 
  Hello everyone,
 
  in order to provide our clients a custom UI for their MapReduce jobs and
  HDFS files, what is the best solution to create a web-based UI for
 Hadoop?
  We are not going to use Cloudera HUE, we need something more
  user-friendly and shaped for our clients needs.
 
  Thanks,
 
  Fabio Pitzolu
 





Re: Hadoop Example in java

2012-02-17 Thread Owen O'Malley
On Fri, Feb 17, 2012 at 1:00 AM, vikas jain mailvkj...@gmail.com wrote:

 Hi All,

 I am looking for example in java for hadoop. I have done lots of search but
 I have only found word count. Are there any other exapmple for the same.

If you want to find them on the web, you can look in subversion:

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/

-- Owen
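
As one concrete starting point beyond word count, here is a minimal sketch of my own (not one of the examples from the directory linked above) that counts the total number of input lines with the org.apache.hadoop.mapreduce API.

// Minimal sketch: count the total number of input lines.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount {
    public static class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text KEY = new Text("lines");
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(KEY, ONE);   // one count per input line
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "line count");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}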


Re: location of HDFS directory on my Local system

2012-02-17 Thread Rohit
Hi Sujit, 

I'd recommend you look at this tutorial on interacting with HDFS:
http://developer.yahoo.com/hadoop/tutorial/module2.html#interacting

When you create a folder '/foodir' in HDFS, that folder will be in the top 
level of HDFS. If you were to create a directory 'foodir' (without the '/'), 
that directory would go into your user's folder on HDFS - 
'/user/username/foodir'.

You can view HDFS directory listings via:
~bin/hadoop dfs -lsr /


Rohit Bakhshi



www.hortonworks.com (http://www.hortonworks.com/)





On Friday, February 17, 2012 at 12:18 PM, Sujit Dhamale wrote:

 Hi,i am new to Hadoop
 
 i install Hadoop cluster on my Ubuntu
 
 using Below command i created one folder *foodir*
 *~bin/hadoop dfs -mkdir /foodir*
 
 in my file system , what will be location of folder *foodir* ?? 
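
To see the same distinction programmatically, here is a small sketch of mine (not taken from the tutorial Rohit linked): the FileSystem API resolves "/foodir" against the root of HDFS and a bare "foodir" against the user's home directory.

// Sketch: absolute vs. relative paths with the HDFS FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FoodirDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses fs.default.name
        fs.mkdirs(new Path("/foodir"));  // top level of HDFS: /foodir
        fs.mkdirs(new Path("foodir"));   // relative: ends up under /user/<username>/foodir
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}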



Re: location of HDFS directory on my Local system

2012-02-17 Thread Harsh J
To add onto Rohit's great response on HDFS interaction, and to clear
your confusion here, HDFS does not exactly make itself available as a
physical filesystem like say ext3/4 mount points. The usual way of
interacting with them is by communicating with the services you run,
via RPC calls (over a network).

That said, you can still use the FUSE-DFS connector from HDFS, to
mount HDFS as a local file system, and run most of your regular *nix
programs (ls, cat, etc.) on its files after it has been mounted. To
mount HDFS via fuse-dfs drivers, follow
http://wiki.apache.org/hadoop/MountableHDFS

On Sat, Feb 18, 2012 at 2:06 AM, Rohit ro...@hortonworks.com wrote:
 Hi Sujit,

 I'd recommend you look at this tutorial on interacting with HDFS:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#interacting

 When you create a folder '/foodir' in HDFS, that folder will be in the top 
 level of HDFS. If you were to create a directory 'foodir' (without the '/'), 
 that directory would go into your user's folder on HDFS - 
 '/user/username/foodir.

 You can view HDFS directory listings via:
 ~bin/hadoop dfs -lsr /


 Rohit Bakhshi



 www.hortonworks.com (http://www.hortonworks.com/)





 On Friday, February 17, 2012 at 12:18 PM, Sujit Dhamale wrote:

 Hi,i am new to Hadoop

 i install Hadoop cluster on my Ubuntu

 using Below command i created one folder *foodir*
 *~bin/hadoop dfs -mkdir /foodir*

 in my file system , what will be location of folder *foodir* ??




-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about


Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-17 Thread Edward Capriolo
I would almost agree with that perspective. But there is a problem with the
'Java is slow' theory. The reason is that in a 100 percent write workload GC
might be a factor.

But in the real world people have to read data, and reads become disk bound
as your data gets larger than memory.

Unless C++ can make your disk spin faster than Java, it is a wash. Making a
claim that you're going to need more servers for Java/HBase is bogus. To put
it in perspective, if the workload is 5% write and 95% read, you are
probably going to need just about the same amount of hardware.

You might get some win on the read side because your custom caching could
be more efficient in terms of object size in memory and other GC issues, but
it is not 2 or 3 to one.

If a million writes fall into a Hypertable forest but it takes a billion
years to read them back, did the writes ever sync? :)


On Monday, February 13, 2012, Doug Judd d...@hypertable.com wrote:
 Hey Todd,

 Bulk loading isn't always an option when data is streaming in from a live
 application.  Many big data use cases involve massive amounts of smaller
 items in the size range of 10-100 bytes, for example URLs, sensor
readings,
 genome sequence reads, network traffic logs, etc.  If HBase requires 2-3
 times the amount of hardware to avoid *Concurrent mode failures*, then
that
 makes HBase 2-3 times more expensive from the standpoint of hardware,
power
 consumption, and datacenter real estate.

 What takes the most time is getting the core database mechanics right
 (we're going on 5 years now).  Once the core database is stable,
 integration with applications such as Solr and others are short term
 projects.  I believe that sooner or later, most engineers working in this
 space will come to the conclusion that Java is the wrong language for this
 kind of database application.  At that point, folks on the HBase project
 will realize that they are five years behind.

 - Doug

 On Mon, Feb 13, 2012 at 11:33 AM, Todd Lipcon t...@cloudera.com wrote:

 Hey Doug,

 Want to also run a comparison test with inter-cluster replication
 turned on? How about kerberos-based security on secure HDFS? How about
 ACLs or other table permissions even without strong authentication?
 Can you run a test comparing performance running on top of Hadoop
 0.23? How about running other ecosystem products like Solbase,
 Havrobase, and Lily, or commercial products like Digital Reasoning's
 Synthesys, etc?

 For those unfamiliar, the answer to all of the above is that those
 comparisons can't be run because Hypertable is years behind HBase in
 terms of features, adoption, etc. They've found a set of benchmarks
 they win at, but bulk loading either database through the put API is
 the wrong way to go about it anyway. Anyone loading 5T of data like
 this would use the bulk load APIs which are one to two orders of
 magnitude more efficient. Just ask the Yahoo crawl cache team, who has
 ~1PB stored in HBase, or Facebook, or eBay, or many others who store
 hundreds to thousands of TBs in HBase today.

 Thanks,
 -Todd

 On Mon, Feb 13, 2012 at 9:07 AM, Doug Judd d...@hypertable.com wrote:
  In our original test, we mistakenly ran the HBase test with
  the hbase.hregion.memstore.mslab.enabled property set to false.  We
 re-ran
  the test with the hbase.hregion.memstore.mslab.enabled property set to
 true
  and have reported the results in the following addendum:
 
  Addendum to Hypertable vs. HBase Performance
  Test
 http://www.hypertable.com/why_hypertable/hypertable_vs_hbase_2/addendum/
 
  Synopsis: It slowed performance on the 10KB and 1KB tests and still
 failed
  the 100 byte and 10 byte tests with *Concurrent mode failure*
 
  - Doug



 --
 Todd Lipcon
 Software Engineer, Cloudera




Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-17 Thread Doug Judd
Hi Edward,

The problem is that even if the workload is 5% write and 95% read, if you
can't load the data, you need more machines.  In the 167 billion insert
test, HBase failed with *Concurrent mode failure* after 20% of the data was
loaded.  One of our customers has loaded 1/2 trillion records of historical
financial market data on 16 machines.  If you do the back-of-the-envelope
calculation, it would take about 180 machines for HBase to load 1/2
trillion cells.  That makes HBase 10X more expensive in terms of hardware,
power consumption, and data center real estate.

- Doug

On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I would almost agree with prospective. But their is a problem with 'java is
 slow' theory. The reason is that in a 100 percent write workload gc might
 be a factor.

 But in the real world people have to read data and read becomes disk bound
 as your data gets larger then memory.

 Unless C++ can make your disk spin faster then java It is a wash. Making a
 claim that your going to need more servers for java/hbase is bogus. To put
 it in prospective, if the workload is 5 % write and 95 % read you are
 probably going to need just the same amount of hardware.

 You might get some win on the read size because your custom caching could
 be more efficient in terms of object size in memory and other gc issues but
 it is not 2 or 3 to one.

 If a million writes fall into a hypertable forest but it take a billion
 years to read them back did the writes ever sync :)


 On Monday, February 13, 2012, Doug Judd d...@hypertable.com wrote:
  Hey Todd,
 
  Bulk loading isn't always an option when data is streaming in from a live
  application.  Many big data use cases involve massive amounts of smaller
  items in the size range of 10-100 bytes, for example URLs, sensor
 readings,
  genome sequence reads, network traffic logs, etc.  If HBase requires 2-3
  times the amount of hardware to avoid *Concurrent mode failures*, then
 that
  makes HBase 2-3 times more expensive from the standpoint of hardware,
 power
  consumption, and datacenter real estate.
 
  What takes the most time is getting the core database mechanics right
  (we're going on 5 years now).  Once the core database is stable,
  integration with applications such as Solr and others are short term
  projects.  I believe that sooner or later, most engineers working in this
  space will come to the conclusion that Java is the wrong language for
 this
  kind of database application.  At that point, folks on the HBase project
  will realize that they are five years behind.
 
  - Doug
 
  On Mon, Feb 13, 2012 at 11:33 AM, Todd Lipcon t...@cloudera.com wrote:
 
  Hey Doug,
 
  Want to also run a comparison test with inter-cluster replication
  turned on? How about kerberos-based security on secure HDFS? How about
  ACLs or other table permissions even without strong authentication?
  Can you run a test comparing performance running on top of Hadoop
  0.23? How about running other ecosystem products like Solbase,
  Havrobase, and Lily, or commercial products like Digital Reasoning's
  Synthesys, etc?
 
  For those unfamiliar, the answer to all of the above is that those
  comparisons can't be run because Hypertable is years behind HBase in
  terms of features, adoption, etc. They've found a set of benchmarks
  they win at, but bulk loading either database through the put API is
  the wrong way to go about it anyway. Anyone loading 5T of data like
  this would use the bulk load APIs which are one to two orders of
  magnitude more efficient. Just ask the Yahoo crawl cache team, who has
  ~1PB stored in HBase, or Facebook, or eBay, or many others who store
  hundreds to thousands of TBs in HBase today.
 
  Thanks,
  -Todd
 
  On Mon, Feb 13, 2012 at 9:07 AM, Doug Judd d...@hypertable.com wrote:
   In our original test, we mistakenly ran the HBase test with
   the hbase.hregion.memstore.mslab.enabled property set to false.  We
  re-ran
   the test with the hbase.hregion.memstore.mslab.enabled property set to
  true
   and have reported the results in the following addendum:
  
   Addendum to Hypertable vs. HBase Performance
   Test
 
 http://www.hypertable.com/why_hypertable/hypertable_vs_hbase_2/addendum/
  
   Synopsis: It slowed performance on the 10KB and 1KB tests and still
  failed
   the 100 byte and 10 byte tests with *Concurrent mode failure*
  
   - Doug
 
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera
 
 



Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-17 Thread Edward Capriolo
As your numbers show.

Dataset Size   Hypertable Queries/s   HBase Queries/s   Hypertable Latency (ms)   HBase Latency (ms)
0.5 TB         3256.42                2969.52           157.221                   172.351
5 TB           2450.01                2066.52           208.972                   247.680

Raw data goes up. Read performance goes down. Latency goes up.

You mentioned you loaded 1/2 trillion records of historical financial
data. The operative word is historical. You're not doing 1/2 trillion
writes every day.

Most of the systems that use structured log formats can write very fast
(I am guessing that is what Hypertable uses, btw). dd writes very fast
as well, but if you want acceptable read latency you are going to need
a good RAM/disk ratio.

Even at 0.5 TB, 157.221 ms is not a great read latency, so your ability
to write fast has already outstripped your ability to read at a rate
that could support, say, a web application. (I come from a world of 1-5 ms
latency, BTW.)

What application can you support with numbers like that? An email
compliance system where you want to store a ton of data, but only plan
on doing one search a day to make an auditor happy? :) This is why I say
you're going to end up needing about the same number of nodes: when it
comes time to read this data, having a machine with 4 TB of data and 24
GB of RAM is not going to cut it.

You are right on a couple of fronts:
1) being able to load data fast is good (can't argue with that)
2) if HBase can't load X entries, that is bad

I really can't imagine that HBase blows up and just stops accepting
inserts at some point. You seem to say it's happening, and I don't have
time to verify. But if you are at the point where you are getting
175 ms random and 85 ms Zipfian latency, what are you proving? That is
already more data than a server can handle.

http://en.wikipedia.org/wiki/Network_performance

Users browsing the Internet feel that responses are instant when
delays are less than 100 ms from click to response[11]. Latency and
throughput together affect the perceived speed of a connection.
However, the perceived performance of a connection can still vary
widely, depending in part on the type of information transmitted and
how it is used.



On Fri, Feb 17, 2012 at 7:25 PM, Doug Judd d...@hypertable.com wrote:
 Hi Edward,

 The problem is that even if the workload is 5% write and 95% read, if you
 can't load the data, you need more machines.  In the 167 billion insert
 test, HBase failed with *Concurrent mode failure* after 20% of the data was
 loaded.  One of our customers has loaded 1/2 trillion records of historical
 financial market data on 16 machines.  If you do the back-of-the-envelope
 calculation, it would take about 180 machines for HBase to load 1/2
 trillion cells.  That makes HBase 10X more expensive in terms of hardware,
 power consumption, and data center real estate.

 - Doug

 On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I would almost agree with prospective. But their is a problem with 'java is
 slow' theory. The reason is that in a 100 percent write workload gc might
 be a factor.

 But in the real world people have to read data and read becomes disk bound
 as your data gets larger then memory.

 Unless C++ can make your disk spin faster then java It is a wash. Making a
 claim that your going to need more servers for java/hbase is bogus. To put
 it in prospective, if the workload is 5 % write and 95 % read you are
 probably going to need just the same amount of hardware.

 You might get some win on the read size because your custom caching could
 be more efficient in terms of object size in memory and other gc issues but
 it is not 2 or 3 to one.

 If a million writes fall into a hypertable forest but it take a billion
 years to read them back did the writes ever sync :)


 On Monday, February 13, 2012, Doug Judd d...@hypertable.com wrote:
  Hey Todd,
 
  Bulk loading isn't always an option when data is streaming in from a live
  application.  Many big data use cases involve massive amounts of smaller
  items in the size range of 10-100 bytes, for example URLs, sensor
 readings,
  genome sequence reads, network traffic logs, etc.  If HBase requires 2-3
  times the amount of hardware to avoid *Concurrent mode failures*, then
 that
  makes HBase 2-3 times more expensive from the standpoint of hardware,
 power
  consumption, and datacenter real estate.
 
  What takes the most time is getting the core database mechanics right
  (we're going on 5 years now).  Once the core database is stable,
  integration with applications such as Solr and others are short term
  projects.  I believe that sooner or later, most engineers working in this
  space will come to the conclusion that Java is the wrong language for
 this
  kind of database application.  At that point, folks on the HBase project
  will realize that they are five years behind.
 
  - Doug
 
  On Mon, Feb 13, 2012 at 11:33 AM, Todd Lipcon t...@cloudera.com wrote:
 
  Hey Doug,
 
  Want to also run a 

Re: Processing small xml files

2012-02-17 Thread Mohit Anchlia
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:

 I'm not sure what you mean by flat format here.

 In my scenario, I have an file input.xml that looks like this.

 <myfile>
   <section>
     <value>1</value>
   </section>
   <section>
     <value>2</value>
   </section>
 </myfile>

 input.xml is a plain text file. Not a sequence file. If I read it with the
 XMLInputFormat my mapper gets called with (key, value) pairs that look like
 this:

 (, <section><value>1</value></section>)
 (, <section><value>2</value></section>)

 Where the keys are numerical offsets into the file. I then use this
 information to write a sequence file with these (key, value) pairs. So my
 Hadoop job that uses XMLInputFormat takes a text file as input and produces
 a sequence file as output.

 I don't know a rule of thumb for how many small files is too many. Maybe
 someone else on the list can chime in. I just know that when your
 throughput gets slow that's one possible cause to investigate.


I need to install Hadoop. Does this XML input format come as part of the
install? Can you please give me some pointers that would help me install
Hadoop, and XMLInputFormat if necessary?
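
For reference, here is a rough sketch of the kind of job W.P. describes above: a map-only job that takes the (offset, XML fragment) pairs produced by an XMLInputFormat and writes them to a SequenceFile. The XMLInputFormat class itself is assumed to be available separately and is referred to only in a comment, so this is an illustration rather than a ready-made recipe.

// Hedged sketch: copy (offset, XML fragment) records into a SequenceFile.
// The actual XMLInputFormat implementation is assumed to come from elsewhere.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class XmlToSequenceFile {
    // Identity mapper: each (byte offset, <section>...</section>) pair
    // is written unchanged to the SequenceFile output.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text fragment, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(offset, fragment);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        // job.setInputFormatClass(XmlInputFormat.class); // whichever XMLInputFormat you use
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}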


Re: Processing small xml files

2012-02-17 Thread Srinivas Surasani
Hi Mohit,

You can use Pig for processing XML files. PiggyBank has a built-in load
function for loading XML files.
You can also specify pig.maxCombinedSplitSize and
pig.splitCombination for more efficient processing.

On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:

 I'm not sure what you mean by flat format here.

 In my scenario, I have an file input.xml that looks like this.

 <myfile>
   <section>
      <value>1</value>
   </section>
   <section>
      <value>2</value>
   </section>
 </myfile>

 input.xml is a plain text file. Not a sequence file. If I read it with the
 XMLInputFormat my mapper gets called with (key, value) pairs that look like
 this:

 (, <section><value>1</value></section>)
 (, <section><value>2</value></section>)

 Where the keys are numerical offsets into the file. I then use this
 information to write a sequence file with these (key, value) pairs. So my
 Hadoop job that uses XMLInputFormat takes a text file as input and produces
 a sequence file as output.

 I don't know a rule of thumb for how many small files is too many. Maybe
 someone else on the list can chime in. I just know that when your
 throughput gets slow that's one possible cause to investigate.


 I need to install hadoop. Does this xmlinput format comes as part of the
 install? Can you please give me some pointers that would help me install
 hadoop and xmlinputformat if necessary?



-- 
-- Srinivas
srini...@cloudwick.com