RE: OOM Error Map output copy.

2011-12-08 Thread Devaraj K
Hi Niranjan,

Everything looks OK as per the info you have given. Can you check
in the job.xml file whether these child opts are reflected, or whether
anything else is overwriting this config?

3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
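
For reference, a minimal sketch of one way to dump the effective values from a
job.xml on disk, using only the plain Configuration API; the file path and
class name are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class PrintJobConf {
  public static void main(String[] args) {
    // Load just the job.xml in question, without pulling in the default resources.
    Configuration conf = new Configuration(false);
    conf.addResource(new Path("/tmp/job.xml"));   // placeholder path to the job's job.xml

    String[] keys = {
        "mapred.child.java.opts",
        "mapred.job.shuffle.input.buffer.percent",
        "mapred.job.shuffle.merge.percent",
        "mapred.inmem.merge.threshold"
    };
    for (String key : keys) {
      // Prints null if the key is not present in that file.
      System.out.println(key + " = " + conf.get(key));
    }
  }
}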


And also, can you tell me which version of Hadoop you are using?


Devaraj K 

-Original Message-
From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] 
Sent: Thursday, December 08, 2011 12:21 AM
To: common-user@hadoop.apache.org
Subject: OOM Error Map output copy.

All 

I am encountering the following out-of-memory error during the reduce phase
of a large job.

Map output copy failure : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)
I tried increasing the memory available using mapred.child.java.opts, but
that only helps a little. The reduce task eventually fails again. Here are
some relevant job configuration details:

1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers
filter out a small percentage of the input (less than 1%).

2. I am currently using 12 reducers, and I can't increase this count by much
because I need to ensure availability of reduce slots for other users.

3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC

4. mapred.job.shuffle.input.buffer.percent  -- 0.70

5. mapred.job.shuffle.merge.percent -- 0.66

6. mapred.inmem.merge.threshold -- 1000

7. I have nearly 5000 mappers, which are supposed to produce LZO compressed
outputs. The logs seem to indicate that the map outputs range between 0.3 GB
and 0.8 GB.

Does anything here seem amiss? I'd appreciate any input on what settings to
try. I can try lower values for the input buffer percent and the merge
percent. Given that the job runs for about 7-8 hours before crashing,
I would like to make some informed choices if possible.
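
For experimentation, a minimal sketch of how these shuffle-related knobs can be
set per job through the old JobConf API; the lowered values are only
illustrations of the kind of change being considered, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    job.set("mapred.child.java.opts", "-Xms512M -Xmx1536M -XX:+UseSerialGC");
    // Example values only: a smaller in-memory shuffle buffer leaves more heap
    // headroom for the copier threads during the reduce copy phase.
    job.setFloat("mapred.job.shuffle.input.buffer.percent", 0.50f);
    job.setFloat("mapred.job.shuffle.merge.percent", 0.60f);
    job.setInt("mapred.inmem.merge.threshold", 1000);
    System.out.println("shuffle input buffer percent = "
        + job.getFloat("mapred.job.shuffle.input.buffer.percent", 0.70f));
  }
}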

Thanks. 
~ Niranjan.






RE: HDFS Backup nodes

2011-12-08 Thread Jorn Argelo - Ephorus
Hi Koji,

This was on CDH3u1. For the record, I also had dfs.name.dir.restore (which
Harsh mentioned) enabled.

Jorn

-Original Message-
From: Koji Noguchi [mailto:knogu...@yahoo-inc.com] 
Sent: Wednesday, December 7, 2011 17:59
To: common-user@hadoop.apache.org
Subject: Re: HDFS Backup nodes

Hi Jorn, 

Which hadoop version were you using when you hit that issue?

Koji


On 12/7/11 5:25 AM, Jorn Argelo - Ephorus jorn.arg...@ephorus.com
wrote:

 Just to add to that note - we've run into an issue where the NFS share
 was out of sync (the namenode marked that storage as failed even though the
 NFS share was working), but the other, local metadata was fine. At the restart
 of the namenode it picked the NFS share's fsimage even though it was out of
 sync. This had the effect that loads of blocks were marked as invalid
 and deleted by the datanodes, and the namenode never came out of safe
 mode because it was missing blocks. The Hadoop documentation says it
 always picks the most recent version of the fsimage, but in my case this
 doesn't seem to have happened. Maybe a bug? With that said, I've had
 issues with NFS before (the NFS namenode storage always failed
 every hour even when the cluster was idle).
 
 Now since this was just test data it wasn't all that important ... but
 if that were to happen on your production cluster, you'd have a real
 problem. I've moved away from NFS and I'm using DRBD instead. I'm not
 having any problems anymore whatsoever.
 
 YMMV.
 
 Jorn
 
 -Original Message-
 From: Joey Echeverria [mailto:j...@cloudera.com]
 Sent: Wednesday, December 7, 2011 12:08
 To: common-user@hadoop.apache.org
 Subject: Re: HDFS Backup nodes
 
 You should also configure the NameNode to use an NFS mount for one of
 its storage directories. That will give you the most up-to-date backup of
 the metadata in case of total node failure.
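
 A small sketch of how the comma-separated dfs.name.dir list can be checked
 from the Configuration API; the directory values in the comments are
 placeholders for a local directory plus an NFS mount:

 import org.apache.hadoop.conf.Configuration;

 public class NameDirCheck {
   public static void main(String[] args) {
     Configuration conf = new Configuration();
     conf.addResource("hdfs-site.xml");  // pick up the HDFS settings on the classpath
     // dfs.name.dir takes a comma-separated list, e.g. a local directory plus
     // an NFS mount such as /data/1/dfs/nn,/mnt/nfs/dfs/nn (placeholder paths).
     String[] dirs = conf.getStrings("dfs.name.dir", "/tmp/hadoop/dfs/name");
     for (String dir : dirs) {
       System.out.println("NameNode metadata dir: " + dir);
     }
   }
 }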
 
 -Joey
 
 On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar
praveen...@gmail.com
 wrote:
 This means we are still relying on the Secondary NameNode approach for the
 NameNode's backup.
 Would OS-level mirroring of the NameNode be a good alternative to keep it
 alive all the time?
 
 Thanks,
 Praveenesh
 
 On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G
 mahesw...@huawei.comwrote:
 
 AFAIK, the backup node was introduced from version 0.21 onwards.
 
 From: praveenesh kumar [praveen...@gmail.com]
 Sent: Wednesday, December 07, 2011 12:40 PM
 To: common-user@hadoop.apache.org
 Subject: HDFS Backup nodes
 
 Does Hadoop 0.20.205 support configuring HDFS backup nodes?
 
 Thanks,
 Praveenesh
 
 
 



Routing and region deletes

2011-12-08 Thread Per Steffensen

Hi

The system we are going to work on will receive 50 million+ new data records 
every day. We need to keep a history of 2 years of data (that's 35+ 
billion data records in storage all in all), and that basically means 
that we also need to delete 50 million+ data records every day, or roughly 1.5 
billion every month. We plan to store the data records in HBase.


Is it somehow possible to tell HBase to put (route) all data records 
belonging to a specific date or month to a designated set of regions 
(and route nothing else there), so that deleting all data belonging to 
that day/month is basically deleting those regions entirely? And is 
explicit deletion of entire regions possible at all?


The reason I want to do this is that I expect it to be much faster than 
doing explicit record-by-record deletion of 50 million+ records every day.


Regards, Per Steffensen




Re: Routing and region deletes

2011-12-08 Thread Michel Segel
Per Steffensen,

I would urge you to step away from the keyboard and rethink your design.
It sounds like you want to replicate a date partition model similar to what you 
would do if you were attempting this with a relational database.

HBase is not a relational database and you have a different way of doing things.

You could put the date/time stamp in the key such that your data is sorted by 
date.
However, this would cause hot spots.  Think about how you access the data. It 
sounds like you access the more recent data more frequently than historical 
data.  This is a bad idea in HBase.
(note: it may still make sense to do this ... You have to think more about the 
data and consider alternatives.)

I personally would hash the key for even distribution, again depending on the 
data access pattern.  (hashed data means you can't do range queries but again, 
it depends on what you are doing...)
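
For illustration, a minimal sketch of the kind of hashed (salted) row key
described above, assuming a plain MD5 prefix; the key layout is just an example:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SaltedKey {
  // Prefix the natural key with a couple of hex bytes of its MD5 so that writes
  // spread across regions instead of hammering the "current day" region.
  public static String saltedKey(String naturalKey) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(naturalKey.getBytes());
    String prefix = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
    return prefix + "-" + naturalKey;
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    // e.g. a timestamp plus a record id as the natural key (made-up example)
    System.out.println(saltedKey("20111208-record-42"));
  }
}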

You also have to think about how you purge the data. You don't just drop a 
region. Doing a full table scan once a month to delete may not be a bad thing. 
Again it depends on what you are doing...

Just my opinion. Others will have their own... Now I'm stepping away from the 
keyboard to get my morning coffee...
:-)


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 8, 2011, at 7:13 AM, Per Steffensen st...@designware.dk wrote:

 Hi
 
 The system we are going to work on will receive 50 million+ new data records every 
 day. We need to keep a history of 2 years of data (that's 35+ billion 
 data records in storage all in all), and that basically means that we also 
 need to delete 50 million+ data records every day, or roughly 1.5 billion every month. 
 We plan to store the data records in HBase.
 
 Is it somehow possible to tell HBase to put (route) all data records belonging 
 to a specific date or month to a designated set of regions (and route nothing 
 else there), so that deleting all data belonging to that day/month is 
 basically deleting those regions entirely? And is explicit deletion of entire 
 regions possible at all?
 
 The reason I want to do this is that I expect it to be much faster than doing 
 explicit record-by-record deletion of 50 million+ records every day.
 
 Regards, Per Steffensen
 
 
 


Re: Routing and region deletes

2011-12-08 Thread Per Steffensen

Thanks for your reply!

Michel Segel wrote:

Per Steffensen,

I would urge you to step away from the keyboard and rethink your design.
  
Will do :-) But would actually still like to receive answers for my 
questions - just pretend that my ideas are not so stupid and let me know 
if it can be done

It sounds like you want to replicate a date partition model similar to what you 
would do if you were attempting this with a relational database.

HBase is not a relational database and you have a different way of doing things.
  

I know

You could put the date/time stamp in the key such that your data is sorted by 
date.
  
But I guess that would not guarantee that records with timestamps from a 
specific day or month all exist in the same set of regions and that 
records with timestamps from other days or months all exist outside 
those regions, so that I can delete records from that day or month, just 
by deleting the regions.

However, this would cause hot spots.  Think about how you access the data. It 
sounds like you access the more recent data more frequently than historical 
data.
Not necessarily wrt reading, but certainly I (almost) only write new 
records with timestamps from the current day/month.

  This is a bad idea in HBase.
(note: it may still make sense to do this ... You have to think more about the 
data and consider alternatives.)

I personally would hash the key for even distribution, again depending on the 
data access pattern.  (hashed data means you can't do range queries but again, 
it depends on what you are doing...)

You also have to think about how you purge the data. You don't just drop a 
region.
I know that this is not the default way of deleting data, but is it 
possible? I believe a region is basically just a folder with a set of 
files, and deleting those would be a matter of a few ms. So if I can 
route all records with timestamps from a certain day or month to a 
designated set of regions, deleting all those records will be a matter 
of deleting #regions-in-that-set folders on disk - very quick. The 
alternative is to do 50 million+ single delete operations every day (or 1.5 
billion operations every month), and that will not even free up space 
immediately, since the records will actually just be marked deleted (in a 
new file) - space will not be freed before the next compaction of the 
involved regions (see e.g. http://outerthought.org/blog/465-ot.html).
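
For scale, the record-by-record alternative mentioned above would look roughly
like this with the HBase client API of that era, batching Delete objects; the
table name and row keys are placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedDeletes {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "records");   // placeholder table name

    // Each Delete only writes a tombstone; space comes back at the next major
    // compaction, which is exactly the point made above.
    List<Delete> batch = new ArrayList<Delete>();
    for (String rowKey : new String[] {"20111208-0001", "20111208-0002"}) {
      batch.add(new Delete(Bytes.toBytes(rowKey)));   // placeholder row keys
    }
    table.delete(batch);
    table.close();
  }
}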

 Doing a full table scan once a month to delete may not be a bad thing.
But I don't believe one full table scan will be enough. For that to be 
possible, at least I would have to be able to provide HBase with all 1.5 
billion records to delete in one delete call - that's probably not 
possible :-)

 Again it depends on what you are doing...

Just my opinion. Others will have their own... Now I'm stepping away from the 
keyboard to get my morning coffee...
  

Enjoy. Then I will consider leaving work (it's late afternoon in Europe).

:-)


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 8, 2011, at 7:13 AM, Per Steffensen st...@designware.dk wrote:

  

Hi

The system we are going to work on will receive 50 million+ new data records every 
day. We need to keep a history of 2 years of data (that's 35+ billion 
data records in storage all in all), and that basically means that we also 
need to delete 50 million+ data records every day, or roughly 1.5 billion every month. 
We plan to store the data records in HBase.

Is it somehow possible to tell HBase to put (route) all data records belonging 
to a specific date or month to a designated set of regions (and route nothing 
else there), so that deleting all data belonging to that day/month is basically 
deleting those regions entirely? And is explicit deletion of entire regions 
possible at all?

The reason I want to do this is that I expect it to be much faster than doing 
explicit record-by-record deletion of 50 million+ records every day.

Regards, Per Steffensen






  




Re: OOM Error Map output copy.

2011-12-08 Thread Niranjan Balasubramanian
Devaraj

These are indeed the actual settings I copied over from the job.xml. 

~ Niranjan.
On Dec 8, 2011, at 12:10 AM, Devaraj K wrote:

 Hi Niranjan,
 
   Everything looks OK as per the info you have given. Can you check
 in the job.xml file whether these child opts are reflected, or whether
 anything else is overwriting this config?
   
 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
 
 
 And also, can you tell me which version of Hadoop you are using?
 
 
 Devaraj K 
 
 -Original Message-
 From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] 
 Sent: Thursday, December 08, 2011 12:21 AM
 To: common-user@hadoop.apache.org
 Subject: OOM Error Map output copy.
 
 All 
 
 I am encountering the following out-of-memory error during the reduce phase
 of a large job.
 
 Map output copy failure : java.lang.OutOfMemoryError: Java heap space
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)
 I tried increasing the memory available using mapred.child.java.opts, but
 that only helps a little. The reduce task eventually fails again. Here are
 some relevant job configuration details:
 
 1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers
 filter out a small percentage of the input ( less than 1%).
 
 2. I am currently using 12 reducers and I can't increase this count by much
 to ensure availability of reduce slots for other users. 
 
 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
 
 4. mapred.job.shuffle.input.buffer.percent -- 0.70
 
 5. mapred.job.shuffle.merge.percent   -- 0.66
 
 6. mapred.inmem.merge.threshold   -- 1000
 
 7. I have nearly 5000 mappers which are supposed to produce LZO compressed
 outputs. The logs seem to indicate that the map outputs range between 0.3G
 to 0.8GB. 
 
 Does anything here seem amiss? I'd appreciate any input of what settings to
 try. I can try different reduced values for the input buffer percent and the
 merge percent.  Given that the job runs for about 7-8 hours before crashing,
 I would like to make some informed choices if possible.
 
 Thanks. 
 ~ Niranjan.
 
 
 
 



Re: OOM Error Map output copy.

2011-12-08 Thread Niranjan Balasubramanian
I am using version 0.20.203.

Thanks
~ Niranjan.
On Dec 8, 2011, at 9:26 AM, Niranjan Balasubramanian wrote:

 Devaraj
 
 These are indeed the actual settings I copied over from the job.xml. 
 
 ~ Niranjan.
 On Dec 8, 2011, at 12:10 AM, Devaraj K wrote:
 
 Hi Niranjan,
 
  Everything looks OK as per the info you have given. Can you check
 in the job.xml file whether these child opts are reflected, or whether
 anything else is overwriting this config?
  
 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
 
 
 And also, can you tell me which version of Hadoop you are using?
 
 
 Devaraj K 
 
 -Original Message-
 From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] 
 Sent: Thursday, December 08, 2011 12:21 AM
 To: common-user@hadoop.apache.org
 Subject: OOM Error Map output copy.
 
 All 
 
 I am encountering the following out-of-memory error during the reduce phase
 of a large job.
 
 Map output copy failure : java.lang.OutOfMemoryError: Java heap space
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)
 I tried increasing the memory available using mapred.child.java.opts, but
 that only helps a little. The reduce task eventually fails again. Here are
 some relevant job configuration details:
 
 1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers
 filter out a small percentage of the input ( less than 1%).
 
 2. I am currently using 12 reducers and I can't increase this count by much
 to ensure availability of reduce slots for other users. 
 
 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
 
 4. mapred.job.shuffle.input.buffer.percent   -- 0.70
 
 5. mapred.job.shuffle.merge.percent  -- 0.66
 
 6. mapred.inmem.merge.threshold  -- 1000
 
 7. I have nearly 5000 mappers which are supposed to produce LZO compressed
 outputs. The logs seem to indicate that the map outputs range between 0.3G
 to 0.8GB. 
 
 Does anything here seem amiss? I'd appreciate any input of what settings to
 try. I can try different reduced values for the input buffer percent and the
 merge percent.  Given that the job runs for about 7-8 hours before crashing,
 I would like to make some informed choices if possible.
 
 Thanks. 
 ~ Niranjan.
 
 
 
 
 



Cloudera Free

2011-12-08 Thread Bai Shen
Does anyone know of a good tutorial for Cloudera Free?  I found
installation instructions, but there doesn't seem to be information on how
to run jobs, etc., once you have it set up.

Thanks.


Question about accessing another HDFS

2011-12-08 Thread Frank Astier
Hi -

We have two namenodes set up at our company, say:

hdfs://A.mycompany.com
hdfs://B.mycompany.com

From the command line, I can do:

hadoop fs -ls hdfs://A.mycompany.com//some-dir

And

hadoop fs -ls hdfs://B.mycompany.com//some-other-dir

I’m now trying to do the same from a Java program that uses the HDFS API. No 
luck there. I get an exception: “Wrong FS”.

Any idea what I’m missing in my Java program??

Thanks,

Frank


Re: Question about accessing another HDFS

2011-12-08 Thread Tom Melendez
I'm hoping there is a better answer, but I'm thinking you could load
another configuration file (with B.company in it) using Configuration,
grab a FileSystem obj with that and then go forward.  Seems like some
unnecessary overhead though.
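
A rough sketch of that approach, assuming the second cluster's configuration
files are available locally; the paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListOtherCluster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Resources added later override earlier ones, so cluster B's
    // fs.default.name wins over whatever core-site.xml is on the classpath.
    conf.addResource(new Path("/path/to/b-cluster/core-site.xml"));  // placeholder

    FileSystem fs = FileSystem.get(conf);
    FileStatus[] listing = fs.listStatus(new Path("/some-other-dir"));
    if (listing != null) {
      for (FileStatus status : listing) {
        System.out.println(status.getPath());
      }
    }
  }
}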

Thanks,

Tom

On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com wrote:
 Hi -

 We have two namenodes set up at our company, say:

 hdfs://A.mycompany.com
 hdfs://B.mycompany.com

 From the command line, I can do:

 hadoop fs -ls hdfs://A.mycompany.com//some-dir

 And

 hadoop fs -ls hdfs://B.mycompany.com//some-other-dir

 I’m now trying to do the same from a Java program that uses the HDFS API. No 
 luck there. I get an exception: “Wrong FS”.

 Any idea what I’m missing in my Java program??

 Thanks,

 Frank


Re: Question about accessing another HDFS

2011-12-08 Thread Jay Vyas
Can you show your code here? What URL protocol are you using?

On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez t...@supertom.com wrote:

 I'm hoping there is a better answer, but I'm thinking you could load
 another configuration file (with B.company in it) using Configuration,
 grab a FileSystem obj with that and then go forward.  Seems like some
 unnecessary overhead though.

 Thanks,

 Tom

 On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com
 wrote:
  Hi -
 
  We have two namenodes set up at our company, say:
 
  hdfs://A.mycompany.com
  hdfs://B.mycompany.com
 
  From the command line, I can do:
 
  hadoop fs -ls hdfs://A.mycompany.com//some-dir
 
  And
 
  hadoop fs -ls hdfs://B.mycompany.com//some-other-dir
 
  I’m now trying to do the same from a Java program that uses the HDFS
 API. No luck there. I get an exception: “Wrong FS”.
 
  Any idea what I’m missing in my Java program??
 
  Thanks,
 
  Frank




-- 
Jay Vyas
MMSB/UCHC


Re: Question about accessing another HDFS

2011-12-08 Thread Frank Astier
Can you show your code here ?  What URL protocol are you using ?

I’m guess I’m being very naïve (and relatively new to HDFS). I can’t show too 
much code, but basically, I’d like to do:

Path myPath = new Path(“hdfs://A.mycompany.com//some-dir”);

Where Path is a hadoop fs path. I think I can take it from there, if that 
worked... Did you mean that I need to address the namenode with an http:// 
address?
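
One common way around the "Wrong FS" check, sketched under the assumption that
the standard FileSystem API is in use: bind the FileSystem object to the fully
qualified URI instead of relying on the default filesystem. Paths here are
placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoClusters {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // "Wrong FS" usually means a Path pointing at one cluster was handed to a
    // FileSystem bound to another (or to the default fs). Getting the
    // FileSystem for the URI you actually want avoids the mismatch.
    FileSystem fsA = FileSystem.get(URI.create("hdfs://A.mycompany.com"), conf);
    FileStatus[] listing = fsA.listStatus(new Path("/some-dir"));
    if (listing != null) {
      for (FileStatus status : listing) {
        System.out.println(status.getPath());
      }
    }
  }
}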

Thanks!

Frank

On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez t...@supertom.com wrote:

 I'm hoping there is a better answer, but I'm thinking you could load
 another configuration file (with B.company in it) using Configuration,
 grab a FileSystem obj with that and then go forward.  Seems like some
 unnecessary overhead though.

 Thanks,

 Tom

 On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com
 wrote:
  Hi -
 
  We have two namenodes set up at our company, say:
 
  hdfs://A.mycompany.com
  hdfs://B.mycompany.com
 
  From the command line, I can do:
 
  hadoop fs -ls hdfs://A.mycompany.com//some-dir
 
  And
 
  hadoop fs -ls hdfs://B.mycompany.com//some-other-dir
 
  I’m now trying to do the same from a Java program that uses the HDFS
 API. No luck there. I get an exception: “Wrong FS”.
 
  Any idea what I’m missing in my Java program??
 
  Thanks,
 
  Frank




--
Jay Vyas
MMSB/UCHC


Re: Cloudera Free

2011-12-08 Thread Joey Echeverria
Hi Bai,

I'm moving this over to scm-us...@cloudera.org as that's a more
appropriate list. (common-user bcced).

I assume by Cloudera Free you mean Cloudera Manager Free Edition?

You should be able to run a job in the same way that you do on any other
Hadoop cluster. The only caveat is that you first need to download
configuration files for your clients. There's information here on how
to do that:

https://ccp.cloudera.com/display/express37/Generating+Client+Configuration+Files

Assuming you put the files from the generated zip file in a directory
at $HOME/hadoop-conf, you'd run a job as follows:

hadoop --config $HOME/hadoop-conf jar
/usr/lib/hadoop/hadoop-0.20.2-cdh3u2-examples.jar pi 10 1

This shows running the example job which calculates pi.

-Joey

On Thu, Dec 8, 2011 at 4:31 PM, Bai Shen baishen.li...@gmail.com wrote:
 Does anyone know of a good tutorial for Cloudera Free?  I found
 installation instructions, but there doesn't seem to be information on how
 to run jobs, etc., once you have it set up.

 Thanks.



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Question about accessing another HDFS

2011-12-08 Thread JAX
I was confused about this for a while also. I don't have all the details, but 
I think my question on Stack Overflow might help you.

I was playing with different protocols,
trying to find a way to programmatically access all data in HDFS.

http://stackoverflow.com/questions/7844458/how-can-i-access-hadoop-via-the-hdfs-protocol-from-java

Jay Vyas 
MMSB
UCHC

On Dec 8, 2011, at 7:29 PM, Frank Astier fast...@yahoo-inc.com wrote:

 Can you show your code here ?  What URL protocol are you using ?
 
 I’m guess I’m being very naïve (and relatively new to HDFS). I can’t show too 
 much code, but basically, I’d like to do:
 
 Path myPath = new Path(“hdfs://A.mycompany.com//some-dir”);
 
 Where Path is a hadoop fs path. I think I can take it from there, if that 
 worked... Did you mean that I need to address the namenode with an http:// 
 address?
 
 Thanks!
 
 Frank
 
 On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez t...@supertom.com wrote:
 
 I'm hoping there is a better answer, but I'm thinking you could load
 another configuration file (with B.company in it) using Configuration,
 grab a FileSystem obj with that and then go forward.  Seems like some
 unnecessary overhead though.
 
 Thanks,
 
 Tom
 
 On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com
 wrote:
 Hi -
 
 We have two namenodes set up at our company, say:
 
 hdfs://A.mycompany.com
 hdfs://B.mycompany.com
 
 From the command line, I can do:
 
 hadoop fs -ls hdfs://A.mycompany.com//some-dir
 
 And
 
 hadoop fs -ls hdfs://B.mycompany.com//some-other-dir
 
 I’m now trying to do the same from a Java program that uses the HDFS
 API. No luck there. I get an exception: “Wrong FS”.
 
 Any idea what I’m missing in my Java program??
 
 Thanks,
 
 Frank
 
 
 
 
 --
 Jay Vyas
 MMSB/UCHC


Regarding Parallel Iron's claim

2011-12-08 Thread JS Jang

Hi,

Does anyone know of any discussion in Apache Hadoop regarding the claim by 
Parallel Iron that the use of HDFS infringes their patent?

Thanks in advance.

Regards,
JS




Re: Regarding Parallel Iron's claim

2011-12-08 Thread Jean-Daniel Cryans
Isn't that old news?

http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/

Googling around, it doesn't seem anything happened after that.

J-D

On Thu, Dec 8, 2011 at 6:52 PM, JS Jang jsja...@gmail.com wrote:
 Hi,

 Does anyone know of any discussion in Apache Hadoop regarding the claim by
 Parallel Iron that the use of HDFS infringes their patent?
 Thanks in advance.

 Regards,
 JS




how to integrate snappy into hadoop 0.20.205.0(apache release)

2011-12-08 Thread Jinyan Xu
Hi all,

Can anyone tell me how to integrate Snappy into Hadoop 0.20.205.0 (Apache 
release)? Not the Cloudera version.

Thanks!




Re: Regarding Parallel Iron's claim

2011-12-08 Thread JS Jang

I appreciate your help, J-D.
Yes, I wondered whether there has been any update since, or any previous 
discussion within Apache Hadoop, as I am new to this mailing list.


On 12/9/11 12:19 PM, Jean-Daniel Cryans wrote:

Isn't that old news?

http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/

Googling around, doesn't seem anything happened after that.

J-D

On Thu, Dec 8, 2011 at 6:52 PM, JS Jangjsja...@gmail.com  wrote:

Hi,

Does anyone know of any discussion in Apache Hadoop regarding the claim by
Parallel Iron that the use of HDFS infringes their patent?
Thanks in advance.

Regards,
JS





--

JS Jang (장정식) / jsj...@gruter.com
Gruter, Inc. ((주)그루터), Principal, R&D Team
www.gruter.com
Cloud, Search and Social




Re: Regarding Parallel Iron's claim

2011-12-08 Thread Jean-Daniel Cryans
You could just look at the archives:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/

It is also indexed by all search engines.

J-D

On Thu, Dec 8, 2011 at 7:44 PM, JS Jang jsja...@gmail.com wrote:
 I appreciate your help, J-D.
 Yes, I wondered whether there was any update since or previous discussion
 within Apache Hadoop as I am new in this mailing list.


 On 12/9/11 12:19 PM, Jean-Daniel Cryans wrote:

 Isn't that old news?

 http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/

 Googling around, doesn't seem anything happened after that.

 J-D

 On Thu, Dec 8, 2011 at 6:52 PM, JS Jangjsja...@gmail.com  wrote:

 Hi,

 Does anyone know of any discussion in Apache Hadoop regarding the claim by
 Parallel Iron that the use of HDFS infringes their patent?
 Thanks in advance.

 Regards,
 JS




 --
 
 JS Jang (장정식) / jsj...@gruter.com
 Gruter, Inc. ((주)그루터), Principal, R&D Team
 www.gruter.com
 Cloud, Search and Social
 



Re: Regarding Parallel Iron's claim

2011-12-08 Thread JS Jang
Got it. Thanks again, J-D.

On 12/9/11 12:54 PM, Jean-Daniel Cryans wrote:
 You could just look at the archives:
 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/

 It is also indexed by all search engines.

 J-D

 On Thu, Dec 8, 2011 at 7:44 PM, JS Jang jsja...@gmail.com wrote:
 I appreciate your help, J-D.
 Yes, I wondered whether there was any update since or previous discussion
 within Apache Hadoop as I am new in this mailing list.


 On 12/9/11 12:19 PM, Jean-Daniel Cryans wrote:
 Isn't that old news?

 http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/

 Googling around, doesn't seem anything happened after that.

 J-D

 On Thu, Dec 8, 2011 at 6:52 PM, JS Jangjsja...@gmail.com  wrote:
 Hi,

 Does anyone know of any discussion in Apache Hadoop regarding the claim by
 Parallel Iron that the use of HDFS infringes their patent?
 Thanks in advance.

 Regards,
 JS



 --
 
 장정식 / jsj...@gruter.com
 (주)그루터, RD팀 수석
 www.gruter.com
 Cloud, Search and Social
 



-- 

JS Jang (장정식) / jsj...@gruter.com
Gruter, Inc. ((주)그루터), Principal, R&D Team
www.gruter.com
Cloud, Search and Social




Re: Routing and region deletes

2011-12-08 Thread Per Steffensen
Ahhh, stupid me. I probably just want to use different tables for 
different days/months. I believe tables can be deleted fairly quickly in 
HBase?
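
For reference, a rough sketch of what dropping a per-month table looks like with
the HBaseAdmin API of that era; the table name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DropMonthlyTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    String table = "records_2011_12";   // placeholder: one table per month
    if (admin.tableExists(table)) {
      admin.disableTable(table);        // a table must be disabled before deletion
      admin.deleteTable(table);         // drops the table and its on-disk data
    }
  }
}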


Regards, Per Steffensen

Per Steffensen wrote:

Thanks for your reply!

Michel Segel wrote:

Per Steffensen,

I would urge you to step away from the keyboard and rethink your design.
  
Will do :-) But would actually still like to receive answers for my 
questions - just pretend that my ideas are not so stupid and let me 
know if it can be done.
It sounds like you want to replicate a date partition model similar 
to what you would do if you were attempting this with a relational database.


HBase is not a relational database and you have a different way of 
doing things.
  

I know
You could put the date/time stamp in the key such that your data is 
sorted by date.
  
But I guess that would not guarantee that records with timestamps from 
a specific day or month all exist in the same set of regions and that 
records with timestamps from other days or months all exist outside 
those regions, so that I can delete records from that day or month, 
just by deleting the regions.
However, this would cause hot spots.  Think about how you access the 
data. It sounds like you access the more recent data more frequently 
than historical data.
Not necessarily wrt reading, but certainly I (almost) only write new 
records with timestamps from the current day/month.

  This is a bad idea in HBase.
(note: it may still make sense to do this ... You have to think more 
about the data and consider alternatives.)


I personally would hash the key for even distribution, again 
depending on the data access pattern.  (hashed data means you can't 
do range queries but again, it depends on what you are doing...)


You also have to think about how you purge the data. You don't just 
drop a region.
I know that this is not the default way of deleting data, but is it 
possible? I believe a region is basically just a folder with a set of 
files, and deleting those would be a matter of a few ms. So if I can 
route all records with timestamps from a certain day or month to a 
designated set of regions, deleting all those records will be a matter 
of deleting #regions-in-that-set folders on disk - very quick. The 
alternative is to do 50 million+ single delete operations every day (or 1.5 
billion operations every month), and that will not even free up space 
immediately, since the records will actually just be marked deleted (in 
a new file) - space will not be freed before the next compaction of the 
involved regions (see e.g. http://outerthought.org/blog/465-ot.html).

 Doing a full table scan once a month to delete may not be a bad thing.
But I don't believe one full table scan will be enough. For that to be 
possible, at least I would have to be able to provide HBase with all 
1.5 billion records to delete in one delete call - that's probably 
not possible :-)

 Again it depends on what you are doing...

Just my opinion. Others will have their own... Now I'm stepping away 
from the keyboard to get my morning coffee...
  

Enjoy. Then I will consider leaving work (it's late afternoon in Europe).

:-)


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 8, 2011, at 7:13 AM, Per Steffensen st...@designware.dk wrote:

 

Hi

The system we are going to work on will receive 50 million+ new 
data records every day. We need to keep a history of 2 years of data 
(that's 35+ billion data records in storage all in all), and that 
basically means that we also need to delete 50 million+ data records every 
day, or roughly 1.5 billion every month. We plan to store the 
data records in HBase.


Is it somehow possible to tell HBase to put (route) all data records 
belonging to a specific date or month to a designated set of regions 
(and route nothing else there), so that deleting all data belonging 
to that day/month is basically deleting those regions entirely? And 
is explicit deletion of entire regions possible at all?


The reason I want to do this is that I expect it to be much faster 
than doing explicit record-by-record deletion of 50 million+ records 
every day.


Regards, Per Steffensen






  







Re: Not able to post a job in Hadoop 0.23.0

2011-12-08 Thread Arun C Murthy
Moving to mapreduce-user@, bcc common-user@.

Can you see any errors in the logs? Typically this happens when you have no 
NodeManagers.

Check the 'nodes' link and then RM logs.

Arun

On Nov 29, 2011, at 8:36 PM, Nitin Khandelwal wrote:

 Hi,
 
 I have successfully set up Hadoop 0.23.0 on a single machine. When I post a job,
 it gets posted successfully (I can see the job in the UI), but the job is never
 ASSIGNED and waits forever.
 Here are the details of what I see for that job in the UI:
 
 Name: random-writer
 State: ACCEPTED
 FinalStatus: UNDEFINED
 Started: 30-Nov-2011 10:08:55
 Elapsed: 49sec
 Tracking URL: UNASSIGNED (http://192.168.0.93:8900/cluster/app/application_1322627869620_0001#)
 Diagnostics: AM container logs: AM not yet registered with RM
 Cluster ID: 1322627869620
 ResourceManager state: STARTED
 ResourceManager started on: 30-Nov-2011 10:07:49
 ResourceManager version: 0.23.0 from 722cd694fc4ab6d040c0a34f9fb5b476e330ee60 by hortonmu, source checksum 4975bf112aa7faa5673f604045ced798, on Thu Nov 3 09:07:31 UTC 2011
 Hadoop version: 0.23.0 from d4fee83ec1462ab9824add6449320617caa7c605 by hortonmu, source checksum 4e42b2d96c899a98a8ab8c7cc23f27ae, on Thu Nov 3 08:59:12 UTC 2011
 Can someone tell me where I am going wrong?
 
 Thanks,
 -- 
 Nitin Khandelwal



Choosing IO intensive and CPU intensive workloads

2011-12-08 Thread ArunKumar
Hi guys !

I want to see the behavior of a single node of a Hadoop cluster when an
IO-intensive workload, a CPU-intensive workload, or a mix of both is submitted
to that node alone.
These workloads must stress the node.
I see that the TestDFSIO benchmark is good for an IO-intensive workload.
1. Which benchmarks do I need to use for this?
2. What amount of input data will be fair enough for seeing the behavior
under these workloads for each type of box, if I have boxes with:
 B1: 4 GB RAM, dual core, 150-250 GB disk
 B2: 1 GB RAM, 50-80 GB disk

Arun

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Choosing-IO-intensive-and-CPU-intensive-workloads-tp3572282p3572282.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Not able to post a job in Hadoop 0.23.0

2011-12-08 Thread Nitin Khandelwal
Hi Arun,
Thanks for your reply.


There is one NodeManager running; the following is from the NodeManager UI:

Rack: /default-rack
Node State: RUNNING
Node Address: germinait93:50033
Node HTTP Address: germinait93:
Health-status: Healthy
Last health-update: 9-Dec-2011 13:03:33
Health-report: Healthy
Containers: 0
Mem Used: 0 KB
Mem Avail: 1 GB

Also, I only get to see the following logs relevant to the job posting:

2011-12-09 13:10:57,300 INFO  fifo.FifoScheduler (FifoScheduler.java:
addApplication(288)) - Application Submission:
application_1323416004722_0002 from minal.kothari, currently
 active: 1
2011-12-09 13:10:57,300 INFO  attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:handle(464)) - Processing event for
appattempt_1323416004722_0002_01 of type APP_ACCEPTED
2011-12-09 13:10:57,317 INFO  attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:handle(476)) - appattempt_1323416004722_0002_01
State change from SUBMITTED to SCHEDULED
2011-12-09 13:10:57,318 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(416))
- Processing event for application_1323416004722_0002 of type APP_ACCEPTED
2011-12-09 13:10:57,318 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(428))
- application_1323416004722_0002 State change from SUBMITTED to ACCEPTED
2011-12-09 13:10:57,320 INFO  resourcemanager.RMAuditLogger
(RMAuditLogger.java:logSuccess(140)) - USER=minal.kothari   IP=192.168.0.93
OPERATION=Submit Application Request
TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1323416004722_0002


Please let me know if you need some other logs .

Thanks,
Nitin




On 9 December 2011 12:44, Arun C Murthy a...@hortonworks.com wrote:

 Moving to mapreduce-user@, bcc common-user@.

 Can you see any errors in the logs? Typically this happens when you have
 no NodeManagers.

 Check the 'nodes' link and then RM logs.

 Arun

 On Nov 29, 2011, at 8:36 PM, Nitin Khandelwal wrote:

  HI ,
 
  I have successfully setup Hadoop 0.23.0 in a single m/c. When i post a
 job,
  it gets posted successfully (i can see the job in UI), but the job is
 never
  ASSIGNED and waits forever.
  Here are details of what i see for that Job in UI
 
 
  Name: random-writer  State: ACCEPTED  FinalStatus: UNDEFINED
  Started: 30-Nov-2011
  10:08:55  Elapsed: 49sec  Tracking URL:
  UNASSIGNED
 http://192.168.0.93:8900/cluster/app/application_1322627869620_0001#
  Diagnostics:
  AM container logs: AM not yet registered with RM  Cluster ID:
 1322627869620
  ResourceManager state: STARTED  ResourceManager started on: 30-Nov-2011
  10:07:49  ResourceManager version: 0.23.0 from
  722cd694fc4ab6d040c0a34f9fb5b476e330ee60 by hortonmu source checksum
  4975bf112aa7faa5673f604045ced798 on Thu Nov 3 09:07:31 UTC 2011  Hadoop
  version: 0.23.0 from d4fee83ec1462ab9824add6449320617caa7c605 by hortonmu
  source checksum 4e42b2d96c899a98a8ab8c7cc23f27ae on Thu Nov 3 08:59:12
 UTC
  2011
  Can some one tell where am i going wrong??
 
  Thanks,
  --
  Nitin Khandelwal




-- 


Nitin Khandelwal


Re: Not able to post a job in Hadoop 0.23.0

2011-12-08 Thread Nitin Khandelwal
CC : mapreduce-user

On 9 December 2011 13:14, Nitin Khandelwal
nitin.khandel...@germinait.comwrote:

 Hi Arun,
 Thanks for your reply.


 There is one NodeManager running; the following is from the NodeManager UI:

 Rack: /default-rack
 Node State: RUNNING
 Node Address: germinait93:50033
 Node HTTP Address: germinait93:
 Health-status: Healthy
 Last health-update: 9-Dec-2011 13:03:33
 Health-report: Healthy
 Containers: 0
 Mem Used: 0 KB
 Mem Avail: 1 GB

 Also, I only get to see the following logs relevant to the job posting:

 2011-12-09 13:10:57,300 INFO  fifo.FifoScheduler (FifoScheduler.java:
 addApplication(288)) - Application Submission:
 application_1323416004722_0002 from minal.kothari, currently
  active: 1
 2011-12-09 13:10:57,300 INFO  attempt.RMAppAttemptImpl
 (RMAppAttemptImpl.java:handle(464)) - Processing event for
 appattempt_1323416004722_0002_01 of type APP_ACCEPTED
 2011-12-09 13:10:57,317 INFO  attempt.RMAppAttemptImpl
 (RMAppAttemptImpl.java:handle(476)) - appattempt_1323416004722_0002_01
 State change from SUBMITTED to SCHEDULED
 2011-12-09 13:10:57,318 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(416))
 - Processing event for application_1323416004722_0002 of type APP_ACCEPTED
 2011-12-09 13:10:57,318 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(428))
 - application_1323416004722_0002 State change from SUBMITTED to ACCEPTED
 2011-12-09 13:10:57,320 INFO  resourcemanager.RMAuditLogger
 (RMAuditLogger.java:logSuccess(140)) - USER=minal.kothari   IP=192.168.0.93
 OPERATION=Submit Application Request
 TARGET=ClientRMService  RESULT=SUCCESS
 APPID=application_1323416004722_0002


 Please let me know if you need some other logs .

 Thanks,
 Nitin




 On 9 December 2011 12:44, Arun C Murthy a...@hortonworks.com wrote:

 Moving to mapreduce-user@, bcc common-user@.

 Can you see any errors in the logs? Typically this happens when you have
 no NodeManagers.

 Check the 'nodes' link and then RM logs.

 Arun

 On Nov 29, 2011, at 8:36 PM, Nitin Khandelwal wrote:

  HI ,
 
  I have successfully setup Hadoop 0.23.0 in a single m/c. When i post a
 job,
  it gets posted successfully (i can see the job in UI), but the job is
 never
  ASSIGNED and waits forever.
  Here are details of what i see for that Job in UI
 
 
  Name: random-writer  State: ACCEPTED  FinalStatus: UNDEFINED
  Started: 30-Nov-2011
  10:08:55  Elapsed: 49sec  Tracking URL:
  UNASSIGNED
 http://192.168.0.93:8900/cluster/app/application_1322627869620_0001#
  Diagnostics:
  AM container logs: AM not yet registered with RM  Cluster ID:
 1322627869620
  ResourceManager state: STARTED  ResourceManager started on: 30-Nov-2011
  10:07:49  ResourceManager version: 0.23.0 from
  722cd694fc4ab6d040c0a34f9fb5b476e330ee60 by hortonmu source checksum
  4975bf112aa7faa5673f604045ced798 on Thu Nov 3 09:07:31 UTC 2011  Hadoop
  version: 0.23.0 from d4fee83ec1462ab9824add6449320617caa7c605 by
 hortonmu
  source checksum 4e42b2d96c899a98a8ab8c7cc23f27ae on Thu Nov 3 08:59:12
 UTC
  2011
  Can some one tell where am i going wrong??
 
  Thanks,
  --
  Nitin Khandelwal




 --


 Nitin Khandelwal





-- 


Nitin Khandelwal


Re: Choosing IO intensive and CPU intensive workloads

2011-12-08 Thread alo alt
Hi Arun,

Michael has written up a good tutorial about this, including stress testing and IO:
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

- Alex

On Fri, Dec 9, 2011 at 8:24 AM, ArunKumar arunk...@gmail.com wrote:

 Hi guys !

 I want to see the behavior of a single node of Hadoop cluster when IO
 intensive / CPU intensive workload and mix of both is submitted to the
 single node alone.
 These workloads must stress the nodes.
 I see that TestDFSIO benchmark is good for IO intensive workload.
 1 Which benchmarks do i need to use for this ?
 2 What amount of input data will be fair enough for seeing the behavior
 under these workloads for each type of boxes if i have boxes with :-
  B1: 4 GB RAM, Dual  core ,150-250 GB DISK ,
  B2 : 1GB RAM, 50-80 GB Disk.

 Arun

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Choosing-IO-intensive-and-CPU-intensive-workloads-tp3572282p3572282.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.




-- 
Alexander Lorenz
http://mapredit.blogspot.com

Think of the environment: please don't print this email unless you
really need to.