RE: OOM Error Map output copy.
Hi Niranjan,

Everything looks OK as per the information you have given. Can you check in the job.xml file whether these child opts are actually reflected, or whether anything else is overwriting this config?

3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC

Also, can you tell me which version of Hadoop you are using?

Devaraj K

-----Original Message-----
From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu]
Sent: Thursday, December 08, 2011 12:21 AM
To: common-user@hadoop.apache.org
Subject: OOM Error Map output copy.

All,

I am encountering the following out-of-memory error during the reduce phase of a large job:

Map output copy failure : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)

I tried increasing the memory available using mapred.child.java.opts, but that only helps a little. The reduce task eventually fails again. Here are some relevant job configuration details:

1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers filter out a small percentage of the input (less than 1%).
2. I am currently using 12 reducers, and I can't increase this count by much if I am to ensure availability of reduce slots for other users.
3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
4. mapred.job.shuffle.input.buffer.percent -- 0.70
5. mapred.job.shuffle.merge.percent -- 0.66
6. mapred.inmem.merge.threshold -- 1000
7. I have nearly 5000 mappers, which produce LZO-compressed outputs. The logs seem to indicate that the map outputs range between 0.3 GB and 0.8 GB.

Does anything here seem amiss? I'd appreciate any input on what settings to try. I can try different reduced values for the input buffer percent and the merge percent. Given that the job runs for about 7-8 hours before crashing, I would like to make some informed choices if possible.

Thanks.
~ Niranjan.
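(Not a message from the thread: a minimal sketch of the tuning Niranjan mentions trying — lower shuffle buffer fractions, larger child heap — using the old mapred API. The property names are the ones quoted above; the values are illustrative only, and the driver class is hypothetical.)

```java
import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuningExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(ShuffleTuningExample.class);

        // Child task heap, as quoted in the thread.
        conf.set("mapred.child.java.opts", "-Xms512M -Xmx1536M -XX:+UseSerialGC");

        // Use a smaller fraction of that heap for buffering map outputs during
        // the shuffle (illustrative value, lower than the 0.70 in the thread).
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.50");

        // Start merging to disk a little earlier (illustrative value).
        conf.set("mapred.job.shuffle.merge.percent", "0.60");

        // ... set input/output paths and mapper/reducer classes, then submit:
        // JobClient.runJob(conf);
    }
}
```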
RE: HDFS Backup nodes
Hi Koji,

This was on CDH3u1. For the record, I also had the dfs.name.dir.restore option which Harsh mentioned enabled.

Jorn

-----Original Message-----
From: Koji Noguchi [mailto:knogu...@yahoo-inc.com]
Sent: Wednesday, December 7, 2011 17:59
To: common-user@hadoop.apache.org
Subject: Re: HDFS Backup nodes

Hi Jorn,

Which hadoop version were you using when you hit that issue?

Koji

On 12/7/11 5:25 AM, Jorn Argelo - Ephorus jorn.arg...@ephorus.com wrote:

Just to add to that note - we ran into an issue where the NFS share was out of sync (the namenode storage failed even though the NFS share was working), but the other local metadata was fine. At restart, the namenode picked the NFS share's fsimage even though it was out of date. This had the effect that loads of blocks were marked as invalid and deleted by the datanodes, and the namenode never came out of safe mode because it was missing blocks. The Hadoop documentation says it always picks the most recent version of the fsimage, but in my case this doesn't seem to have happened. Maybe a bug?

With that said, I've had issues with NFS before (the NFS namenode storage directory failed every hour even when the cluster was idle). Since this was just test data it wasn't all that important ... but if that happened on your production cluster you'd have a real problem. I've moved away from NFS and I'm using DRBD instead. Not having any problems anymore whatsoever. YMMV.

Jorn

-----Original Message-----
From: Joey Echeverria [mailto:j...@cloudera.com]
Sent: Wednesday, December 7, 2011 12:08
To: common-user@hadoop.apache.org
Subject: Re: HDFS Backup nodes

You should also configure the NameNode to use an NFS mount for one of its storage directories. That will give the most up-to-date backup of the metadata in case of total node failure.

-Joey

On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com wrote:

This means we are still relying on the Secondary NameNode approach for the NameNode's backup. Is OS-level mirroring of the NameNode a good alternative to keep it alive all the time?

Thanks,
Praveenesh

On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G mahesw...@huawei.com wrote:

AFAIK the backup node was introduced from the 0.21 version onwards.

From: praveenesh kumar [praveen...@gmail.com]
Sent: Wednesday, December 07, 2011 12:40 PM
To: common-user@hadoop.apache.org
Subject: HDFS Backup nodes

Does hadoop 0.20.205 support configuring HDFS backup nodes?

Thanks,
Praveenesh
Routing and region deletes
Hi,

The system we are going to work on will receive 50 million+ new data records every day. We need to keep a history of 2 years of data (that's 35+ billion data records in storage all in all), which basically means that we also need to delete 50 million+ data records every day, or e.g. 1.5 billion every month. We plan to store the data records in HBase.

Is it somehow possible to tell HBase to put (route) all data records belonging to a specific date or month into a designated set of regions (and route nothing else there), so that deleting all data belonging to that day/month is basically a matter of deleting those regions entirely? And is explicit deletion of entire regions possible at all?

The reason I want to do this is that I expect it to be much faster than explicitly deleting 50 million+ records one by one every day.

Regards, Per Steffensen
Re: Routing and region deletes
Per Steffensen,

I would urge you to step away from the keyboard and rethink your design. It sounds like you want to replicate the date-partition model you would use if you were attempting this with a relational database. HBase is not a relational database, and you have a different way of doing things.

You could put the date/time stamp in the key such that your data is sorted by date. However, this would cause hot spots. Think about how you access the data. It sounds like you access the more recent data more frequently than historical data. This is a bad idea in HBase. (Note: it may still make sense to do this ... you have to think more about the data and consider alternatives.)

I personally would hash the key for even distribution, again depending on the data access pattern. (Hashed keys mean you can't do range queries, but again, it depends on what you are doing...)

You also have to think about how you purge the data. You don't just drop a region. Doing a full table scan once a month to delete may not be a bad thing. Again, it depends on what you are doing...

Just my opinion. Others will have their own... Now I'm stepping away from the keyboard to get my morning coffee... :-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 8, 2011, at 7:13 AM, Per Steffensen st...@designware.dk wrote:

Hi,

The system we are going to work on will receive 50 million+ new data records every day. We need to keep a history of 2 years of data (that's 35+ billion data records in storage all in all), which basically means that we also need to delete 50 million+ data records every day, or e.g. 1.5 billion every month. We plan to store the data records in HBase.

Is it somehow possible to tell HBase to put (route) all data records belonging to a specific date or month into a designated set of regions (and route nothing else there), so that deleting all data belonging to that day/month is basically a matter of deleting those regions entirely? And is explicit deletion of entire regions possible at all?

The reason I want to do this is that I expect it to be much faster than explicitly deleting 50 million+ records one by one every day.

Regards, Per Steffensen
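(Not part of Mike's reply: a minimal sketch of the key-hashing he describes, against the old 0.90-era HBase client API. The table name "records", column family "d", qualifier "payload", and the record id are all hypothetical; the point is only that a short hash prefix spreads consecutive timestamps across regions, at the cost of range scans.)

```java
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedKeyWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");            // hypothetical table name

        String recordId = "2011-12-08T12:00:00/record-42";     // hypothetical logical key
        byte[] logicalKey = Bytes.toBytes(recordId);

        // Prefix the row key with the first 4 bytes of an MD5 hash so rows spread
        // evenly across regions instead of piling up on "today's" region.
        byte[] hash = MessageDigest.getInstance("MD5").digest(logicalKey);
        byte[] rowKey = Bytes.add(Bytes.head(hash, 4), logicalKey);

        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
        table.put(put);
        table.close();
    }
}
```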
Re: Routing and region deletes
Thanks for your reply!

Michel Segel wrote:
> Per Steffensen,
> I would urge you to step away from the keyboard and rethink your design.

Will do :-) But I would actually still like to receive answers to my questions - just pretend that my ideas are not so stupid and let me know if it can be done.

> It sounds like you want to replicate the date-partition model you would use if you were attempting this with a relational database. HBase is not a relational database, and you have a different way of doing things.

I know.

> You could put the date/time stamp in the key such that your data is sorted by date.

But I guess that would not guarantee that records with timestamps from a specific day or month all exist in the same set of regions, and that records with timestamps from other days or months all exist outside those regions, so that I can delete the records from that day or month just by deleting the regions.

> However, this would cause hot spots. Think about how you access the data. It sounds like you access the more recent data more frequently than historical data.

Not necessarily with respect to reading, but certainly I (almost) only write new records with timestamps from the current day/month.

> This is a bad idea in HBase. (Note: it may still make sense to do this ... you have to think more about the data and consider alternatives.) I personally would hash the key for even distribution, again depending on the data access pattern. (Hashed keys mean you can't do range queries, but again, it depends on what you are doing...) You also have to think about how you purge the data. You don't just drop a region.

I know that this is not the default way of deleting data, but is it possible? I believe a region is basically just a folder with a set of files, and deleting those would be a matter of a few ms. So if I can route all records with timestamps from a certain day or month to a designated set of regions, deleting all those records will be a matter of deleting #regions-in-that-set folders on disk - very quick. The alternative is to do 50 million+ single delete operations every day (or 1.5 billion operations every month), and that will not even free up space immediately, since the records will actually just be marked deleted (in a new file) - space will not be freed before the next compaction of the involved regions (see e.g. http://outerthought.org/blog/465-ot.html).

> Doing a full table scan once a month to delete may not be a bad thing.

But I don't believe one full table scan will be enough. For that to be possible, at least I would have to be able to provide HBase with all 1.5 billion records to delete in one delete call - that's probably not possible :-)

> Again it depends on what you are doing... Just my opinion. Others will have their own... Now I'm stepping away from the keyboard to get my morning coffee...

Enjoy. Then I will consider leaving work (it's late afternoon in Europe) :-)

> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Dec 8, 2011, at 7:13 AM, Per Steffensen st...@designware.dk wrote:
>
> Hi,
> The system we are going to work on will receive 50 million+ new data records every day. We need to keep a history of 2 years of data (that's 35+ billion data records in storage all in all), which basically means that we also need to delete 50 million+ data records every day, or e.g. 1.5 billion every month. We plan to store the data records in HBase.
> Is it somehow possible to tell HBase to put (route) all data records belonging to a specific date or month into a designated set of regions (and route nothing else there), so that deleting all data belonging to that day/month is basically a matter of deleting those regions entirely? And is explicit deletion of entire regions possible at all?
> The reason I want to do this is that I expect it to be much faster than explicitly deleting 50 million+ records one by one every day.
> Regards, Per Steffensen
Re: OOM Error Map output copy.
Devaraj These are indeed the actual settings I copied over from the job.xml. ~ Niranjan. On Dec 8, 2011, at 12:10 AM, Devaraj K wrote: Hi Niranjan, Every thing looks ok as per the info you have given. Can you check in the job.xml file whether these child opts reflecting or any thing else is overwriting this config. 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC and also can you tell me which version of hadoop using? Devaraj K -Original Message- From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] Sent: Thursday, December 08, 2011 12:21 AM To: common-user@hadoop.apache.org Subject: OOM Error Map output copy. All I am encountering the following out-of-memory error during the reduce phase of a large job. Map output copy failure : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMe mory(ReduceTask.java:1669) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutpu t(ReduceTask.java:1529) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput( ReduceTask.java:1378) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceT ask.java:1310) I tried increasing the memory available using mapped.child.java.opts but that only helps a little. The reduce task eventually fails again. Here are some relevant job configuration details: 1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers filter out a small percentage of the input ( less than 1%). 2. I am currently using 12 reducers and I can't increase this count by much to ensure availability of reduce slots for other users. 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC 4. mapred.job.shuffle.input.buffer.percent-- 0.70 5. mapred.job.shuffle.merge.percent -- 0.66 6. mapred.inmem.merge.threshold -- 1000 7. I have nearly 5000 mappers which are supposed to produce LZO compressed outputs. The logs seem to indicate that the map outputs range between 0.3G to 0.8GB. Does anything here seem amiss? I'd appreciate any input of what settings to try. I can try different reduced values for the input buffer percent and the merge percent. Given that the job runs for about 7-8 hours before crashing, I would like to make some informed choices if possible. Thanks. ~ Niranjan.
Re: OOM Error Map output copy.
I am using version 0.20.203. Thanks ~ Niranjan. On Dec 8, 2011, at 9:26 AM, Niranjan Balasubramanian wrote: Devaraj These are indeed the actual settings I copied over from the job.xml. ~ Niranjan. On Dec 8, 2011, at 12:10 AM, Devaraj K wrote: Hi Niranjan, Every thing looks ok as per the info you have given. Can you check in the job.xml file whether these child opts reflecting or any thing else is overwriting this config. 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC and also can you tell me which version of hadoop using? Devaraj K -Original Message- From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] Sent: Thursday, December 08, 2011 12:21 AM To: common-user@hadoop.apache.org Subject: OOM Error Map output copy. All I am encountering the following out-of-memory error during the reduce phase of a large job. Map output copy failure : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMe mory(ReduceTask.java:1669) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutpu t(ReduceTask.java:1529) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput( ReduceTask.java:1378) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceT ask.java:1310) I tried increasing the memory available using mapped.child.java.opts but that only helps a little. The reduce task eventually fails again. Here are some relevant job configuration details: 1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers filter out a small percentage of the input ( less than 1%). 2. I am currently using 12 reducers and I can't increase this count by much to ensure availability of reduce slots for other users. 3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC 4. mapred.job.shuffle.input.buffer.percent -- 0.70 5. mapred.job.shuffle.merge.percent -- 0.66 6. mapred.inmem.merge.threshold -- 1000 7. I have nearly 5000 mappers which are supposed to produce LZO compressed outputs. The logs seem to indicate that the map outputs range between 0.3G to 0.8GB. Does anything here seem amiss? I'd appreciate any input of what settings to try. I can try different reduced values for the input buffer percent and the merge percent. Given that the job runs for about 7-8 hours before crashing, I would like to make some informed choices if possible. Thanks. ~ Niranjan.
Cloudera Free
Does anyone know of a good tutorial for Cloudera Free? I found installation instructions, but there doesn't seem to be information on how to run jobs, etc., once you have it set up. Thanks.
Question about accessing another HDFS
Hi -

We have two namenodes set up at our company, say:

hdfs://A.mycompany.com
hdfs://B.mycompany.com

From the command line, I can do:

hadoop fs -ls hdfs://A.mycompany.com//some-dir

and

hadoop fs -ls hdfs://B.mycompany.com//some-other-dir

I'm now trying to do the same from a Java program that uses the HDFS API. No luck there; I get an exception: "Wrong FS". Any idea what I'm missing in my Java program?

Thanks,
Frank
Re: Question about accessing another HDFS
I'm hoping there is a better answer, but I'm thinking you could load another configuration file (with B.company in it) using Configuration, grab a FileSystem obj with that and then go forward. Seems like some unnecessary overhead though. Thanks, Tom On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com wrote: Hi - We have two namenodes set up at our company, say: hdfs://A.mycompany.com hdfs://B.mycompany.com From the command line, I can do: Hadoop fs –ls hdfs://A.mycompany.com//some-dir And Hadoop fs –ls hdfs://B.mycompany.com//some-other-dir I’m now trying to do the same from a Java program that uses the HDFS API. No luck there. I get an exception: “Wrong FS”. Any idea what I’m missing in my Java program?? Thanks, Frank
Re: Question about accessing another HDFS
Can you show your code here ? What URL protocol are you using ? On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez t...@supertom.com wrote: I'm hoping there is a better answer, but I'm thinking you could load another configuration file (with B.company in it) using Configuration, grab a FileSystem obj with that and then go forward. Seems like some unnecessary overhead though. Thanks, Tom On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com wrote: Hi - We have two namenodes set up at our company, say: hdfs://A.mycompany.com hdfs://B.mycompany.com From the command line, I can do: Hadoop fs –ls hdfs://A.mycompany.com//some-dir And Hadoop fs –ls hdfs://B.mycompany.com//some-other-dir I’m now trying to do the same from a Java program that uses the HDFS API. No luck there. I get an exception: “Wrong FS”. Any idea what I’m missing in my Java program?? Thanks, Frank -- Jay Vyas MMSB/UCHC
Re: Question about accessing another HDFS
> Can you show your code here? What URL protocol are you using?

I guess I'm being very naïve (and relatively new to HDFS). I can't show too much code, but basically, I'd like to do:

Path myPath = new Path("hdfs://A.mycompany.com//some-dir");

where Path is a Hadoop FS path. I think I can take it from there, if that worked... Did you mean that I need to address the namenode with an http:// address?

Thanks!
Frank

On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez t...@supertom.com wrote:

I'm hoping there is a better answer, but I'm thinking you could load another configuration file (with B.company in it) using Configuration, grab a FileSystem obj with that and then go forward. Seems like some unnecessary overhead though.

Thanks,
Tom

On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com wrote:

Hi - We have two namenodes set up at our company, say: hdfs://A.mycompany.com and hdfs://B.mycompany.com. From the command line, I can do "hadoop fs -ls hdfs://A.mycompany.com//some-dir" and "hadoop fs -ls hdfs://B.mycompany.com//some-other-dir". I'm now trying to do the same from a Java program that uses the HDFS API. No luck there; I get an exception: "Wrong FS". Any idea what I'm missing in my Java program?

Thanks, Frank

--
Jay Vyas
MMSB/UCHC
Re: Cloudera Free
Hi Bai,

I'm moving this over to scm-us...@cloudera.org as that's a more appropriate list (common-user bcc'ed). I assume by Cloudera Free you mean Cloudera Manager Free Edition? You should be able to run a job in the same way that you do on any other Hadoop cluster. The only caveat is that you first need to download configuration files for your clients. There's information here on how to do that:

https://ccp.cloudera.com/display/express37/Generating+Client+Configuration+Files

Assuming you put the files from the generated zip file in a directory at $HOME/hadoop-conf, you'd run a job as follows:

hadoop --config $HOME/hadoop-conf jar /usr/lib/hadoop/hadoop-0.20.2-cdh3u2-examples.jar pi 10 1

This runs the example job, which calculates pi.

-Joey

On Thu, Dec 8, 2011 at 4:31 PM, Bai Shen baishen.li...@gmail.com wrote:

Does anyone know of a good tutorial for Cloudera Free? I found installation instructions, but there doesn't seem to be information on how to run jobs, etc., once you have it set up. Thanks.

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
Re: Question about accessing another HDFS
I was confused about this for a while also. I don't have all the details, but I think my question on Stack Overflow might help you. I was playing with different protocols, trying to find a way to programmatically access all data in HDFS.

http://stackoverflow.com/questions/7844458/how-can-i-access-hadoop-via-the-hdfs-protocol-from-java

Jay Vyas
MMSB UCHC

On Dec 8, 2011, at 7:29 PM, Frank Astier fast...@yahoo-inc.com wrote:

> Can you show your code here? What URL protocol are you using?

I guess I'm being very naïve (and relatively new to HDFS). I can't show too much code, but basically, I'd like to do:

Path myPath = new Path("hdfs://A.mycompany.com//some-dir");

where Path is a Hadoop FS path. I think I can take it from there, if that worked... Did you mean that I need to address the namenode with an http:// address?

Thanks!
Frank

On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez t...@supertom.com wrote:

I'm hoping there is a better answer, but I'm thinking you could load another configuration file (with B.company in it) using Configuration, grab a FileSystem obj with that and then go forward. Seems like some unnecessary overhead though.

Thanks,
Tom

On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com wrote:

Hi - We have two namenodes set up at our company, say: hdfs://A.mycompany.com and hdfs://B.mycompany.com. From the command line, I can do "hadoop fs -ls hdfs://A.mycompany.com//some-dir" and "hadoop fs -ls hdfs://B.mycompany.com//some-other-dir". I'm now trying to do the same from a Java program that uses the HDFS API. No luck there; I get an exception: "Wrong FS". Any idea what I'm missing in my Java program?

Thanks, Frank

--
Jay Vyas
MMSB/UCHC
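(Not part of the thread: a minimal sketch of one way around the "Wrong FS" exception. FileSystem.get(Configuration) returns the filesystem named by fs.default.name, and a Path pointing at a different namenode then fails the filesystem's path check; asking for the FileSystem that matches the URI avoids that. The hostnames below are the hypothetical ones from Frank's example.)

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListOtherCluster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Get a FileSystem handle for the cluster named in the URI,
        // rather than the default one from fs.default.name.
        FileSystem fsA = FileSystem.get(URI.create("hdfs://A.mycompany.com"), conf);
        FileSystem fsB = FileSystem.get(URI.create("hdfs://B.mycompany.com"), conf);

        for (FileStatus status : fsA.listStatus(new Path("/some-dir"))) {
            System.out.println(status.getPath());
        }
        for (FileStatus status : fsB.listStatus(new Path("/some-other-dir"))) {
            System.out.println(status.getPath());
        }
    }
}
```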
Regarding Parallel Iron's claim
Hi, Does anyone know of any discussion in Apache Hadoop regarding the claim by Parallel Iron that their patent is infringed by the use of HDFS? Thanks in advance. Regards, JS
Re: Regarding Parallel Iron's claim
Isn't that old news? http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/ Googling around, doesn't seem anything happened after that. J-D On Thu, Dec 8, 2011 at 6:52 PM, JS Jang jsja...@gmail.com wrote: Hi, Does anyone know any discussion in Apache Hadoop regarding the claim by Parrallel Iron with their patent against use of HDFS? Thanks in advance. Regards, JS
how to integrate snappy into hadoop 0.20.205.0 (apache release)
Hi all, Can anyone tell me how to integrate snappy into hadoop 0.20.205.0 (the Apache release, not the Cloudera version)? Thanks!
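(Not an answer from the list; a hedged sketch only. It assumes a Snappy codec class — org.apache.hadoop.io.compress.SnappyCodec, e.g. from the hadoop-snappy project — and its native library have already been built and placed on the cluster's classpath and java.library.path; the build steps themselves are not covered here. With that in place, enabling it for intermediate map output with the old mapred API looks roughly like this.)

```java
import org.apache.hadoop.mapred.JobConf;

public class SnappyMapOutput {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SnappyMapOutput.class);

        // Register the codec with the compression framework (assumed class name).
        conf.set("io.compression.codecs",
                 "org.apache.hadoop.io.compress.DefaultCodec,"
               + "org.apache.hadoop.io.compress.GzipCodec,"
               + "org.apache.hadoop.io.compress.SnappyCodec");

        // Compress intermediate map output with Snappy.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");

        // ... configure mapper/reducer and input/output paths, then submit the job.
    }
}
```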
Re: Regarding Parallel Iron's claim
I appreciate your help, J-D. Yes, I wondered whether there was any update since or previous discussion within Apache Hadoop as I am new in this mailing list. On 12/9/11 12:19 PM, Jean-Daniel Cryans wrote: Isn't that old news? http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/ Googling around, doesn't seem anything happened after that. J-D On Thu, Dec 8, 2011 at 6:52 PM, JS Jangjsja...@gmail.com wrote: Hi, Does anyone know any discussion in Apache Hadoop regarding the claim by Parrallel Iron with their patent against use of HDFS? Thanks in advance. Regards, JS -- 장정식 / jsj...@gruter.com (주)그루터, RD팀 수석 www.gruter.com Cloud, Search and Social
Re: Regarding Parallel Iron's claim
You could just look at the archives: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/ It is also indexed by all search engines. J-D On Thu, Dec 8, 2011 at 7:44 PM, JS Jang jsja...@gmail.com wrote: I appreciate your help, J-D. Yes, I wondered whether there was any update since or previous discussion within Apache Hadoop as I am new in this mailing list. On 12/9/11 12:19 PM, Jean-Daniel Cryans wrote: Isn't that old news? http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/ Googling around, doesn't seem anything happened after that. J-D On Thu, Dec 8, 2011 at 6:52 PM, JS Jangjsja...@gmail.com wrote: Hi, Does anyone know any discussion in Apache Hadoop regarding the claim by Parrallel Iron with their patent against use of HDFS? Thanks in advance. Regards, JS -- 장정식 / jsj...@gruter.com (주)그루터, RD팀 수석 www.gruter.com Cloud, Search and Social
Re: Regarding Parallel Iron's claim
Got it. Thanks again, J-D. On 12/9/11 12:54 PM, Jean-Daniel Cryans wrote: You could just look at the archives: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/ It is also indexed by all search engines. J-D On Thu, Dec 8, 2011 at 7:44 PM, JS Jang jsja...@gmail.com wrote: I appreciate your help, J-D. Yes, I wondered whether there was any update since or previous discussion within Apache Hadoop as I am new in this mailing list. On 12/9/11 12:19 PM, Jean-Daniel Cryans wrote: Isn't that old news? http://www.dbms2.com/2011/06/10/patent-nonsense-parallel-ironhdfs-edition/ Googling around, doesn't seem anything happened after that. J-D On Thu, Dec 8, 2011 at 6:52 PM, JS Jangjsja...@gmail.com wrote: Hi, Does anyone know any discussion in Apache Hadoop regarding the claim by Parrallel Iron with their patent against use of HDFS? Thanks in advance. Regards, JS -- 장정식 / jsj...@gruter.com (주)그루터, RD팀 수석 www.gruter.com Cloud, Search and Social -- 장정식 / jsj...@gruter.com (주)그루터, RD팀 수석 www.gruter.com Cloud, Search and Social
Re: Routing and region deletes
Ahhh, stupid me. I probably just want to use different tables for different days/months. I believe tables can be deleted fairly quickly in HBase?

Regards, Per Steffensen
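(Not from the thread: a minimal sketch of the table-per-period idea, using the old HBaseAdmin client API and a hypothetical table-name scheme such as "records_201112". Disabling and dropping a whole table removes all of its regions and files at once, which is the effect Per is after.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DropMonthlyTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical naming scheme: one table per month, e.g. "records_201112".
        String expiredTable = "records_" + args[0];

        if (admin.tableExists(expiredTable)) {
            // A table must be disabled before it can be deleted.
            admin.disableTable(expiredTable);
            admin.deleteTable(expiredTable);
        }
    }
}
```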
Re: Not able to post a job in Hadoop 0.23.0
Moving to mapreduce-user@, bcc common-user@. Can you see any errors in the logs? Typically this happens when you have no NodeManagers. Check the 'nodes' link and then RM logs. Arun On Nov 29, 2011, at 8:36 PM, Nitin Khandelwal wrote: HI , I have successfully setup Hadoop 0.23.0 in a single m/c. When i post a job, it gets posted successfully (i can see the job in UI), but the job is never ASSIGNED and waits forever. Here are details of what i see for that Job in UI Name: random-writer State: ACCEPTED FinalStatus: UNDEFINED Started: 30-Nov-2011 10:08:55 Elapsed: 49sec Tracking URL: UNASSIGNEDhttp://192.168.0.93:8900/cluster/app/application_1322627869620_0001# Diagnostics: AM container logs: AM not yet registered with RM Cluster ID: 1322627869620 ResourceManager state: STARTED ResourceManager started on: 30-Nov-2011 10:07:49 ResourceManager version: 0.23.0 from 722cd694fc4ab6d040c0a34f9fb5b476e330ee60 by hortonmu source checksum 4975bf112aa7faa5673f604045ced798 on Thu Nov 3 09:07:31 UTC 2011 Hadoop version: 0.23.0 from d4fee83ec1462ab9824add6449320617caa7c605 by hortonmu source checksum 4e42b2d96c899a98a8ab8c7cc23f27ae on Thu Nov 3 08:59:12 UTC 2011 Can some one tell where am i going wrong?? Thanks, -- Nitin Khandelwal
Choosing IO intensive and CPU intensive workloads
Hi guys!

I want to see the behavior of a single node of a Hadoop cluster when an IO-intensive workload, a CPU-intensive workload, or a mix of both is submitted to that node alone. These workloads must stress the node. I see that the TestDFSIO benchmark is good for an IO-intensive workload.

1. Which benchmarks do I need to use for this?
2. What amount of input data will be fair enough for seeing the behavior under these workloads for each type of box, if I have boxes with: B1: 4 GB RAM, dual core, 150-250 GB disk; B2: 1 GB RAM, 50-80 GB disk?

Arun
Re: Not able to post a job in Hadoop 0.23.0
Hi Arun,

Thanks for your reply. There is one NodeManager running; the following is from the NodeManager UI:

Rack: /default-rack
Node State: RUNNING
Node Address: germinait93:50033
Node HTTP Address: germinait93:
Health-status: Healthy
Last health-update: 9-Dec-2011 13:03:33
Health-report: Healthy
Containers: 0
Mem Used: 0 KB
Mem Avail: 1 GB

Also, I get to see only the following logs relevant to the job posting:

2011-12-09 13:10:57,300 INFO fifo.FifoScheduler (FifoScheduler.java:addApplication(288)) - Application Submission: application_1323416004722_0002 from minal.kothari, currently active: 1
2011-12-09 13:10:57,300 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(464)) - Processing event for appattempt_1323416004722_0002_01 of type APP_ACCEPTED
2011-12-09 13:10:57,317 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(476)) - appattempt_1323416004722_0002_01 State change from SUBMITTED to SCHEDULED
2011-12-09 13:10:57,318 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(416)) - Processing event for application_1323416004722_0002 of type APP_ACCEPTED
2011-12-09 13:10:57,318 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(428)) - application_1323416004722_0002 State change from SUBMITTED to ACCEPTED
2011-12-09 13:10:57,320 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(140)) - USER=minal.kothari IP=192.168.0.93 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1323416004722_0002

Please let me know if you need some other logs.

Thanks,
Nitin

On 9 December 2011 12:44, Arun C Murthy a...@hortonworks.com wrote:

Moving to mapreduce-user@, bcc common-user@.

Can you see any errors in the logs? Typically this happens when you have no NodeManagers. Check the 'nodes' link and then RM logs.

Arun

On Nov 29, 2011, at 8:36 PM, Nitin Khandelwal wrote:

HI, I have successfully setup Hadoop 0.23.0 in a single m/c. When i post a job, it gets posted successfully (i can see the job in UI), but the job is never ASSIGNED and waits forever. Here are details of what i see for that Job in UI:

Name: random-writer
State: ACCEPTED
FinalStatus: UNDEFINED
Started: 30-Nov-2011 10:08:55
Elapsed: 49sec
Tracking URL: UNASSIGNED http://192.168.0.93:8900/cluster/app/application_1322627869620_0001#
Diagnostics: AM container logs: AM not yet registered with RM
Cluster ID: 1322627869620
ResourceManager state: STARTED
ResourceManager started on: 30-Nov-2011 10:07:49
ResourceManager version: 0.23.0 from 722cd694fc4ab6d040c0a34f9fb5b476e330ee60 by hortonmu source checksum 4975bf112aa7faa5673f604045ced798 on Thu Nov 3 09:07:31 UTC 2011
Hadoop version: 0.23.0 from d4fee83ec1462ab9824add6449320617caa7c605 by hortonmu source checksum 4e42b2d96c899a98a8ab8c7cc23f27ae on Thu Nov 3 08:59:12 UTC 2011

Can some one tell where am i going wrong??

Thanks,
--
Nitin Khandelwal

--
Nitin Khandelwal
Re: Not able to post a job in Hadoop 0.23.0
CC : mapreduce-user On 9 December 2011 13:14, Nitin Khandelwal nitin.khandel...@germinait.comwrote: Hi Arun, Thanks for your reply. There is one NodeManager running ; Following is the NodeManager UI : Rack Node State Node Address Node HTTP Address Health-status Last health-update Health-report Containers Mem Used Mem Avail /default-rack RUNNING germinait93:50033 germinait93: Healthy 9-Dec-2011 13:03:33 Healthy 0 0 KB 1 GB Also, I get to see only following Logs relevant to the job posting : 2011-12-09 13:10:57,300 INFO fifo.FifoScheduler (FifoScheduler.java: addApplication(288)) - Application Submission: application_1323416004722_0002 from minal.kothari, currently active: 1 2011-12-09 13:10:57,300 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(464)) - Processing event for appattempt_1323416004722_0002_01 of type APP_ACCEPTED 2011-12-09 13:10:57,317 INFO attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(476)) - appattempt_1323416004722_0002_01 State change from SUBMITTED to SCHEDULED 2011-12-09 13:10:57,318 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(416)) - Processing event for application_1323416004722_0002 of type APP_ACCEPTED 2011-12-09 13:10:57,318 INFO rmapp.RMAppImpl (RMAppImpl.java:handle(428)) - application_1323416004722_0002 State change from SUBMITTED to ACCEPTED 2011-12-09 13:10:57,320 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(140)) - USER=minal.kothari IP=192.168.0.93 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1323416004722_0002 Please let me know if you need some other logs . Thanks, Nitin On 9 December 2011 12:44, Arun C Murthy a...@hortonworks.com wrote: Moving to mapreduce-user@, bcc common-user@. Can you see any errors in the logs? Typically this happens when you have no NodeManagers. Check the 'nodes' link and then RM logs. Arun On Nov 29, 2011, at 8:36 PM, Nitin Khandelwal wrote: HI , I have successfully setup Hadoop 0.23.0 in a single m/c. When i post a job, it gets posted successfully (i can see the job in UI), but the job is never ASSIGNED and waits forever. Here are details of what i see for that Job in UI Name: random-writer State: ACCEPTED FinalStatus: UNDEFINED Started: 30-Nov-2011 10:08:55 Elapsed: 49sec Tracking URL: UNASSIGNED http://192.168.0.93:8900/cluster/app/application_1322627869620_0001# Diagnostics: AM container logs: AM not yet registered with RM Cluster ID: 1322627869620 ResourceManager state: STARTED ResourceManager started on: 30-Nov-2011 10:07:49 ResourceManager version: 0.23.0 from 722cd694fc4ab6d040c0a34f9fb5b476e330ee60 by hortonmu source checksum 4975bf112aa7faa5673f604045ced798 on Thu Nov 3 09:07:31 UTC 2011 Hadoop version: 0.23.0 from d4fee83ec1462ab9824add6449320617caa7c605 by hortonmu source checksum 4e42b2d96c899a98a8ab8c7cc23f27ae on Thu Nov 3 08:59:12 UTC 2011 Can some one tell where am i going wrong?? Thanks, -- Nitin Khandelwal -- Nitin Khandelwal -- Nitin Khandelwal
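(An aside, not from the thread: one common cause of an application sitting in ACCEPTED on 0.23 is that the single NodeManager's available memory — 1 GB above — is smaller than the default MapReduce ApplicationMaster container request, which is around 1536 MB. The property names below are assumed from the 0.23 YARN/MRv2 configuration; raising yarn.nodemanager.resource.memory-mb in yarn-site.xml is the cluster-side fix, and shrinking the AM request per job is sketched here.)

```java
import org.apache.hadoop.conf.Configuration;

public class SmallAmRequest {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Assumed MRv2 property: ask for a smaller ApplicationMaster container
        // so it can fit on a NodeManager that only offers 1 GB.
        conf.setInt("yarn.app.mapreduce.am.resource.mb", 1024);

        // Cluster-side alternative (set in yarn-site.xml, noted here only as a comment):
        //   yarn.nodemanager.resource.memory-mb = 4096   (or whatever the node can spare)

        // ... build and submit the job with this Configuration.
    }
}
```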
Re: Choosing IO intensive and CPU intensive workloads
Hi Arun,

Michael has written up a good tutorial about this, including stress testing and IO:

http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

- Alex

On Fri, Dec 9, 2011 at 8:24 AM, ArunKumar arunk...@gmail.com wrote:

Hi guys! I want to see the behavior of a single node of a Hadoop cluster when an IO-intensive workload, a CPU-intensive workload, or a mix of both is submitted to that node alone. These workloads must stress the node. I see that the TestDFSIO benchmark is good for an IO-intensive workload. 1. Which benchmarks do I need to use for this? 2. What amount of input data will be fair enough for seeing the behavior under these workloads for each type of box, if I have boxes with: B1: 4 GB RAM, dual core, 150-250 GB disk; B2: 1 GB RAM, 50-80 GB disk? Arun

--
Alexander Lorenz
http://mapredit.blogspot.com

Think of the environment: please don't print this email unless you really need to.