Re: Pragmatic cluster backup strategies?
Will hadoop fs -rm -rf move everything to the /trash directory or will it delete that as well? I was thinking along the lines of what you suggest, keep the original source of the data somewhere and then reprocess it all in the event of a problem. What do other people do? Do you run another cluster? Do you back up specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem. --Bobby Evans On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10-machine cluster with 40TB of storage; obviously as this gets full, actually trying to create an offsite backup becomes a problem unless we build another 10-machine cluster (too expensive right now). Not sure if it will help, but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have copies of all the blocks with a replication factor of 3? Apart from the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other people's experiences/opinions are for backing up cluster data? Thanks Darrell.
Re: Pragmatic cluster backup strategies?
Hi, you could set fs.trash.interval to the number of minutes after which you consider rm'd data lost forever. The data will be moved into .Trash and deleted after the configured time. A second way would be to mount HDFS via FUSE (fuse-dfs) and back up your data over that mount into a storage tier. That is not the best solution, but a usable one. cheers, Alex -- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF On May 30, 2012, at 8:31 AM, Darrell Taylor wrote: Will hadoop fs -rm -rf move everything to the /trash directory or will it delete that as well? I was thinking along the lines of what you suggest, keep the original source of the data somewhere and then reprocess it all in the event of a problem. What do other people do? Do you run another cluster? Do you back up specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem. --Bobby Evans On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10-machine cluster with 40TB of storage; obviously as this gets full, actually trying to create an offsite backup becomes a problem unless we build another 10-machine cluster (too expensive right now). Not sure if it will help, but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have copies of all the blocks with a replication factor of 3? Apart from the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other people's experiences/opinions are for backing up cluster data? Thanks Darrell.
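For reference, trash is controlled by fs.trash.interval in core-site.xml. A minimal sketch of the property; the 1440-minute value is only an illustrative choice, and the default of 0 disables trash entirely:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Minutes after which files moved to .Trash are permanently deleted; 0 disables trash.</description>
</property>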
RE: Best Practices for Upgrading Hadoop Version?
Hi, I did this upgrade on a similar cluster some weeks ago. I used the following method (all commands run as the owner of the Hadoop daemon processes):
* Stop the cluster.
* Start only HDFS with: start-dfs.sh -upgrade
* At this point the migration has started.
* You can check the status with: hadoop dfsadmin -upgradeProgress status
* Now you can access files for reading.
* If you find any issue you can roll back the migration with: start-dfs.sh -rollback
* If everything seems OK you can mark the upgrade as finalized: hadoop dfsadmin -finalizeUpgrade
-Original Message- From: Eli Finkelshteyn [mailto:iefin...@gmail.com] Sent: Tuesday, May 29, 2012 20:29 To: common-user@hadoop.apache.org Subject: Best Practices for Upgrading Hadoop Version? Hi, I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at wiki.apache.org http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated or missing information. Namely, from what I've noticed so far, it doesn't specify what user any of the commands should be run as. Since I'm sure this is something a lot of people have needed to do, is there a better tutorial somewhere for upgrading the Hadoop version in general? Eli
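As a shell recap of that sequence (the 'hadoop' user name is an assumption, and the rollback and finalize steps are alternatives, not commands to run back to back):

# run everything as the user that owns the Hadoop daemons (assumed here to be 'hadoop')
stop-all.sh                              # stop MapReduce and HDFS on the old version
start-dfs.sh -upgrade                    # bring HDFS up with the new binaries in upgrade mode
hadoop dfsadmin -upgradeProgress status  # repeat until the upgrade is reported as complete
# verify by reading a few known files before going further
start-dfs.sh -rollback                   # only if something went wrong: revert to the previous layout
hadoop dfsadmin -finalizeUpgrade         # otherwise: make the upgrade permanent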
Hadoop BI Usergroup Stuttgart (Germany)
For our German-speaking folks, we want to start a Hadoop-BI user group in Stuttgart (Germany). If you are interested, please visit our LinkedIn Group (http://www.linkedin.com/groups/Hadoop-Germany-4325443) and our Doodle poll (http://www.doodle.com/aqwsg4snbwimrsfc). If we see real interest we will call for sponsors and speakers later. Focus areas (translated from the German): - Integration of Hadoop / HDFS-based solutions into existing infrastructures - Export of data from relational databases into NoSQL / analytics clusters (HBase, Hive) - Statistical analysis (Mahout) - ISO-compliant approaches to backup and recovery, and HA, using open source -- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF
Re: Best Practices for Upgrading Hadoop Version?
Michael Noll has a good description of the upgrade process here: http://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster/ It may not quite reflect the versions of Hadoop you plan to upgrade, but it has some good pointers. Chris On 30 May 2012 09:12, ramon@accenture.com wrote: Hi, I did this upgrade on a similar cluster some weeks ago. I used the following method (all commands run as the owner of the Hadoop daemon processes): * Stop the cluster. * Start only HDFS with: start-dfs.sh -upgrade * At this point the migration has started. * You can check the status with: hadoop dfsadmin -upgradeProgress status * Now you can access files for reading. * If you find any issue you can roll back the migration with: start-dfs.sh -rollback * If everything seems OK you can mark the upgrade as finalized: hadoop dfsadmin -finalizeUpgrade -Original Message- From: Eli Finkelshteyn [mailto:iefin...@gmail.com] Sent: Tuesday, May 29, 2012 20:29 To: common-user@hadoop.apache.org Subject: Best Practices for Upgrading Hadoop Version? Hi, I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at wiki.apache.org http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated or missing information. Namely, from what I've noticed so far, it doesn't specify what user any of the commands should be run as. Since I'm sure this is something a lot of people have needed to do, is there a better tutorial somewhere for upgrading the Hadoop version in general? Eli
Re: different input/output formats
PFA. On Wed, May 30, 2012 at 2:45 AM, Mark question markq2...@gmail.com wrote: Hi Samir, can you email me your main class.. or if you can check mine, it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf(Usage:bin/hadoop jar norm1.jar inputDir outputDir\n); ToolRunner.printGenericCommandUsage(System.err); return -1; } JobConf conf = new JobConf(new Configuration(),SortByNorm1.class); conf.setJobName(SortDocByNorm1); conf.setMapperClass(Norm1Mapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setReducerClass(Norm1Reducer.class); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new SortByNorm1(), args); System.exit(exitCode); } On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark See the out put for that same Application . I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); myMapper class is: public class myMapper extends MapReduceBase implements MapperLongWritable,Text,FloatWritable,Text { public void map(LongWritable offset, Text val,OutputCollectorFloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); } } But I get the following error: 12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.Use Where is the writing of LongWritable 
coming from ?? Thank you, Mark
Re: How to mapreduce in the scenario
Yes, Hadoop is intended for computation over huge datasets; it may not be a good fit for small datasets. On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote: Hi, Mike, Nitin, Devaraj, Soumya, samir, Robert Thank you all for your suggestions. Actually, I want to know if hadoop has any performance advantage over a conventional database for solving this kind of problem (joining data). Best Regards, Gump On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee soumya.sbaner...@gmail.com wrote: Hi, You can also try to use the Hadoop reduce-side join functionality. Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base Map and Reduce classes to do the same. Regards, Soumya. On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote: Hi Gump, MapReduce fits well for solving these types of (join) problems. I hope this will help you to solve the described problem: 1. Map output key and value classes: write a map output key class (Text.class) and value class (CombinedValue.class). Here the value class should be able to hold the values from both files (a.txt and b.txt) as shown below. class CombinedValue implements WritableComparable { String name; int age; String address; boolean isLeft; // flag to identify which file the record came from } 2. Mapper: write a map() function which can parse records from both files (a.txt, b.txt) and produce a common output key and value class. 3. Partitioner: write the partitioner in such a way that it will send all the (key, value) pairs with the same key to the same reducer. 4. Reducer: in the reduce() function, you will receive the records from both files and you can combine them easily. Thanks Devaraj From: liuzhg [liu...@cernet.com] Sent: Tuesday, May 29, 2012 3:45 PM To: common-user@hadoop.apache.org Subject: How to mapreduce in the scenario Hi, I wonder if Hadoop can effectively solve the following problem: == input files: a.txt, b.txt result: c.txt a.txt: id1,name1,age1,... id2,name2,age2,... id3,name3,age3,... id4,name4,age4,... b.txt: id1,address1,... id2,address2,... id3,address3,... c.txt: id1,name1,age1,address1,... id2,name2,age2,address2,... I know that it can be done well by a database. But I want to handle it with hadoop if possible. Can hadoop meet the requirement? Any suggestion can help me. Thank you very much! Best Regards, Gump
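A minimal, untested old-API sketch of Devaraj's steps 1-4, simplified by tagging each record with its source file instead of a custom CombinedValue writable; the class names, the tag scheme, and the assumption that the input file names start with "a"/"b" are all illustrative, and the default HashPartitioner already satisfies step 3 because it partitions on the join key:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits (id, "A\t" + rest) for a.txt records and (id, "B\t" + rest) for b.txt records.
class JoinByIdMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line, OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split(",", 2);                 // id, rest-of-record
    String file = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    String tag = file.startsWith("a") ? "A" : "B";                  // which input file this split came from
    out.collect(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
  }
}

// Joins the a.txt record for an id with every b.txt record sharing that id.
class JoinByIdReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text id, Iterator<Text> values, OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String left = null;
    List<String> rights = new ArrayList<String>();
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("A\t")) left = v.substring(2); else rights.add(v.substring(2));
    }
    if (left != null) {
      for (String right : rights) {
        out.collect(id, new Text(left + "," + right));              // id -> name,age,...,address,...
      }
    }
  }
}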
Re: Pragmatic cluster backup strategies?
I am not an expert on the trash so you probably want to verify everything I am about to say. I believe that trash acts oddly when you try to use it to delete a trash directory. Quotas can potentially get off when doing this, but I think it still deletes the directory. Trash is a nice feature, but I wouldn't trust it as a true backup. I just don't think it is mature enough for something like that. There are enough issues with quotas that sadly most of our users almost always add -skipTrash all the time. Where I work we do a combination of several different things depending on the project and their requirements. In some cases where there are government regulations involved we do regular tape backups. In other cases we keep the original data around for some time and can re-import it to HDFS if necessary. In other cases we will copy the data, to multiple Hadoop clusters. This is usually for the case where we want to do Hot/Warm failover between clusters. Now we may be different from most other users because we do run lots of different projects on lots of different clusters. --Bobby Evans On 5/30/12 1:31 AM, Darrell Taylor darrell.tay...@gmail.com wrote: Will hadoop fs -rm -rf move everything to the the /trash directory or will it delete that as well? I was thinking along the lines of what you suggest, keep the original source of the data somewhere and then reprocess it all in the event of a problem. What do other people do? Do you run another cluster? Do you backup specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem. --Bobby Evans On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, That's not a back up strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10 machine cluster with 40Tb of storage, obviously as this gets full actually trying to create an offsite backup becomes a problem unless we build another 10 machine cluster (too expensive right now). Not sure if it will help but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have the copies of all the blocks with a replication factor of 3? Apart form the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other peoples experiences/opinions are for backing up cluster data? Thanks Darrell.
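For the copy-to-another-cluster option Bobby mentions, the usual tool is distcp, which runs as a MapReduce job on one of the clusters. A minimal sketch with made-up NameNode hostnames and paths:

hadoop distcp hdfs://prod-nn:8020/data/clicks hdfs://backup-nn:8020/data/clicks
# across different Hadoop versions, run distcp on the destination cluster and read the source over HFTP:
hadoop distcp hftp://prod-nn:50070/data/clicks hdfs://backup-nn:8020/data/clicks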
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
Please send you conf file contents and host file contents too.. On Tue, May 29, 2012 at 11:08 PM, Harsh J ha...@cloudera.com wrote: Rohit, The SNN may start and run infinitely without doing any work. The NN and DN have probably not started cause the NN has an issue (perhaps NN name directory isn't formatted) and the DN can't find the NN (or has data directory issues as well). So this isn't a glitch but a real issue you'll have to take a look at your logs for. On Sun, May 27, 2012 at 10:51 PM, Rohit Pandey rohitpandey...@gmail.com wrote: Hello Hadoop community, I have been trying to set up a double node Hadoop cluster (following the instructions in - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ ) and am very close to running it apart from one small glitch - when I start the dfs (using start-dfs.sh), it says: 10.63.88.53: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out 10.63.88.109: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out 10.63.88.109: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out 10.63.88.109: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out 10.63.88.53: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out which looks like it's been successful in starting all the nodes. However, when I check them out by running 'jps', this is what I see: 27531 SecondaryNameNode 27879 Jps As you can see, there is no datanode and name node. I have been racking my brains at this for quite a while now. Checked all the inputs and every thing. Any one know what the problem might be? -- Thanks in advance, Rohit -- Harsh J -- ∞ Shashwat Shriparv
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
In your log details I could not find the NN starting, so the problem is with the NN itself; Harsh also suggested the same. On Sun, May 27, 2012 at 10:51 PM, Rohit Pandey rohitpandey...@gmail.com wrote: Hello Hadoop community, I have been trying to set up a double node Hadoop cluster (following the instructions in - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ ) and am very close to running it apart from one small glitch - when I start the dfs (using start-dfs.sh), it says: 10.63.88.53: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out 10.63.88.109: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out 10.63.88.109: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out 10.63.88.109: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out 10.63.88.53: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out which looks like it's been successful in starting all the nodes. However, when I check them out by running 'jps', this is what I see: 27531 SecondaryNameNode 27879 Jps As you can see, there is no datanode and name node. I have been racking my brains at this for quite a while now. Checked all the inputs and everything. Any one know what the problem might be? -- Thanks in advance, Rohit
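A few commands that may help narrow it down (the log directory is inferred from the .out paths quoted above, and note that formatting erases HDFS metadata, so it is only appropriate on a fresh, empty cluster):

# look for the real error in the NameNode log on the master (10.63.88.109)
less /usr/local/hadoop/logs/hadoop-pandro51-namenode-*.log
# if this is a brand new cluster and the NN name directory was never formatted:
hadoop namenode -format
# then restart HDFS and re-check which daemons are running
start-dfs.sh
jps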
RE: How to mapreduce in the scenario
If I may, I'd like to ask about that statement a little more. I think most of us agree that hadoop handles very large datasets (10s of TB and up) exceptionally well for several reasons. And I've heard multiple times that hadoop does not handle small datasets well and that traditional tools like RDBMS and ETL are better suited for the small datasets. But what if I have a mixture of data? I work with datasets that range from 1GB to 10TB in size, and the work requires all that data to be grouped and aggregated. I would think that in such an environment, where you have vast differences in the size of datasets, it would be better to keep them all in hadoop and do all the work there, versus moving the small datasets out of hadoop to do some processing on them, then loading them back into hadoop to group with the larger datasets, then possibly taking them back out to do more processing and then back in again. I just don't see where the run times for jobs on small files in hadoop would be so long that it wouldn't be offset by moving things back and forth. Or is the performance on small files in hadoop really that bad? Thoughts? -Original Message- From: samir das mohapatra [mailto:samir.help...@gmail.com] Sent: Wednesday, May 30, 2012 8:33 AM To: common-user@hadoop.apache.org Subject: Re: How to mapreduce in the scenario Yes, Hadoop is intended for computation over huge datasets; it may not be a good fit for small datasets. On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote: Hi, Mike, Nitin, Devaraj, Soumya, samir, Robert Thank you all for your suggestions. Actually, I want to know if hadoop has any performance advantage over a conventional database for solving this kind of problem (joining data). Best Regards, Gump On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee soumya.sbaner...@gmail.com wrote: Hi, You can also try to use the Hadoop reduce-side join functionality. Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base Map and Reduce classes to do the same. Regards, Soumya. On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote: Hi Gump, MapReduce fits well for solving these types of (join) problems. I hope this will help you to solve the described problem: 1. Map output key and value classes: write a map output key class (Text.class) and value class (CombinedValue.class). Here the value class should be able to hold the values from both files (a.txt and b.txt) as shown below. class CombinedValue implements WritableComparable { String name; int age; String address; boolean isLeft; // flag to identify which file the record came from } 2. Mapper: write a map() function which can parse records from both files (a.txt, b.txt) and produce a common output key and value class. 3. Partitioner: write the partitioner in such a way that it will send all the (key, value) pairs with the same key to the same reducer. 4. Reducer: in the reduce() function, you will receive the records from both files and you can combine them easily. Thanks Devaraj From: liuzhg [liu...@cernet.com] Sent: Tuesday, May 29, 2012 3:45 PM To: common-user@hadoop.apache.org Subject: How to mapreduce in the scenario Hi, I wonder if Hadoop can effectively solve the following problem: == input files: a.txt, b.txt result: c.txt a.txt: id1,name1,age1,... id2,name2,age2,... id3,name3,age3,... id4,name4,age4,... b.txt: id1,address1,... id2,address2,... id3,address3,... c.txt: id1,name1,age1,address1,... id2,name2,age2,address2,... I know that it can be done well by a database. But I want to handle it with hadoop if possible. 
Can hadoop meet the requirement? Any suggestion can help me. Thank you very much! Best Regards, Gump
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
I found the problem but I am unable to solve it: I need to apply a filter for the _SUCCESS file while using the FileSystem.listStatus method. Can someone please guide me on how to filter out _SUCCESS files? Thanks On Tue, May 29, 2012 at 1:42 PM, waqas latif waqas...@gmail.com wrote: So my question is: do hadoop 0.20 and 1.0.3 differ in their support for writing or reading sequence files? The same code works fine with hadoop 0.20, but the problem occurs when I run it under hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works with hadoop 0.20, even with 100x100 (and even bigger) matrices, but when it comes to hadoop 1.0.3 there is a problem even with a 3x3 matrix. On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote: I have seen this issue with large file writes using SequenceFile writer. Not found the same issue when testing with writing fairly small files (< 1GB). On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote: Hi, If you are using a custom writable object while passing data from the mapper to the reducer, make sure that the readFields and the write methods handle the same number of variables. It might be possible that you wrote data to a file using a custom writable but later modified the custom writable (like adding a new attribute to the writable) which the old data doesn't have. It might be a possibility, so please check once. On Friday, May 25, 2012, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapReduce and I was trying to run a matrix multiplication example presented by Mr. Norstadt under the following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but when I try to run it with hadoop 1.0.3 I get the following error. Is it a problem with my hadoop configuration or is it a compatibility problem in the code, which was written for hadoop 0.20 by the author? Also please guide me on how I can fix this error in either case. Here is the error I am getting: Exception in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
*Step-wise details (Ubuntu 10.x): go through these properly and run them one by one and it will solve your problem (you can change the path, IP, and host name as you like).*
1. Start the terminal.
2. Disable ipv6 on all machines: pico /etc/sysctl.conf
3. Add these lines at the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
4. Reboot the system: sudo reboot
5. Install java: sudo apt-get install openjdk-6-jdk openjdk-6-jre
6. Check if ssh is installed; if not, do so: sudo apt-get install openssh-server openssh-client
7. Create a group and user called hadoop: sudo addgroup hadoop ; sudo adduser --ingroup hadoop hadoop
8. Assign all the permissions to the hadoop user: sudo visudo and add the following line in the file: hadoop ALL=(ALL) ALL
9. Check that the hadoop user has ssh set up: su hadoop ; ssh-keygen -t rsa -P "" (press Enter when asked) ; cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys ; ssh localhost. Copy the server's RSA public key from the server into the authorized_keys file on all nodes as shown in the step above.
10. Make the hadoop installation directory: sudo mkdir /usr/local/hadoop
11. Download and install hadoop: cd /usr/local/hadoop ; sudo wget -c http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u2.tar.gz
12. Unzip the tar: sudo tar -zxvf /usr/local/hadoop/hadoop-0.20.2-cdh3u2.tar.gz
13. Change permissions on the hadoop folder by granting it all to hadoop: sudo chown -R hadoop:hadoop /usr/local/hadoop ; sudo chmod 750 -R /usr/local/hadoop
14. Create the HDFS directory (inside the /usr/local/hadoop folder): sudo mkdir hadoop-datastore ; sudo mkdir hadoop-datastore/hadoop-hadoop
15. Add the binaries path and hadoop home to the environment file: sudo pico /etc/environment (set the bin path as well as the hadoop home path) ; source /etc/environment
16. Configure the hadoop-env.sh file: cd /usr/local/hadoop/hadoop-0.20.2-cdh3u3/ ; sudo pico conf/hadoop-env.sh and add the following lines in there:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
17. Configure core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://IP of namenode:54310</value>
<description>Location of the Namenode</description>
</property>
</configuration>
18. Configure hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.</description>
</property>
</configuration>
19. Configure mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>IP of job tracker:54311</value>
<description>Host and port of the jobtracker.</description>
</property>
</configuration>
20. Add all the IP addresses to the conf/slaves file: sudo pico /usr/local/hadoop/hadoop-0.20.2-cdh3u2/conf/slaves and add the list of IP addresses that will host data nodes in this file.
*Hadoop commands: now restart the hadoop cluster*
start-all.sh / stop-all.sh
start-dfs.sh / stop-dfs.sh
start-mapred.sh / stop-mapred.sh
hadoop dfs -ls <virtual dfs path>
hadoop dfs -copyFromLocal <local path> <dfs path>
Re: Writing click stream data to hadoop
On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. Thanks Harsh, Does flume also provides API on top. I am getting this data as http call, how would I go about using flume with http calls? On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
Re: Writing click stream data to hadoop
I cc'd flume-u...@incubator.apache.org because I don't know if Mohit subscribed there. Mohit, you could use Avro to serialize the data and send them to a Flume Avro source. Or you could syslog - both are supported in Flume 1.x. http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/FlumeUserGuide.html An exec-source is also possible, please note, flume will only start / use the command you configured and didn't take control over the whole process. - Alex -- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF On May 30, 2012, at 4:56 PM, Mohit Anchlia wrote: On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. Thanks Harsh, Does flume also provides API on top. I am getting this data as http call, how would I go about using flume with http calls? On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
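To make the "how would I go about using flume with http calls" part concrete: a common pattern is to have the web tier send each event to a Flume Avro source (via the Avro client API or the flume-ng avro-client tool) and let an HDFS sink write the events out. A minimal flume-ng agent configuration sketch; the agent/source/sink names, port, and HDFS path are all made up, and the exact property set should be checked against the user guide linked above:

# flume.conf -- one agent: avro source -> memory channel -> HDFS sink
agent.sources = clicks
agent.channels = ch1
agent.sinks = hdfs1

agent.sources.clicks.type = avro
agent.sources.clicks.bind = 0.0.0.0
agent.sources.clicks.port = 41414
agent.sources.clicks.channels = ch1

# the memory channel is the simplest; a durable channel is preferable if the Flume version supports it
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/clicks
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.rollSize = 134217728
agent.sinks.hdfs1.channel = ch1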
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
When your code does a listStatus, you can pass a PathFilter object along that can do this filtering for you. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.PathFilter) for the API javadocs on that. On Wed, May 30, 2012 at 7:46 PM, waqas latif waqas...@gmail.com wrote: I got the problem with I am unable to solve it. I need to apply a filter for _SUCCESS file while using FileSystem.listStatus method. Can someone please guide me how to filter _SUCCESS files. Thanks On Tue, May 29, 2012 at 1:42 PM, waqas latif waqas...@gmail.com wrote: So my question is that do hadoop 0.20 and 1.0.3 differ in their support of writing or reading sequencefiles? same code works fine with hadoop 0.20 but problem occurs when run it under hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works with hadoop 0.20. even with 100 x100(and even bigger matrices) but when it comes to hadoop 1.0.3 then even there is a problem with 3x3 matrix. On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote: I have seen this issue with large file writes using SequenceFile writer. Not found the same issue when testing with writing fairly small files ( 1GB). On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.comwrote: Hi, If you are using a custom writable object while passing data from the mapper to the reducer make sure that the read fields and the write has the same number of variables. It might be possible that you wrote datavtova file using custom writable but later modified the custom writable (like adding new attribute to the writable) which the old data doesn't have. It might be a possibility is please check once On Friday, May 25, 2012, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapR and I was trying to run a matrix multiplication example presented by Mr. Norstadt under following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but I tried to run it with hadoop 1.0.3 but I am getting following error. Is it the problem with my hadoop configuration or it is compatibility problem in the code which was written in hadoop 0.20 by author.Also please guide me that how can I fix this error in either case. Here is the error I am getting. 
in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas -- Harsh J
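A minimal sketch of Harsh's suggestion, assuming the goal is simply to skip _SUCCESS (and other underscore-prefixed entries such as _logs) when listing a job's output directory; fs and outputDir stand in for whatever FileSystem and path the surrounding code already has:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// skip _SUCCESS, _logs and any other entries whose names start with an underscore
PathFilter hiddenFileFilter = new PathFilter() {
  public boolean accept(Path p) {
    return !p.getName().startsWith("_");
  }
};
FileStatus[] parts = fs.listStatus(new Path(outputDir), hiddenFileFilter);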
Re: Job/Task Log timestamp questions
Have you figured out why these timestamps are different? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Job-Task-Log-timestamp-questions-tp2123486p3986779.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
Thanks Harsh. I got it running. On Wed, May 30, 2012 at 5:58 PM, Harsh J ha...@cloudera.com wrote: When your code does a listStatus, you can pass a PathFilter object along that can do this filtering for you. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.PathFilter) for the API javadocs on that. On Wed, May 30, 2012 at 7:46 PM, waqas latif waqas...@gmail.com wrote: I got the problem with I am unable to solve it. I need to apply a filter for _SUCCESS file while using FileSystem.listStatus method. Can someone please guide me how to filter _SUCCESS files. Thanks On Tue, May 29, 2012 at 1:42 PM, waqas latif waqas...@gmail.com wrote: So my question is that do hadoop 0.20 and 1.0.3 differ in their support of writing or reading sequencefiles? same code works fine with hadoop 0.20 but problem occurs when run it under hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works with hadoop 0.20. even with 100 x100(and even bigger matrices) but when it comes to hadoop 1.0.3 then even there is a problem with 3x3 matrix. On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote: I have seen this issue with large file writes using SequenceFile writer. Not found the same issue when testing with writing fairly small files ( 1GB). On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.comwrote: Hi, If you are using a custom writable object while passing data from the mapper to the reducer make sure that the read fields and the write has the same number of variables. It might be possible that you wrote datavtova file using custom writable but later modified the custom writable (like adding new attribute to the writable) which the old data doesn't have. It might be a possibility is please check once On Friday, May 25, 2012, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapR and I was trying to run a matrix multiplication example presented by Mr. Norstadt under following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but I tried to run it with hadoop 1.0.3 but I am getting following error. Is it the problem with my hadoop configuration or it is compatibility problem in the code which was written in hadoop 0.20 by author.Also please guide me that how can I fix this error in either case. Here is the error I am getting. 
in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas -- Harsh J
Re: different input/output formats
Hi, I think the attachment will not get through to common-user@hadoop.apache.org, so please have a look below.

MAP

package test;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class myMapper extends MapReduceBase implements Mapper<LongWritable, Text, FloatWritable, Text> {
  public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new FloatWritable(1), val);
  }
}

REDUCER -- prepare whatever reducer you actually need.

JOB

package test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestDemo extends Configured implements Tool {

  public static void main(String args[]) throws Exception {
    int res = ToolRunner.run(new Configuration(), new TestDemo(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(TestDemo.class);
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    conf.setJobName("TestCustomInputOutput");
    conf.setMapperClass(myMapper.class);
    conf.setMapOutputKeyClass(FloatWritable.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(FloatWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    TextInputFormat.addInputPath(conf, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }
}

On Wed, May 30, 2012 at 6:57 PM, samir das mohapatra samir.help...@gmail.com wrote: PFA. On Wed, May 30, 2012 at 2:45 AM, Mark question markq2...@gmail.com wrote: Hi Samir, can you email me your main class.. 
or if you can check mine, it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf(Usage:bin/hadoop jar norm1.jar inputDir outputDir\n); ToolRunner.printGenericCommandUsage(System.err); return -1; } JobConf conf = new JobConf(new Configuration(),SortByNorm1.class); conf.setJobName(SortDocByNorm1); conf.setMapperClass(Norm1Mapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setReducerClass(Norm1Reducer.class); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new SortByNorm1(), args); System.exit(exitCode); } On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark See the out put for that same Application . I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0);
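One note on the "wrong key class" error earlier in this thread: with conf.setNumReduceTasks(0) the reduce phase (and any configured reducer class) never runs, and each map writes its records straight through the configured OutputFormat, so SequenceFileOutputFormat checks every collected key and value against setOutputKeyClass/setOutputValueClass. If the map collects the LongWritable input offset instead of a FloatWritable, it fails with exactly "LongWritable is not class FloatWritable". The fix is to emit the declared key type from the map, along these lines (norm1 is a stand-in for whatever float key the job actually computes):

output.collect(new FloatWritable(norm1), val);  // not output.collect(offset, val)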
Re: Writing click stream data to hadoop
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205), which calls the underlying FSDataOutputStream#sync which is actually hflush semantically (data not durable in case of data center wide power outage). hsync implementation is not yet in 2.0. HDFS-744 just brought hsync in trunk. __Luke On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
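A minimal branch-1 sketch of the periodic-sync approach discussed above; the path, key/value types, and the 1000-record flush interval are arbitrary illustrative choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("/clicks/current.seq"), LongWritable.class, Text.class);
long count = 0;
// for each incoming click event (event handling itself omitted):
writer.append(new LongWritable(System.currentTimeMillis()), new Text("click-record"));
if (++count % 1000 == 0) {
  writer.syncFs();  // flush to the DataNodes so a writer crash loses at most the last ~1000 records
}
// once the file reaches the target size:
writer.close();     // then open a new file and continue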
Re: Writing click stream data to hadoop
Thanks for correcting me there on the syncFs call Luke. I seemed to have missed that method when searching branch-1 code. On Thu, May 31, 2012 at 6:54 AM, Luke Lu l...@apache.org wrote: SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205), which calls the underlying FSDataOutputStream#sync which is actually hflush semantically (data not durable in case of data center wide power outage). hsync implementation is not yet in 2.0. HDFS-744 just brought hsync in trunk. __Luke On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J -- Harsh J
Re: Dynamic Priority Scheduler
Hello cldo, If you can be clearer on your scheduling/workload needs we can suggest proper configuration/scheduler implementations for your cluster. On Thu, May 31, 2012 at 9:49 AM, cldo cldo datk...@gmail.com wrote: I am using hadoop version 1.0 and the capacity scheduler, but it is not effective. How do I use the Dynamic Priority Scheduler? -- Harsh J