Re: Pragmatic cluster backup strategies?
Will hadoop fs -rm -rf move everything to the /trash directory or will it delete that as well? I was thinking along the lines of what you suggest, keep the original source of the data somewhere and then reprocess it all in the event of a problem. What do other people do? Do you run another cluster? Do you back up specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem. --Bobby Evans On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10-machine cluster with 40TB of storage; obviously as this gets full, actually trying to create an offsite backup becomes a problem unless we build another 10-machine cluster (too expensive right now). Not sure if it will help, but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have copies of all the blocks with a replication factor of 3? Apart from the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other people's experiences/opinions are for backing up cluster data? Thanks Darrell.
Re: Pragmatic cluster backup strategies?
Hi, you could set fs.trash.interval to the number of minutes after which you consider rm'd data lost forever. The data will be moved into .Trash and deleted after the configured time. A second way would be to mount HDFS via FUSE (fuse-dfs) and back up your data over that mount into a storage tier. That is not the best solution, but a usable one. cheers, Alex -- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF On May 30, 2012, at 8:31 AM, Darrell Taylor wrote: Will hadoop fs -rm -rf move everything to the /trash directory or will it delete that as well? I was thinking along the lines of what you suggest, keep the original source of the data somewhere and then reprocess it all in the event of a problem. What do other people do? Do you run another cluster? Do you back up specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem. --Bobby Evans On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10-machine cluster with 40TB of storage; obviously as this gets full, actually trying to create an offsite backup becomes a problem unless we build another 10-machine cluster (too expensive right now). Not sure if it will help, but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have copies of all the blocks with a replication factor of 3? Apart from the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other people's experiences/opinions are for backing up cluster data? Thanks Darrell.
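For reference, trash is controlled by fs.trash.interval in core-site.xml. A minimal sketch of the property; the 1440-minute value is only an illustrative choice, and the default of 0 disables trash entirely:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Minutes after which files moved to .Trash are permanently deleted; 0 disables trash.</description>
</property>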
RE: Best Practices for Upgrading Hadoop Version?
Hi, I did this upgrade on a similar cluster some weeks ago. I used the following method (all commands run as the owner of the Hadoop daemon processes):
* Stop the cluster.
* Start only HDFS with: start-dfs.sh -upgrade
* At this point the migration has started.
* You can check the status with: hadoop dfsadmin -upgradeProgress status
* Now you can access files for reading.
* If you find any issue you can roll back the migration with: start-dfs.sh -rollback
* If everything seems OK you can mark the upgrade as finalized: hadoop dfsadmin -finalizeUpgrade
-Original Message- From: Eli Finkelshteyn [mailto:iefin...@gmail.com] Sent: Tuesday, May 29, 2012 20:29 To: common-user@hadoop.apache.org Subject: Best Practices for Upgrading Hadoop Version? Hi, I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at wiki.apache.org http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated or missing information. Namely, from what I've noticed so far, it doesn't specify what user any of the commands should be run as. Since I'm sure this is something a lot of people have needed to do, is there a better tutorial somewhere for upgrading the Hadoop version in general? Eli
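As a shell recap of that sequence (the 'hadoop' user name is an assumption, and the rollback and finalize steps are alternatives, not commands to run back to back):

# run everything as the user that owns the Hadoop daemons (assumed here to be 'hadoop')
stop-all.sh                              # stop MapReduce and HDFS on the old version
start-dfs.sh -upgrade                    # bring HDFS up with the new binaries in upgrade mode
hadoop dfsadmin -upgradeProgress status  # repeat until the upgrade is reported as complete
# verify by reading a few known files before going further
start-dfs.sh -rollback                   # only if something went wrong: revert to the previous layout
hadoop dfsadmin -finalizeUpgrade         # otherwise: make the upgrade permanent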
Hadoop BI Usergroup Stuttgart (Germany)
For our German-speaking folks, we want to start a Hadoop-BI user group in Stuttgart (Germany). If you are interested, please visit our LinkedIn Group (http://www.linkedin.com/groups/Hadoop-Germany-4325443) and our Doodle poll (http://www.doodle.com/aqwsg4snbwimrsfc). If we see real interest we will call for sponsors and speakers later. Focus areas (translated from the German): - Integration of Hadoop / HDFS-based solutions into existing infrastructures - Export of data from relational databases into NoSQL / analytics clusters (HBase, Hive) - Statistical analysis (Mahout) - ISO-compliant approaches to backup and recovery, and HA, using open source -- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF
Re: Best Practices for Upgrading Hadoop Version?
Michael Noll has a good description of the upgrade process here: http://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster/ It may not quite reflect the versions of Hadoop you plan to upgrade, but it has some good pointers. Chris On 30 May 2012 09:12, ramon@accenture.com wrote: Hi, I did this upgrade on a similar cluster some weeks ago. I used the following method (all commands run as the owner of the Hadoop daemon processes): * Stop the cluster. * Start only HDFS with: start-dfs.sh -upgrade * At this point the migration has started. * You can check the status with: hadoop dfsadmin -upgradeProgress status * Now you can access files for reading. * If you find any issue you can roll back the migration with: start-dfs.sh -rollback * If everything seems OK you can mark the upgrade as finalized: hadoop dfsadmin -finalizeUpgrade -Original Message- From: Eli Finkelshteyn [mailto:iefin...@gmail.com] Sent: Tuesday, May 29, 2012 20:29 To: common-user@hadoop.apache.org Subject: Best Practices for Upgrading Hadoop Version? Hi, I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at wiki.apache.org http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated or missing information. Namely, from what I've noticed so far, it doesn't specify what user any of the commands should be run as. Since I'm sure this is something a lot of people have needed to do, is there a better tutorial somewhere for upgrading the Hadoop version in general? Eli
Re: different input/output formats
PFA. On Wed, May 30, 2012 at 2:45 AM, Mark question markq2...@gmail.com wrote: Hi Samir, can you email me your main class.. or if you can check mine, it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf(Usage:bin/hadoop jar norm1.jar inputDir outputDir\n); ToolRunner.printGenericCommandUsage(System.err); return -1; } JobConf conf = new JobConf(new Configuration(),SortByNorm1.class); conf.setJobName(SortDocByNorm1); conf.setMapperClass(Norm1Mapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setReducerClass(Norm1Reducer.class); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new SortByNorm1(), args); System.exit(exitCode); } On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark See the out put for that same Application . I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); myMapper class is: public class myMapper extends MapReduceBase implements MapperLongWritable,Text,FloatWritable,Text { public void map(LongWritable offset, Text val,OutputCollectorFloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); } } But I get the following error: 12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.Use Where is the writing of LongWritable 
coming from ?? Thank you, Mark
Re: How to mapreduce in the scenario
Yes, Hadoop is intended for computation over huge datasets; it may not be a good fit for small datasets. On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote: Hi, Mike, Nitin, Devaraj, Soumya, samir, Robert Thank you all for your suggestions. Actually, I want to know if hadoop has any performance advantage over a conventional database for solving this kind of problem (joining data). Best Regards, Gump On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee soumya.sbaner...@gmail.com wrote: Hi, You can also try to use the Hadoop reduce-side join functionality. Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base Map and Reduce classes to do the same. Regards, Soumya. On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote: Hi Gump, MapReduce fits well for solving these types of (join) problems. I hope this will help you to solve the described problem: 1. Map output key and value classes: write a map output key class (Text.class) and value class (CombinedValue.class). Here the value class should be able to hold the values from both files (a.txt and b.txt) as shown below. class CombinedValue implements WritableComparable { String name; int age; String address; boolean isLeft; // flag to identify which file the record came from } 2. Mapper: write a map() function which can parse records from both files (a.txt, b.txt) and produce a common output key and value class. 3. Partitioner: write the partitioner in such a way that it will send all the (key, value) pairs with the same key to the same reducer. 4. Reducer: in the reduce() function, you will receive the records from both files and you can combine them easily. Thanks Devaraj From: liuzhg [liu...@cernet.com] Sent: Tuesday, May 29, 2012 3:45 PM To: common-user@hadoop.apache.org Subject: How to mapreduce in the scenario Hi, I wonder if Hadoop can effectively solve the following problem: == input files: a.txt, b.txt result: c.txt a.txt: id1,name1,age1,... id2,name2,age2,... id3,name3,age3,... id4,name4,age4,... b.txt: id1,address1,... id2,address2,... id3,address3,... c.txt: id1,name1,age1,address1,... id2,name2,age2,address2,... I know that it can be done well by a database. But I want to handle it with hadoop if possible. Can hadoop meet the requirement? Any suggestion can help me. Thank you very much! Best Regards, Gump
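A minimal, untested old-API sketch of Devaraj's steps 1-4, simplified by tagging each record with its source file instead of a custom CombinedValue writable; the class names, the tag scheme, and the assumption that the input file names start with "a"/"b" are all illustrative, and the default HashPartitioner already satisfies step 3 because it partitions on the join key:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits (id, "A\t" + rest) for a.txt records and (id, "B\t" + rest) for b.txt records.
class JoinByIdMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line, OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split(",", 2);                 // id, rest-of-record
    String file = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    String tag = file.startsWith("a") ? "A" : "B";                  // which input file this split came from
    out.collect(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
  }
}

// Joins the a.txt record for an id with every b.txt record sharing that id.
class JoinByIdReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text id, Iterator<Text> values, OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String left = null;
    List<String> rights = new ArrayList<String>();
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("A\t")) left = v.substring(2); else rights.add(v.substring(2));
    }
    if (left != null) {
      for (String right : rights) {
        out.collect(id, new Text(left + "," + right));              // id -> name,age,...,address,...
      }
    }
  }
}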
Re: Pragmatic cluster backup strategies?
I am not an expert on the trash so you probably want to verify everything I am about to say. I believe that trash acts oddly when you try to use it to delete a trash directory. Quotas can potentially get off when doing this, but I think it still deletes the directory. Trash is a nice feature, but I wouldn't trust it as a true backup. I just don't think it is mature enough for something like that. There are enough issues with quotas that sadly most of our users almost always add -skipTrash all the time. Where I work we do a combination of several different things depending on the project and their requirements. In some cases where there are government regulations involved we do regular tape backups. In other cases we keep the original data around for some time and can re-import it to HDFS if necessary. In other cases we will copy the data, to multiple Hadoop clusters. This is usually for the case where we want to do Hot/Warm failover between clusters. Now we may be different from most other users because we do run lots of different projects on lots of different clusters. --Bobby Evans On 5/30/12 1:31 AM, Darrell Taylor darrell.tay...@gmail.com wrote: Will hadoop fs -rm -rf move everything to the the /trash directory or will it delete that as well? I was thinking along the lines of what you suggest, keep the original source of the data somewhere and then reprocess it all in the event of a problem. What do other people do? Do you run another cluster? Do you backup specific parts of the cluster? Some form of offsite SAN? On Tue, May 29, 2012 at 6:02 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem. --Bobby Evans On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, That's not a back up strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10 machine cluster with 40Tb of storage, obviously as this gets full actually trying to create an offsite backup becomes a problem unless we build another 10 machine cluster (too expensive right now). Not sure if it will help but we have planned the cabinet into an upper and lower half with separate redundant power, then we plan to put half of the cluster in the top, half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have the copies of all the blocks with a replication factor of 3? Apart form the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough? I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other peoples experiences/opinions are for backing up cluster data? Thanks Darrell.
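For the copy-to-another-cluster option Bobby mentions, the usual tool is distcp, which runs as a MapReduce job on one of the clusters. A minimal sketch with made-up NameNode hostnames and paths:

hadoop distcp hdfs://prod-nn:8020/data/clicks hdfs://backup-nn:8020/data/clicks
# across different Hadoop versions, run distcp on the destination cluster and read the source over HFTP:
hadoop distcp hftp://prod-nn:50070/data/clicks hdfs://backup-nn:8020/data/clicks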
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
Please send you conf file contents and host file contents too.. On Tue, May 29, 2012 at 11:08 PM, Harsh J ha...@cloudera.com wrote: Rohit, The SNN may start and run infinitely without doing any work. The NN and DN have probably not started cause the NN has an issue (perhaps NN name directory isn't formatted) and the DN can't find the NN (or has data directory issues as well). So this isn't a glitch but a real issue you'll have to take a look at your logs for. On Sun, May 27, 2012 at 10:51 PM, Rohit Pandey rohitpandey...@gmail.com wrote: Hello Hadoop community, I have been trying to set up a double node Hadoop cluster (following the instructions in - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ ) and am very close to running it apart from one small glitch - when I start the dfs (using start-dfs.sh), it says: 10.63.88.53: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out 10.63.88.109: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out 10.63.88.109: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out 10.63.88.109: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out 10.63.88.53: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out which looks like it's been successful in starting all the nodes. However, when I check them out by running 'jps', this is what I see: 27531 SecondaryNameNode 27879 Jps As you can see, there is no datanode and name node. I have been racking my brains at this for quite a while now. Checked all the inputs and every thing. Any one know what the problem might be? -- Thanks in advance, Rohit -- Harsh J -- ∞ Shashwat Shriparv
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
In your log details I could not find the NN starting, so the problem is with the NN itself; Harsh also suggested the same. On Sun, May 27, 2012 at 10:51 PM, Rohit Pandey rohitpandey...@gmail.com wrote: Hello Hadoop community, I have been trying to set up a double node Hadoop cluster (following the instructions in - http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ ) and am very close to running it apart from one small glitch - when I start the dfs (using start-dfs.sh), it says: 10.63.88.53: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out 10.63.88.109: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out 10.63.88.109: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out 10.63.88.109: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out 10.63.88.53: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out which looks like it's been successful in starting all the nodes. However, when I check them out by running 'jps', this is what I see: 27531 SecondaryNameNode 27879 Jps As you can see, there is no datanode and name node. I have been racking my brains at this for quite a while now. Checked all the inputs and everything. Any one know what the problem might be? -- Thanks in advance, Rohit
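A few commands that may help narrow it down (the log directory is inferred from the .out paths quoted above, and note that formatting erases HDFS metadata, so it is only appropriate on a fresh, empty cluster):

# look for the real error in the NameNode log on the master (10.63.88.109)
less /usr/local/hadoop/logs/hadoop-pandro51-namenode-*.log
# if this is a brand new cluster and the NN name directory was never formatted:
hadoop namenode -format
# then restart HDFS and re-check which daemons are running
start-dfs.sh
jps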
RE: How to mapreduce in the scenario
If I may, I'd like to ask about that statement a little more. I think most of us agree that hadoop handles very large datasets (10s of TB and up) exceptionally well for several reasons. And I've heard multiple times that hadoop does not handle small datasets well and that traditional tools like RDBMS and ETL are better suited for the small datasets. But what if I have a mixture of data? I work with datasets that range from 1GB to 10TB in size, and the work requires all that data to be grouped and aggregated. I would think that in such an environment, where you have vast differences in the size of datasets, it would be better to keep them all in hadoop and do all the work there, versus moving the small datasets out of hadoop to do some processing on them, then loading them back into hadoop to group with the larger datasets, then possibly taking them back out to do more processing and then back in again. I just don't see where the run times for jobs on small files in hadoop would be so long that it wouldn't be offset by moving things back and forth. Or is the performance on small files in hadoop really that bad? Thoughts? -Original Message- From: samir das mohapatra [mailto:samir.help...@gmail.com] Sent: Wednesday, May 30, 2012 8:33 AM To: common-user@hadoop.apache.org Subject: Re: How to mapreduce in the scenario Yes, Hadoop is intended for computation over huge datasets; it may not be a good fit for small datasets. On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote: Hi, Mike, Nitin, Devaraj, Soumya, samir, Robert Thank you all for your suggestions. Actually, I want to know if hadoop has any performance advantage over a conventional database for solving this kind of problem (joining data). Best Regards, Gump On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee soumya.sbaner...@gmail.com wrote: Hi, You can also try to use the Hadoop reduce-side join functionality. Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base Map and Reduce classes to do the same. Regards, Soumya. On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote: Hi Gump, MapReduce fits well for solving these types of (join) problems. I hope this will help you to solve the described problem: 1. Map output key and value classes: write a map output key class (Text.class) and value class (CombinedValue.class). Here the value class should be able to hold the values from both files (a.txt and b.txt) as shown below. class CombinedValue implements WritableComparable { String name; int age; String address; boolean isLeft; // flag to identify which file the record came from } 2. Mapper: write a map() function which can parse records from both files (a.txt, b.txt) and produce a common output key and value class. 3. Partitioner: write the partitioner in such a way that it will send all the (key, value) pairs with the same key to the same reducer. 4. Reducer: in the reduce() function, you will receive the records from both files and you can combine them easily. Thanks Devaraj From: liuzhg [liu...@cernet.com] Sent: Tuesday, May 29, 2012 3:45 PM To: common-user@hadoop.apache.org Subject: How to mapreduce in the scenario Hi, I wonder if Hadoop can effectively solve the following problem: == input files: a.txt, b.txt result: c.txt a.txt: id1,name1,age1,... id2,name2,age2,... id3,name3,age3,... id4,name4,age4,... b.txt: id1,address1,... id2,address2,... id3,address3,... c.txt: id1,name1,age1,address1,... id2,name2,age2,address2,... I know that it can be done well by a database. But I want to handle it with hadoop if possible. 
Can hadoop meet the requirement? Any suggestion can help me. Thank you very much! Best Regards, Gump
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
I found the problem but I am unable to solve it: I need to apply a filter for the _SUCCESS file while using the FileSystem.listStatus method. Can someone please guide me on how to filter out _SUCCESS files? Thanks On Tue, May 29, 2012 at 1:42 PM, waqas latif waqas...@gmail.com wrote: So my question is: do hadoop 0.20 and 1.0.3 differ in their support for writing or reading sequence files? The same code works fine with hadoop 0.20, but the problem occurs when I run it under hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works with hadoop 0.20, even with 100x100 (and even bigger) matrices, but when it comes to hadoop 1.0.3 there is a problem even with a 3x3 matrix. On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote: I have seen this issue with large file writes using SequenceFile writer. Not found the same issue when testing with writing fairly small files (< 1GB). On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote: Hi, If you are using a custom writable object while passing data from the mapper to the reducer, make sure that the readFields and the write methods handle the same number of variables. It might be possible that you wrote data to a file using a custom writable but later modified the custom writable (like adding a new attribute to the writable) which the old data doesn't have. It might be a possibility, so please check once. On Friday, May 25, 2012, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapReduce and I was trying to run a matrix multiplication example presented by Mr. Norstadt under the following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but when I try to run it with hadoop 1.0.3 I get the following error. Is it a problem with my hadoop configuration or is it a compatibility problem in the code, which was written for hadoop 0.20 by the author? Also please guide me on how I can fix this error in either case. Here is the error I am getting: Exception in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
*Step-wise details (Ubuntu 10.x): go through these properly and run them one by one and it will solve your problem (you can change the path, IP, and host name as you like).*
1. Start the terminal.
2. Disable ipv6 on all machines: pico /etc/sysctl.conf
3. Add these lines at the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
4. Reboot the system: sudo reboot
5. Install java: sudo apt-get install openjdk-6-jdk openjdk-6-jre
6. Check if ssh is installed; if not, do so: sudo apt-get install openssh-server openssh-client
7. Create a group and user called hadoop: sudo addgroup hadoop ; sudo adduser --ingroup hadoop hadoop
8. Assign all the permissions to the hadoop user: sudo visudo and add the following line in the file: hadoop ALL=(ALL) ALL
9. Check that the hadoop user has ssh set up: su hadoop ; ssh-keygen -t rsa -P "" (press Enter when asked) ; cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys ; ssh localhost. Copy the server's RSA public key from the server into the authorized_keys file on all nodes as shown in the step above.
10. Make the hadoop installation directory: sudo mkdir /usr/local/hadoop
11. Download and install hadoop: cd /usr/local/hadoop ; sudo wget -c http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u2.tar.gz
12. Unzip the tar: sudo tar -zxvf /usr/local/hadoop/hadoop-0.20.2-cdh3u2.tar.gz
13. Change permissions on the hadoop folder by granting it all to hadoop: sudo chown -R hadoop:hadoop /usr/local/hadoop ; sudo chmod 750 -R /usr/local/hadoop
14. Create the HDFS directory (inside the /usr/local/hadoop folder): sudo mkdir hadoop-datastore ; sudo mkdir hadoop-datastore/hadoop-hadoop
15. Add the binaries path and hadoop home to the environment file: sudo pico /etc/environment (set the bin path as well as the hadoop home path) ; source /etc/environment
16. Configure the hadoop-env.sh file: cd /usr/local/hadoop/hadoop-0.20.2-cdh3u3/ ; sudo pico conf/hadoop-env.sh and add the following lines in there:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
17. Configure core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://IP of namenode:54310</value>
<description>Location of the Namenode</description>
</property>
</configuration>
18. Configure hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.</description>
</property>
</configuration>
19. Configure mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>IP of job tracker:54311</value>
<description>Host and port of the jobtracker.</description>
</property>
</configuration>
20. Add all the IP addresses to the conf/slaves file: sudo pico /usr/local/hadoop/hadoop-0.20.2-cdh3u2/conf/slaves and add the list of IP addresses that will host data nodes in this file.
*Hadoop commands: now restart the hadoop cluster*
start-all.sh / stop-all.sh
start-dfs.sh / stop-dfs.sh
start-mapred.sh / stop-mapred.sh
hadoop dfs -ls <virtual dfs path>
hadoop dfs -copyFromLocal <local path> <dfs path>
Re: Writing click stream data to hadoop
On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. Thanks Harsh, Does flume also provides API on top. I am getting this data as http call, how would I go about using flume with http calls? On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
Re: Writing click stream data to hadoop
I cc'd flume-u...@incubator.apache.org because I don't know if Mohit subscribed there. Mohit, you could use Avro to serialize the data and send them to a Flume Avro source. Or you could syslog - both are supported in Flume 1.x. http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/FlumeUserGuide.html An exec-source is also possible, please note, flume will only start / use the command you configured and didn't take control over the whole process. - Alex -- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF On May 30, 2012, at 4:56 PM, Mohit Anchlia wrote: On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. Thanks Harsh, Does flume also provides API on top. I am getting this data as http call, how would I go about using flume with http calls? On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
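To make the "how would I go about using flume with http calls" part concrete: a common pattern is to have the web tier send each event to a Flume Avro source (via the Avro client API or the flume-ng avro-client tool) and let an HDFS sink write the events out. A minimal flume-ng agent configuration sketch; the agent/source/sink names, port, and HDFS path are all made up, and the exact property set should be checked against the user guide linked above:

# flume.conf -- one agent: avro source -> memory channel -> HDFS sink
agent.sources = clicks
agent.channels = ch1
agent.sinks = hdfs1

agent.sources.clicks.type = avro
agent.sources.clicks.bind = 0.0.0.0
agent.sources.clicks.port = 41414
agent.sources.clicks.channels = ch1

# the memory channel is the simplest; a durable channel is preferable if the Flume version supports it
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/clicks
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.rollSize = 134217728
agent.sinks.hdfs1.channel = ch1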
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
When your code does a listStatus, you can pass a PathFilter object along that can do this filtering for you. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.PathFilter) for the API javadocs on that. On Wed, May 30, 2012 at 7:46 PM, waqas latif waqas...@gmail.com wrote: I got the problem with I am unable to solve it. I need to apply a filter for _SUCCESS file while using FileSystem.listStatus method. Can someone please guide me how to filter _SUCCESS files. Thanks On Tue, May 29, 2012 at 1:42 PM, waqas latif waqas...@gmail.com wrote: So my question is that do hadoop 0.20 and 1.0.3 differ in their support of writing or reading sequencefiles? same code works fine with hadoop 0.20 but problem occurs when run it under hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works with hadoop 0.20. even with 100 x100(and even bigger matrices) but when it comes to hadoop 1.0.3 then even there is a problem with 3x3 matrix. On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote: I have seen this issue with large file writes using SequenceFile writer. Not found the same issue when testing with writing fairly small files ( 1GB). On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.comwrote: Hi, If you are using a custom writable object while passing data from the mapper to the reducer make sure that the read fields and the write has the same number of variables. It might be possible that you wrote datavtova file using custom writable but later modified the custom writable (like adding new attribute to the writable) which the old data doesn't have. It might be a possibility is please check once On Friday, May 25, 2012, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapR and I was trying to run a matrix multiplication example presented by Mr. Norstadt under following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but I tried to run it with hadoop 1.0.3 but I am getting following error. Is it the problem with my hadoop configuration or it is compatibility problem in the code which was written in hadoop 0.20 by author.Also please guide me that how can I fix this error in either case. Here is the error I am getting. 
in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas -- Harsh J
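A minimal sketch of Harsh's suggestion, assuming the goal is simply to skip _SUCCESS (and other underscore-prefixed entries such as _logs) when listing a job's output directory; fs and outputDir stand in for whatever FileSystem and path the surrounding code already has:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// skip _SUCCESS, _logs and any other entries whose names start with an underscore
PathFilter hiddenFileFilter = new PathFilter() {
  public boolean accept(Path p) {
    return !p.getName().startsWith("_");
  }
};
FileStatus[] parts = fs.listStatus(new Path(outputDir), hiddenFileFilter);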
Re: Job/Task Log timestamp questions
Have you figured out why these timestamps are different? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Job-Task-Log-timestamp-questions-tp2123486p3986779.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
Thanks Harsh. I got it running. On Wed, May 30, 2012 at 5:58 PM, Harsh J ha...@cloudera.com wrote: When your code does a listStatus, you can pass a PathFilter object along that can do this filtering for you. See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#listStatus(org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.PathFilter) for the API javadocs on that. On Wed, May 30, 2012 at 7:46 PM, waqas latif waqas...@gmail.com wrote: I got the problem with I am unable to solve it. I need to apply a filter for _SUCCESS file while using FileSystem.listStatus method. Can someone please guide me how to filter _SUCCESS files. Thanks On Tue, May 29, 2012 at 1:42 PM, waqas latif waqas...@gmail.com wrote: So my question is that do hadoop 0.20 and 1.0.3 differ in their support of writing or reading sequencefiles? same code works fine with hadoop 0.20 but problem occurs when run it under hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works with hadoop 0.20. even with 100 x100(and even bigger matrices) but when it comes to hadoop 1.0.3 then even there is a problem with 3x3 matrix. On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote: I have seen this issue with large file writes using SequenceFile writer. Not found the same issue when testing with writing fairly small files ( 1GB). On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.comwrote: Hi, If you are using a custom writable object while passing data from the mapper to the reducer make sure that the read fields and the write has the same number of variables. It might be possible that you wrote datavtova file using custom writable but later modified the custom writable (like adding new attribute to the writable) which the old data doesn't have. It might be a possibility is please check once On Friday, May 25, 2012, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapR and I was trying to run a matrix multiplication example presented by Mr. Norstadt under following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but I tried to run it with hadoop 1.0.3 but I am getting following error. Is it the problem with my hadoop configuration or it is compatibility problem in the code which was written in hadoop 0.20 by author.Also please guide me that how can I fix this error in either case. Here is the error I am getting. 
in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas -- Harsh J
Re: different input/output formats
Hi, I think the attachment will not get through to common-user@hadoop.apache.org, so please have a look below.

MAP

package test;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class myMapper extends MapReduceBase implements Mapper<LongWritable, Text, FloatWritable, Text> {
  public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new FloatWritable(1), val);
  }
}

REDUCER -- prepare whatever reducer you actually need.

JOB

package test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestDemo extends Configured implements Tool {

  public static void main(String args[]) throws Exception {
    int res = ToolRunner.run(new Configuration(), new TestDemo(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(TestDemo.class);
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    conf.setJobName("TestCustomInputOutput");
    conf.setMapperClass(myMapper.class);
    conf.setMapOutputKeyClass(FloatWritable.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(FloatWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    TextInputFormat.addInputPath(conf, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }
}

On Wed, May 30, 2012 at 6:57 PM, samir das mohapatra samir.help...@gmail.com wrote: PFA. On Wed, May 30, 2012 at 2:45 AM, Mark question markq2...@gmail.com wrote: Hi Samir, can you email me your main class.. 
or if you can check mine, it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf(Usage:bin/hadoop jar norm1.jar inputDir outputDir\n); ToolRunner.printGenericCommandUsage(System.err); return -1; } JobConf conf = new JobConf(new Configuration(),SortByNorm1.class); conf.setJobName(SortDocByNorm1); conf.setMapperClass(Norm1Mapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setReducerClass(Norm1Reducer.class); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new SortByNorm1(), args); System.exit(exitCode); } On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark See the out put for that same Application . I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0);
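One note on the "wrong key class" error earlier in this thread: with conf.setNumReduceTasks(0) the reduce phase (and any configured reducer class) never runs, and each map writes its records straight through the configured OutputFormat, so SequenceFileOutputFormat checks every collected key and value against setOutputKeyClass/setOutputValueClass. If the map collects the LongWritable input offset instead of a FloatWritable, it fails with exactly "LongWritable is not class FloatWritable". The fix is to emit the declared key type from the map, along these lines (norm1 is a stand-in for whatever float key the job actually computes):

output.collect(new FloatWritable(norm1), val);  // not output.collect(offset, val)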
Re: Writing click stream data to hadoop
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205), which calls the underlying FSDataOutputStream#sync which is actually hflush semantically (data not durable in case of data center wide power outage). hsync implementation is not yet in 2.0. HDFS-744 just brought hsync in trunk. __Luke On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J
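A minimal branch-1 sketch of the periodic-sync approach discussed above; the path, key/value types, and the 1000-record flush interval are arbitrary illustrative choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("/clicks/current.seq"), LongWritable.class, Text.class);
long count = 0;
// for each incoming click event (event handling itself omitted):
writer.append(new LongWritable(System.currentTimeMillis()), new Text("click-record"));
if (++count % 1000 == 0) {
  writer.syncFs();  // flush to the DataNodes so a writer crash loses at most the last ~1000 records
}
// once the file reaches the target size:
writer.close();     // then open a new file and continue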
Re: Writing click stream data to hadoop
Thanks for correcting me there on the syncFs call Luke. I seemed to have missed that method when searching branch-1 code. On Thu, May 31, 2012 at 6:54 AM, Luke Lu l...@apache.org wrote: SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205), which calls the underlying FSDataOutputStream#sync which is actually hflush semantically (data not durable in case of data center wide power outage). hsync implementation is not yet in 2.0. HDFS-744 just brought hsync in trunk. __Luke On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync-API inbuilt in it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (Incubating) [http://incubator.apache.org/flume/] that already does provide these features and the WAL-kinda reliability you seek. On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our hadoop environment. I am wondering if I could open one sequence file and write to it until it's of certain size. Once it's over the specified size I can close that file and open a new one. Is this a good approach? Only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data? -- Harsh J -- Harsh J
Re: Dynamic Priority Scheduler
Hello cldo, If you can be clearer on your scheduling/workload needs we can suggest proper configuration/scheduler implementations for your cluster. On Thu, May 31, 2012 at 9:49 AM, cldo cldo datk...@gmail.com wrote: I am using hadoop version 1.0 and the capacity scheduler, but it is not effective. How do I use the Dynamic Priority Scheduler? -- Harsh J