Re: aggregation by time window

2013-01-28 Thread Kai Voigt
Hi again,

the idea is that you emit every event multiple times. So your map input record 
(event1, 10:07) will be emitted seven times during the map() call. Like I said, 
(10:04,event1), (10:05,event1), ..., (10:10,event1) will be the seven outputs 
for processing a single event.

The output keys are the time stamps (one-minute buckets) each event gets assigned to, so 
that it can meet the events that happened within +/- 3 minutes of it. Two events that 
happened at most 6 minutes apart will be emitted with at least one common time stamp as 
the map() output, and thus meet in the same reduce() call.

A reduce() call will look like this: reduce(10:03, list_of_events). And those 
events had time stamps between 10:00 and 10:06 in the original input.
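
Just to make that concrete, here is a minimal mapper sketch of the idea (it assumes text 
input lines like "event1|10:07", a 3-minute window, simplified parsing, and it ignores 
the midnight wrap-around):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every event once per one-minute bucket within +/- WINDOW minutes,
// so two events at most 2*WINDOW minutes apart share at least one bucket.
public class TimeWindowMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int WINDOW = 3;  // minutes

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\|");      // "event1|10:07"
        String event = fields[0].trim();
        String[] hhmm = fields[1].trim().split(":");
        int minute = Integer.parseInt(hhmm[0]) * 60 + Integer.parseInt(hhmm[1]);

        for (int delta = -WINDOW; delta <= WINDOW; delta++) {
            int bucket = minute + delta;
            String stamp = String.format("%02d:%02d", bucket / 60, bucket % 60);
            context.write(new Text(stamp), new Text(event));
        }
    }
}

reduce() then simply receives all events for one bucket and can aggregate them.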

Kai

On 28.01.2013 at 14:43, Oleg Ruchovets oruchov...@gmail.com wrote:

 Hi Kai.
It is very interesting. Can you please explain in more details your
 Idea?
 What will be a key in a map phase?
 
 Suppose we have event at 10:07. How would you emit this to the multiple
 buckets?
 
 Thanks
 Oleg.
 
 
 On Mon, Jan 28, 2013 at 3:17 PM, Kai Voigt k...@123.org wrote:
 
 Quick idea:
 
 since each of your events will go into several buckets, you could use
 map() to emit each item multiple times for each bucket.
 
 On 28.01.2013 at 13:56, Oleg Ruchovets oruchov...@gmail.com wrote:
 
 Hi ,
   I have such row data structure:
 
 event_id  |   time
 ==
 event1 |  10:07
 event2 |  10:10
 event3 |  10:12
 
 event4 |   10:20
 event5 |   10:23
 event6 |   10:25
 
 map(event1,10:07) would emit (10:04,event1), (10:05,event1), ...,
 (10:10,event1) and so on.
 
 In reduce(), all your desired events would meet for the same minute.
 
 Kai
 
 --
 Kai Voigt
 k...@123.org
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






Re: Transfer large file >50Gb with DistCp from s3 to cluster

2012-09-04 Thread Kai Voigt
Hi,

my guess is that you run hadoop distcp on one of the datanodes... In that 
case, that node will receive the first replica of each block it writes, so it looks much 
busier than the rest. You should still see replicas show up on other nodes as well, but 
that one node will end up holding a replica of all the blocks.

Kai

On 04.09.2012 at 22:07, Soulghost juanfeli...@gmail.com wrote:

 
 Hello guys 
 
 I have a problem using the DistCp to transfer a large file from s3 to HDFS
 cluster, whenever I tried to make the copy, I only saw processing work and
 memory usage in one of the nodes, not in all of them, I don't know if this
 is the proper behaviour of this or if it is a configuration problem. If I
 make the transfer of multiple files each node handles a single file at the
 same time, I understand that this transfer would be in parallel but it
 doesn't seems like that. 
 
 I am using 0.20.2 distribution for hadoop in a two Ec2Instances cluster, I
 was hoping that any of you have an idea of how it works distCp and which
 properties could I tweak to improve the transfer rate that is currently in
 0.7 Gb per minute. 
 
 Regards.
 -- 
 View this message in context: 
 http://old.nabble.com/Transfer-large-file-%3E50Gb-with-DistCp-from-s3-to-cluster-tp34389118p34389118.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 

-- 
Kai Voigt
k...@123.org






Re: Hadoop or HBase

2012-08-28 Thread Kai Voigt
Typically, CMSs require an RDBMS, which Hadoop and HBase are not.

Which CMS do you plan to use, and what's wrong with MySQL or other open source 
RDBMSs?

Kai

On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:

 Hi,
 I wants to use DFS for Content-Management-System (CMS), in 
 that I just wants to store and retrieve files.
 Please suggest me what should I use:
 Hadoop or HBase
  
 Thanks & Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
  
  

-- 
Kai Voigt
k...@123.org






Re: Hadoop or HBase

2012-08-28 Thread Kai Voigt
Having a distributed filesystem doesn't free you from making backups. If 
someone deletes a file in HDFS, it's gone.

What backend storage is supported by your CMS?

Kai

On 28.08.2012 at 08:36, Kushal Agrawal kushalagra...@teledna.com wrote:

 As the data is too much in (10's of terabytes) it's difficult to take backup
 because it takes 1.5 days to take backup of data every time. Instead of that
 if we uses distributed file system we need not to do that.
 
 Thanks & Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
  
 -Original Message-
 From: Kai Voigt [mailto:k...@123.org] 
 Sent: Tuesday, August 28, 2012 11:57 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop or HBase
 
 Typically, CMSs require a RDBMS. Which Hadoop and HBase are not.
 
 Which CMS do you plan to use, and what's wrong with MySQL or other open
 source RDBMSs?
 
 Kai
 
 On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:
 
 Hi,
I wants to use DFS for Content-Management-System (CMS), in
 that I just wants to store and retrieve files.
 Please suggest me what should I use:
 Hadoop or HBase
 
 Thanks & Regards,
 Kushal Agrawal
 kushalagra...@teledna.com
 
 
 
 -- 
 Kai Voigt
 k...@123.org
 
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






Re: Simple hadoop processes/testing on windows machine

2012-07-25 Thread Kai Voigt
I suggest using a virtual machine with all required services installed and 
configured.

Cloudera offers a distribution as a VM, at 
https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH4PackagesandDownloads

So all you need to do is install VMware Player on your Windows box and deploy that 
VM.

Kai

On 25.07.2012 at 22:49, Brown, Berlin [GCG-PFS] 
berlin.br...@primerica.com wrote:

 Is there a tutorial out there or quick startup that would allow me to:
 
 
 
 1.   Start my hadoop nodes by double clicking on a node, possibly a
 single node
 
 2.   Installing my Task
 
 3.   And then my client connects to those nodes.
 
 
 
 I want to avoid having to install a hadoop windows service or any such
 installs.  Actually, I don't want to install anything for hadoop to run,
 I want to be able to unarchive the software and just double-click and
 run.  I also want to avoid additional hadoop windows registry or env
 params?
 
 
 
 -
 
 
 

-- 
Kai Voigt
k...@123.org






Re: Counting records

2012-07-23 Thread Kai Voigt
Hi,

an additional idea is to use the counter API inside the framework.

http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ has 
a good example.
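
A minimal sketch of that idea (the counter group and name are made up, new-API Mapper 
assumed):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts matching records with a framework counter instead of emitting them.
public class CountingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // stand-in for your real matching criteria
        if (line.toString().contains("ERROR")) {
            context.getCounter("RecordCounts", "MATCHED").increment(1);
        }
    }
}

The aggregated value shows up in the job's counter listing on the web UI and in the 
client output when the job finishes.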

Kai

On 23.07.2012 at 16:25, Peter Marron wrote:

 I am a complete noob with Hadoop and MapReduce and I have a question that
 is probably silly, but I still don't know the answer.
 
 To simplify (a fair bit) I want to count all the records that meet specific 
 criteria.
 I would like to use MapReduce because I anticipate large sources and I want to
 get the performance and reliability that MapReduce offers.

-- 
Kai Voigt
k...@123.org






Re: HDFS APPEND

2012-07-13 Thread Kai Voigt
Hello,

On 13.07.2012 at 22:19, abhiTowson cal wrote:

 Does CDH4 have append option??

Yes!!

Check out 
http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path)
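
A minimal client sketch using that call (the namenode URI and the path are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // append a line to an existing HDFS file
        FSDataOutputStream out = fs.append(new Path("/user/kai/events.log"));
        out.write("another line\n".getBytes("UTF-8"));
        out.close();
    }
}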

Kai

-- 
Kai Voigt
k...@123.org






Re: Binary Files With No Record Begin and End

2012-07-05 Thread Kai Voigt
Hi,

if you know the block size, you can calculate the offsets of your records and write a 
custom record reader class that seeks to the first full record in each split.
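
A minimal sketch of such a reader (new API; it assumes the file length is an exact 
multiple of 180 bytes and that splits come from a plain FileInputFormat):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Reads fixed-size 180-byte records from an unstructured binary file.
// Each split starts at the first record boundary at or after its start
// offset, so a record overlapping two splits is read exactly once.
public class FixedRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private static final int RECORD_SIZE = 180;

    private FSDataInputStream in;
    private long pos;          // current byte position in the file
    private long end;          // first byte past this split
    private LongWritable key = new LongWritable();
    private BytesWritable value = new BytesWritable();
    private byte[] buffer = new byte[RECORD_SIZE];

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        in = fs.open(file);

        long start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        // round up to the next record boundary; the previous split owns a partial record
        pos = ((start + RECORD_SIZE - 1) / RECORD_SIZE) * RECORD_SIZE;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (pos >= end) {
            return false;              // the record starting here belongs to the next split
        }
        in.readFully(pos, buffer);     // may read past the split boundary, which is fine
        key.set(pos / RECORD_SIZE);    // record number as the key
        value.set(buffer, 0, RECORD_SIZE);
        pos += RECORD_SIZE;
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return pos >= end ? 1.0f : 0.0f; }
    @Override public void close() throws IOException { in.close(); }
}

You would return it from a small FileInputFormat subclass in createRecordReader(). 
Making the split size a multiple of 180, as you suggested, works too and keeps all 
reads strictly local.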

Kai

On 05.07.2012 at 22:54, MJ Sam wrote:

 Hi,
 
 The input of my map reduce is a binary file with no record begin and
 end marker. The only thing is that each record is a fixed 180bytes
 size in the binary file. How do I make Hadoop to properly find the
 record in the splits when a record overlap two splits. I was thinking
 to make the splits size to be a multiple of 180 but was wondering if
 there is anything else that I can do?  Please note that my files are
 not sequence file and just a custom binary file.
 

-- 
Kai Voigt
k...@123.org






Re: file checksum

2012-06-25 Thread Kai Voigt
HDFS has block checksums. Whenever a block is written to the datanodes, a 
checksum is calculated and written with the block to the datanodes' disks.

Whenever a block is requested, the block's checksum is verified against the 
stored checksum. If they don't match, that block is corrupt. But since there are 
additional replicas of the block, chances are high that another replica matches the 
checksum. Corrupt blocks will be scheduled for re-replication.

Also, to prevent bit rot, blocks are checked periodically in the background (weekly by 
default, I believe; you can configure that period).
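
If you want to compare a file before and after a copy yourself, there is also a 
client-side call for that; a quick sketch (the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // HDFS returns an MD5-of-MD5-of-CRC32 checksum built from the stored block
        // checksums; it is only comparable when both files were written with the
        // same block size and bytes-per-checksum settings.
        FileChecksum a = fs.getFileChecksum(new Path("/data/original.dat"));
        FileChecksum b = fs.getFileChecksum(new Path("/data/copied.dat"));

        System.out.println("checksums match: " + a.equals(b));
    }
}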

Kai

On 25.06.2012 at 13:29, Rita wrote:

 Does Hadoop, HDFS in particular, do any sanity checks of the file before
 and after balancing/copying/reading the files? We have 20TB of data and I
 want to make sure after these operating are completed the data is still in
 good shape. Where can I read about this?
 
 tia
 
 -- 
 --- Get your facts first, then you can distort them as you please.--

-- 
Kai Voigt
k...@123.org






Re: map task execution time

2012-04-05 Thread Kai Voigt
Hi,

On 05.04.2012 at 00:20, bikash sharma wrote:

 Is it possible to get the execution time of the constituent map/reduce
 tasks of a MapReduce job (say sort) at the end of a job run?
 Preferably, can we obtain this programatically?


you can access the JobTracker's web UI and see the start and stop timestamps 
for every individual task.

Since the JobTracker Java API is exposed, you can also write your own application to 
fetch that data programmatically.

Also, the hadoop job command line tool can be used to read job statistics.
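
A small sketch of the programmatic route through the old mapred client API (the job id 
is a placeholder):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class TaskTimes {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());

        // replace with your real job id, e.g. taken from the JobTracker UI
        JobID jobId = JobID.forName("job_201204050000_0001");

        for (TaskReport report : client.getMapTaskReports(jobId)) {
            long millis = report.getFinishTime() - report.getStartTime();
            System.out.println(report.getTaskID() + " ran for " + millis + " ms");
        }
        // client.getReduceTaskReports(jobId) works the same way for reducers
    }
}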

Kai


-- 
Kai Voigt
k...@123.org






Re: How does Hadoop compile the program written in language other than Java ?

2012-03-04 Thread Kai Voigt
Hi,

the streaming API doesn't compile the streaming scripts.

The PHP/Perl/Python/Ruby scripts you create as mapper and reducer will be 
called as external programs.

The input key/value pairs will be sent to your scripts via stdin, and the output 
will be collected from their stdout.

So, no compilation, the scripts will just be executed.

Kai

On 04.03.2012 at 15:42, Lac Trung wrote:

 Hi everyone !
 
 Hadoop is written in Java, so mapreduce programs are written in Java, too.
 But Hadoop provides an API to MapReduce that allows you to write your map
 and reduce functions in languages other than Java (ex. Python), called
 Hadoop Streaming.
 I read the guide of Hadoop Streaming in
 here: http://www.hadoop.apache.org/common/docs/r0.15.2/streaming.html#More+usage+examples
 but
 I haven't seen any paragraph that write about converting language to Java.
 Can anybody tell me how Hadoop compile the program written in language
 other than Java.
 
 Thank you !
 -- 
 Lac Trung

-- 
Kai Voigt
k...@123.org






Re: How does Hadoop compile the program written in language other than Java ?

2012-03-04 Thread Kai Voigt
Hi,

please read http://developer.yahoo.com/hadoop/tutorial/module4.html#streaming for more 
explanation and examples.

Kai

On 04.03.2012 at 16:10, Lac Trung wrote:

 Can you give one or some examples about this, Kai ?
 I haven't understood how Hadoop run a mapreduce program in other language :D
 
 On 4 March 2012 at 02:21, Kai Voigt k...@123.org wrote:
 
 Hi,
 
 the streaming API doesn't compile the streaming scripts.
 
 The PHP/Perl/Python/Ruby scripts you create as mapper and reducer will be
 called as external programs.
 
 The input key/value pairs will be send to your scripts as stdin, and the
 output will be collected from their stdout.
 
 So, no compilation, the scripts will just be executed.
 
 Kai
 
 On 04.03.2012 at 15:42, Lac Trung wrote:
 
 Hi everyone !
 
 Hadoop is written in Java, so mapreduce programs are written in Java,
 too.
 But Hadoop provides an API to MapReduce that allows you to write your map
 and reduce functions in languages other than Java (ex. Python), called
 Hadoop Streaming.
 I read the guide of Hadoop Streaming in
 here
 http://www.hadoop.apache.org/common/docs/r0.15.2/streaming.html#More+usage+examples
 
 but
 I haven't seen any paragraph that write about converting language to
 Java.
 Can anybody tell me how Hadoop compile the program written in language
 other than Java.
 
 Thank you !
 --
 Lac Trung
 
 --
 Kai Voigt
 k...@123.org
 
 
 
 
 
 
 
 -- 
 Lạc Trung
 20083535

-- 
Kai Voigt
k...@123.org






Re: dfs.block.size

2012-02-27 Thread Kai Voigt
hadoop fsck filename -blocks is the first thing that comes to mind.

http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck has more 
details
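
Programmatically, the same information is available from the FileSystem API; a quick 
sketch (the path is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // block size the file was written with, in bytes
        FileStatus status = fs.getFileStatus(new Path("/user/mohit/data.seq"));
        System.out.println(status.getBlockSize() + " bytes");
    }
}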

Kai

On 28.02.2012 at 02:30, Mohit Anchlia wrote:

 How do I verify the block size of a given file? Is there a command?
 
 On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria j...@cloudera.com wrote:
 
 dfs.block.size can be set per job.
 
 mapred.tasktracker.map.tasks.maximum is per tasktracker.
 
 -Joey
 
 On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 Can someone please suggest if parameters like dfs.block.size,
 mapred.tasktracker.map.tasks.maximum are only cluster wide settings or
 can
 these be set per client job configuration?
 
 On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
 If I want to change the block size then can I use Configuration in
 mapreduce job and set it when writing to the sequence file or does it
 need
 to be cluster wide setting in .xml files?
 
 Also, is there a way to check the block of a given file?
 
 
 
 
 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 

-- 
Kai Voigt
k...@123.org






Re: Secondary namenode fsimage concept

2011-10-06 Thread Kai Voigt
Hi,

you're correct when saying the namenode hosts the fsimage file and the edits 
log file.

The fsimage file contains a snapshot of the HDFS metadata (a filename to blocks 
list mapping). Whenever there is a change to HDFS, it will be appended to the 
edits file. Think of it as a database transaction log, where changes will not 
be applied to the datafile, but appended to a log.

To prevent the edits file growing infinitely, the secondary namenode 
periodically pulls these two files, and the namenode starts writing changes to 
a new edits file. Then, the secondary namenode merges the changes from the 
edits file with the old snapshot from the fsimage file and creates an updated 
fsimage file. This updated fsimage file is then copied to the namenode.

Then, the entire cycle starts again. To answer your question: The namenode has 
both files, even if the secondary namenode is running on a different machine.

Kai

On 06.10.2011 at 07:57, shanmuganathan.r wrote:

 
 Hi All,
 
I have a doubt in hadoop secondary namenode concept . Please 
 correct if the following statements are wrong .
 
 
 The namenode hosts the fsimage and edit log files. The secondary namenode 
 hosts the fsimage file only. At the time of checkpoint the edit log file is 
 transferred to the secondary namenode and the both files are merged, Then the 
 updated fsimage file is transferred to the namenode . Is it correct?
 
 
 If we run the secondary namenode in separate machine , then both machines 
 contain the fsimage file . Namenode only contains the editlog file. Is it 
 true?
 
 
 
 Thanks R.Shanmuganathan  
 
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






Re: Secondary namenode fsimage concept

2011-10-06 Thread Kai Voigt
Hi,

the secondary namenode only fetches the two files when a checkpoint is 
needed.

Kai

On 06.10.2011 at 08:45, shanmuganathan.r wrote:

 Hi Kai,
 
  In the Second part I meant 
 
 
 Is the secondary namenode also contain the FSImage file or the two 
 files(FSImage and EdiltLog) are transferred from the namenode at the 
 checkpoint time.
 
 
 Thanks 
 Shanmuganathan
 
 
 
 
 
  On Thu, 06 Oct 2011 11:37:50 +0530, Kai Voigt <k...@123.org> wrote: 
  
 
 
 Hi, 
 
 you're correct when saying the namenode hosts the fsimage file and the edits 
 log file. 
 
 The fsimage file contains a snapshot of the HDFS metadata (a filename to 
 blocks list mapping). Whenever there is a change to HDFS, it will be appended 
 to the edits file. Think of it as a database transaction log, where changes 
 will not be applied to the datafile, but appended to a log. 
 
 To prevent the edits file growing infinitely, the secondary namenode 
 periodically pulls these two files, and the namenode starts writing changes 
 to a new edits file. Then, the secondary namenode merges the changes from the 
 edits file with the old snapshot from the fsimage file and creates an updated 
 fsimage file. This updated fsimage file is then copied to the namenode. 
 
 Then, the entire cycle starts again. To answer your question: The namenode 
 has both files, even if the secondary namenode is running on a different 
 machine. 
 
 Kai 
 
 On 06.10.2011 at 07:57, shanmuganathan.r wrote: 
 
 > 
 > Hi All, 
 > 
 > I have a doubt in hadoop secondary namenode concept. Please correct if 
 the following statements are wrong. 
 > 
 > The namenode hosts the fsimage and edit log files. The secondary 
 namenode hosts the fsimage file only. At the time of checkpoint the edit log 
 file is transferred to the secondary namenode and the both files are merged, 
 Then the updated fsimage file is transferred to the namenode. Is it correct? 
 > 
 > If we run the secondary namenode in separate machine, then both 
 machines contain the fsimage file. Namenode only contains the editlog file. 
 Is it true? 
 > 
 > Thanks R.Shanmuganathan 
 > 
 
 -- 
 Kai Voigt 
 k...@123.org 
 
 
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






Re: Secondary namenode fsimage concept

2011-10-06 Thread Kai Voigt
Hi,

yes, the secondary namenode is actually a badly named piece of software, as 
it's not a namenode at all. And it's going to be renamed to checkpoint node.

To prevent metadata loss when your namenode fails, you should write the 
namenode files to a local RAID and also a networked storage (NFS, SAN, DRBD). 
It's not the secondary namenode's task to make the metadata available.

Kai

On 06.10.2011 at 09:04, shanmuganathan.r wrote:

 Hi Kai,
 
  There is no data related to the Hadoop cluster stored in the secondary namenode. Am I 
 correct?
 If that is correct, and we run the secondary namenode on a separate machine, then the 
 fetching, merging and transferring time increases when the namenode's fsimage file is 
 large. If a failover occurs at that point, how can we recover the changes of up to one 
 hour in HDFS? (The default checkpoint interval is one hour.)
 
 Thanks R.Shanmuganathan  
 
 
 
 
 
 
  On Thu, 06 Oct 2011 12:20:28 +0530, Kai Voigt <k...@123.org> wrote: 
  
 
 
 Hi, 
 
 the secondary namenode only fetches the two files when a checkpointing is 
 needed. 
 
 Kai 
 
 On 06.10.2011 at 08:45, shanmuganathan.r wrote: 
 
 > Hi Kai, 
 > 
 > In the Second part I meant 
 > 
 > Is the secondary namenode also contain the FSImage file or the two 
 files(FSImage and EdiltLog) are transferred from the namenode at the 
 checkpoint time. 
 > 
 > Thanks 
 > Shanmuganathan 
 > 
 >  On Thu, 06 Oct 2011 11:37:50 +0530, Kai Voigt <k...@123.org> wrote: 
 > 
 > Hi, 
 > 
 > you're correct when saying the namenode hosts the fsimage file and the 
 edits log file. 
 > 
 > The fsimage file contains a snapshot of the HDFS metadata (a filename to 
 blocks list mapping). Whenever there is a change to HDFS, it will be appended 
 to the edits file. Think of it as a database transaction log, where changes 
 will not be applied to the datafile, but appended to a log. 
 > 
 > To prevent the edits file growing infinitely, the secondary namenode 
 periodically pulls these two files, and the namenode starts writing changes 
 to a new edits file. Then, the secondary namenode merges the changes from the 
 edits file with the old snapshot from the fsimage file and creates an updated 
 fsimage file. This updated fsimage file is then copied to the namenode. 
 > 
 > Then, the entire cycle starts again. To answer your question: The 
 namenode has both files, even if the secondary namenode is running on a 
 different machine. 
 > 
 > Kai 
 > 
 > On 06.10.2011 at 07:57, shanmuganathan.r wrote: 
 > 
 > > Hi All, 
 > > 
 > > I have a doubt in hadoop secondary namenode concept. Please 
 correct if the following statements are wrong. 
 > > 
 > > The namenode hosts the fsimage and edit log files. The 
 secondary namenode hosts the fsimage file only. At the time of checkpoint the 
 edit log file is transferred to the secondary namenode and the both files are 
 merged, Then the updated fsimage file is transferred to the namenode. Is it 
 correct? 
 > > 
 > > If we run the secondary namenode in separate machine, then 
 both machines contain the fsimage file. Namenode only contains the editlog 
 file. Is it true? 
 > > 
 > > Thanks R.Shanmuganathan 
 > 
 > -- 
 > Kai Voigt 
 > k...@123.org 
 > 
 
 -- 
 Kai Voigt 
 k...@123.org 
 
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






Re: phases of Hadoop Jobs

2011-09-19 Thread Kai Voigt
Hi Chen,

yes, it saves time to move map() output to the nodes where it will be needed 
for the reduce() input. After map() has processed the first blocks, it makes 
sense to copy that output to the reduce nodes right away. Imagine a very large map() 
output: if the shuffle copy were postponed until all map tasks are done, we'd just be 
waiting. So those things happen in parallel.

Consider it like Unix pipes where you start processing the output of the first 
command as the input of the next command.

$ command1 | command2

as opposed to storing the output first and then processing it.

$ command1 > file ; $ command2 < file

Kai

On 19.09.2011 at 08:20, He Chen wrote:

 Hi Kai
 
 Thank you  for the reply.
 
 The reduce() will not start because the shuffle phase does not finish. And
 the shuffle phase will not finish untill alll mapper end.
 
 I am curious about the design purpose about overlapping the map and reduce
 stage. Was this only for saving shuffling time? Or there are some other
 reasons.
 
 Best wishes!
 
 Chen
 On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt k...@123.org wrote:
 
 Hi Chen,
 
 the times when nodes running instances of the map and reduce nodes overlap.
 But map() and reduce() execution will not.
 
 reduce nodes will start copying data from map nodes, that's the shuffle
 phase. And the map nodes are still running during that copy phase. My
 observation had been that if the map phase progresses from 0 to 100%, it
 matches with the reduce phase progress from 0-33%. For example, if you map
 progress shows 60%, reduce might show 20%.
 
 But the reduce() will not start until all the map() code has processed the
 entire input. So you will never see the reduce progress higher than 66% when
 map progress didn't reach 100%.
 
 If you see map phase reaching 100%, but reduce phase not making any higher
 number than 66%, it means your reduce() code is broken or slow because it
 doesn't produce any output. An infinitive loop is a common mistake.
 
 Kai
 
 On 19.09.2011 at 07:29, He Chen wrote:
 
 Hi Arun
 
 I have a question. Do you know what is the reason that hadoop allows the
 map
 and the reduce stage overlap? Or anyone knows about it. Thank you in
 advance.
 
 Chen
 
 On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com
 wrote:
 
 Nan,
 
 The 'phase' is implicitly understood by the 'progress' (value) made by
 the
 map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
 
 For e.g.
 Reduce:
 0-33% - Shuffle
 34-66% - Sort (actually, just 'merge', there is no sort in the reduce
 since all map-outputs are sorted)
 67-100% - Reduce
 
 With 0.23 onwards the Map has phases too:
 0-90% - Map
 91-100% - Final Sort/merge
 
 Now,about starting reduces early - this is done to ensure shuffle can
 proceed for completed maps while rest of the maps run, there-by
 pipelining
 shuffle and map completion. There is a 'reduce slowstart' feature to
 control
 this - by default, reduces aren't started until 5% of maps are complete.
 Users can set this higher.
 
 Arun
 
 On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
 
 Hi, all
 
 recently, I was hit by a question, how is a hadoop job divided into 2
 phases?,
 
 In textbooks, we are told that the mapreduce jobs are divided into 2
 phases,
 map and reduce, and for reduce, we further divided it into 3 stages,
 shuffle, sort, and reduce, but in hadoop codes, I never think about
 this question, I didn't see any variable members in JobInProgress class
 to indicate this information,
 
 and according to my understanding on the source code of hadoop, the
 reduce
 tasks are unnecessarily started until all mappers are finished, in
 constract, we can see the reduce tasks are in shuffle stage while there
 are
 mappers which are still in running,
 So how can I indicate the phase which the job is belonging to?
 
 Thanks
 --
 Nan Zhu
 School of Electronic, Information and Electrical Engineering,229
 Shanghai Jiao Tong University
 800,Dongchuan Road,Shanghai,China
 E-Mail: zhunans...@gmail.com
 
 
 
 --
 Kai Voigt
 k...@123.org
 
 
 
 
 

-- 
Kai Voigt
k...@123.org






Re: Reducer to concatenate string values

2011-09-19 Thread Kai Voigt
Hi Daniel,

the values for a single key will be passed to reduce() in a non-predictable 
order. Actually, when running the same job on the same data again, the order is 
most likely different every time.

If you want the values to arrive in a sorted order, you need to apply a 'secondary 
sort'. The basic idea is to attach your values (or their sort position) to the key, and 
then benefit from the sorting Hadoop does on the key.

However, you need to write some code to make that happen. Josh wrote a nice 
series of articles on it, and you will find more if you google for secondary 
sort.

http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-2/
http://www.cloudera.com/blog/2011/04/simple-moving-average-secondary-sort-and-mapreduce-part-3/
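
To give a rough idea of the pieces involved, here is a minimal sketch. It assumes the 
mapper emits a composite Text key like "all#0000000017" (the natural key, then a 
zero-padded position to sort by) and the string fragment as the value:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite textual key of the form "naturalKey#0000000042": the part before
// '#' decides partitioning and grouping, the zero-padded number after it
// decides the order in which values reach reduce().
public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String natural = key.toString().split("#")[0];
        return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Makes all composite keys with the same natural key go into one reduce() call.
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String na = a.toString().split("#")[0];
        String nb = b.toString().split("#")[0];
        return na.compareTo(nb);
    }
}

In the driver you would register them with job.setPartitionerClass() and 
job.setGroupingComparatorClass(); reduce() then sees the fragments of one natural key 
in position order and can simply concatenate them.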

Kai

On 20.09.2011 at 07:43, Daniel Yehdego wrote:

 
 Good evening, 
 I have a certain value output from a mapper and I want to concatenate the 
 string outputs using a Reducer (one reducer).But the order of the 
 concatenated string values is not in order. How can I use a reducer that 
 receives a value from a mapper output and concatenate the strings in order. 
 waiting your response and thanks in advance.  
 
 Regards, 
 
 Daniel T. Yehdego
 Computational Science Program 
 University of Texas at El Paso, UTEP 
 dtyehd...@miners.utep.edu   

-- 
Kai Voigt
k...@123.org






Re: phases of Hadoop Jobs

2011-09-18 Thread Kai Voigt
Hi,

these 0-33-66-100% phases are really confusing to beginners. We see that in our 
training classes. The output should be more verbose, for example by breaking down the 
phases into separate progress numbers.

Does that make sense?

On 19.09.2011 at 06:17, Arun C Murthy wrote:

 Nan,
 
 The 'phase' is implicitly understood by the 'progress' (value) made by the 
 map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
 
 For e.g. 
 Reduce: 
 0-33% - Shuffle
 34-66% - Sort (actually, just 'merge', there is no sort in the reduce since 
 all map-outputs are sorted)
 67-100% - Reduce
 
 With 0.23 onwards the Map has phases too:
 0-90% - Map
 91-100% - Final Sort/merge
 
 Now,about starting reduces early - this is done to ensure shuffle can proceed 
 for completed maps while rest of the maps run, there-by pipelining shuffle 
 and map completion. There is a 'reduce slowstart' feature to control this - 
 by default, reduces aren't started until 5% of maps are complete. Users can 
 set this higher.
 
 Arun
 
 On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
 
 Hi, all
 
 recently, I was hit by a question, how is a hadoop job divided into 2
 phases?,
 
 In textbooks, we are told that the mapreduce jobs are divided into 2 phases,
 map and reduce, and for reduce, we further divided it into 3 stages,
 shuffle, sort, and reduce, but in hadoop codes, I never think about
 this question, I didn't see any variable members in JobInProgress class
 to indicate this information,
 
 and according to my understanding on the source code of hadoop, the reduce
 tasks are unnecessarily started until all mappers are finished, in
 constract, we can see the reduce tasks are in shuffle stage while there are
 mappers which are still in running,
 So how can I indicate the phase which the job is belonging to?
 
 Thanks
 -- 
 Nan Zhu
 School of Electronic, Information and Electrical Engineering,229
 Shanghai Jiao Tong University
 800,Dongchuan Road,Shanghai,China
 E-Mail: zhunans...@gmail.com
 
 

-- 
Kai Voigt
k...@123.org






Re: Hadoop with Netapp

2011-08-25 Thread Kai Voigt
Hi,

On 25.08.2011 at 08:58, Hakan İlter wrote:

 We are going to create a new Hadoop cluster in our company, i have to get
 some advises from you:
 
 1. Does anyone have stored whole Hadoop data not on local disks but
 on Netapp or other storage system? Do we have to store datas on local disks,
 if so is it because of performace issues?

HDFS and MapReduce benefit massively from local storage, so using any kind of 
remote storage (SAN, Amazon S3, etc) will make things slower.

 2. What do you think about running Hadoop nodes in virtual (VMware)
 servers?


Virtualization can make certain things easier to handle, but it's a layer that 
will eat resources.

Kai

-- 
Kai Voigt
k...@123.org






Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-13 Thread Kai Voigt
Hi,

some search on Google would have told you. Here's one link:

http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/?r=131

Kai

On 13.08.2011 at 15:27, Sean Hogan wrote:

 Hi all,
 
 I was interested in learning from how Hadoop implements their sort algorithm
 in the map/reduce framework. Could someone point me to the directory of the
 source code that has the mapper/reducer that the Sort example uses by
 default when I invoke:
 
 $ hadoop jar hadoop-*-examples.jar sort input output
 
 Thanks. I've found Sort.java here :
 
 http://svn.apache.org/viewvc/hadoop/common/trunk/mapreduce/src/examples/org/apache/hadoop/examples/
 
 But have not been able to track down the mapper/reducer implementation.
 
 -Sean Hogan

-- 
Kai Voigt
k...@123.org






Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-13 Thread Kai Voigt
Hi,

the Identity Mapper and Reducer do what the name implies, they pretty much 
return their input as their output.

TeraSort relies on the sorting that is built into Hadoop's sort & shuffle phase.

So, the map() method in TeraSort looks like this:

map(offset, line) -> (line, _)

offset is the key to map() and represents the byte offset of the line (which is 
the value). map() returns the line as the key and some value which is not 
needed.

reduce() looks like this:

reduce(line, values) -> (line)

Again, the input is returned as is. The sort & shuffle layer between map() and 
reduce() guarantees that keys (lines) will come in sorted order. That's why the 
overall output will be the sorted input.

This all is easy when there's just one reducer. Question to make sure you 
understood things so far: What's the issue with more than one reducer?

Kai

On 13.08.2011 at 17:10, Sean Hogan wrote:

 Thanks for the link, but it hasn't helped answer my original question - that
 Sort.java seems to use IdentityMapper and IdentityReducer. Perhaps it is the
 Sort.java that is used when executing the below command, but I can't figure
 out what it actually uses for the mapper and reducer. It's entirely possible
 I'm just missing something obvious.
 
 I'm interested in seeing how the map and reduce fits into sorting with the
 following command:
 
 $ hadoop jar hadoop-*-examples.jar sort input output
 
 I'd appreciate it if someone could explain what mappers/reducers are used in
 that above command (link to the implementation of whatever sort they use and
 how it fits into MapReduce)
 
 Thanks.
 
 -Sean

-- 
Kai Voigt
k...@123.org






Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-13 Thread Kai Voigt
Good job, in MapReduce you can build your own Partitioner. That is code 
determining which reducer will get which keys.

For simplicity, assume you're running 26 reducers. Your custom Partitioner will 
make sure the first reducer gets all keys starting with 'a', and so on.

Since the keys will be sorted within a single reducer, you can concatenate your 
26 output files to get an overall sorted output.
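
A tiny sketch of such a partitioner (assuming plain-text keys that start with a letter 
and exactly 26 reducers):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends keys starting with 'a' to reducer 0, 'b' to reducer 1, and so on,
// so concatenating the 26 sorted output files yields a totally sorted result.
public class AlphabetPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        char first = Character.toLowerCase(key.toString().charAt(0));
        int bucket = first - 'a';
        if (bucket < 0 || bucket >= numPartitions) {
            bucket = numPartitions - 1;   // everything else goes to the last reducer
        }
        return bucket;
    }
}

TeraSort generalizes this idea: it samples the input and uses a TotalOrderPartitioner 
so that each reducer gets a roughly equal key range.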

Making sense?

Kai

On 13.08.2011 at 17:44, Sean Hogan wrote:

 Oh, okay, got it - if there was more than one reducer then there needs to be
 a way to guarantee that the overall output from multiple reducers will still
 be sorted.
 
 So I want to look for where the implementation of the shuffle/sort phase is
 located. Or find something on how Hadoop implements the MapReduce
 sort/shuffle phase.
 
 Thanks!
 
 -Sean

-- 
Kai Voigt
k...@123.org






Re: Question about RAID controllers and hadoop

2011-08-11 Thread Kai Voigt
Yahoo did some testing 2 years ago: http://markmail.org/message/xmzc45zi25htr7ry

But updated benchmark would be interesting to see.

Kai

On 12.08.2011 at 00:13, GOEKE, MATTHEW (AG/1000) wrote:

 My assumption would be that having a set of 4 raid 0 disks would actually be 
 better than having a controller that allowed pure JBOD of 4 disks due to the 
 cache on the controller. If anyone has any personal experience with this I 
 would love to know performance numbers but our infrastructure guy is doing 
 tests on exactly this over the next couple days so I will pass it along once 
 we have it.
 
 Matt
 
 -Original Message-
 From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] 
 Sent: Thursday, August 11, 2011 5:00 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Question about RAID controllers and hadoop
 
 True, you need a P410 controller. You can create RAID0 for each disk to make 
 it as JBOD.
 
 
 -Bharath
 
 
 
 
 From: Koert Kuipers ko...@tresata.com
 To: common-user@hadoop.apache.org
 Sent: Thursday, August 11, 2011 2:50 PM
 Subject: Question about RAID controllers and hadoop
 
 Hello all,
 We are considering using low end HP proliant machines (DL160s and DL180s)
 for cluster nodes. However with these machines if you want to do more than 4
 hard drives then HP puts in a P410 raid controller. We would configure the
 RAID controller to function as JBOD, by simply creating multiple RAID
 volumes with one disk. Does anyone have experience with this setup? Is it a
 good idea, or am i introducing a i/o bottleneck?
 Thanks for your help!
 Best, Koert
 
 

-- 
Kai Voigt
k...@123.org






Re: Where is web interface in stand alone operation?

2011-08-10 Thread Kai Voigt
Hi,

just connect to http://localhost:50070/ or http://localhost:50030/ to access 
the web interfaces.

Kai

On 10.08.2011 at 14:31, A Df wrote:

 Dear All:
 
 I know that in pseudo mode that there is a web interface for the NameNode and 
 the JobTracker but where is it for the standalone operation? The Hadoop page 
 at http://hadoop.apache.org/common/docs/current/single_node_setup.html just 
 shows to run the jar example but how do you view job details? For example 
 time to complete etc. I know it will not be as detailed as the other modes 
 but I wanted to compare the job peformance in standalone vs pseudo mode. 
 Thank you.
 
 
 Cheers,
 A Df

-- 
Kai Voigt
k...@123.org






Re: Where is web interface in stand alone operation?

2011-08-10 Thread Kai Voigt
Hi,

for further clarification: are you running in standalone mode (one JVM that runs 
everything) or in pseudo-distributed mode (one machine, but with 5 JVMs)? In 
pseudo-distributed mode, you can access the web interfaces on ports 50030 and 50070. In 
standalone mode, you should at least be able to do process monitoring (the Unix time 
command, various trace commands)

Kai

On 10.08.2011 at 15:58, A Df wrote:

 
 
 Hello Harsh:
 
 See inline at *
 
 
 
 From: Harsh J ha...@cloudera.com
 To: common-user@hadoop.apache.org; A Df abbey_dragonfor...@yahoo.com
 Sent: Wednesday, 10 August 2011, 14:44
 Subject: Re: Where is web interface in stand alone operation?
 
 A Df,
 
 The web UIs are a feature of the daemons JobTracker and NameNode. In
 standalone/'local'/'file:///' modes, these daemons aren't run
 (actually, no daemon is run at all), and hence there would be no 'web'
 interface.
 
 *ok, but is there any other way to check the performance in this mode such 
 as time to complete etc? I am trying to compare performance between the two. 
 And also for the pseudo mode how would I change the ports for the web 
 interface because I have to connect to a remote server which only allows 
 certain ports to be accessed from the web?
 
 On Wed, Aug 10, 2011 at 6:01 PM, A Df abbey_dragonfor...@yahoo.com wrote:
 Dear All:
 
 I know that in pseudo mode that there is a web interface for the NameNode 
 and the JobTracker but where is it for the standalone operation? The Hadoop 
 page at http://hadoop.apache.org/common/docs/current/single_node_setup.html 
 just shows to run the jar example but how do you view job details? For 
 example time to complete etc. I know it will not be as detailed as the 
 other modes but I wanted to compare the job peformance in standalone vs 
 pseudo mode. Thank you.
 
 
 Cheers,
 A Df
 
 
 
 
 -- 
 Harsh J
 
 

-- 
Kai Voigt
k...@123.org






Re: Can I number output results with a Counter?

2011-05-20 Thread Kai Voigt
Also, with speculative execution enabled, you might see a higher count than you 
expect while the same task is running multiple times in parallel. When a task 
gets killed because another instance was quicker, its counters will be 
removed from the global count though.

Kai

On 20.05.2011 at 19:34, Joey Echeverria wrote:

 Counters are a way to get status from your running job. They don't
 increment a global state. They locally save increments and
 periodically report those increments to the central counter. That
 means that the final count will be correct, but you can't use them to
 coordinate counts while your job is running.
 
 -Joey
 
 On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Joey,
 
 You understood me perfectly well. I see your first advice, but I am not
 allowed to have gaps. A central service is something I may consider if
 single reducer becomes a worse bottleneck than it.
 
 But what are counters for? They seem to be exactly that.
 
 Mark
 
 On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria j...@cloudera.com wrote:
 
 To make sure I understand you correctly, you need a globally unique
 one up counter for each output record?
 
 If you had an upper bound on the number of records a single reducer
 could output and you can afford to have gaps, you could just use the
 task id and multiply that by the max number of records and then one up
 from there.
 
 If that doesn't work for you, then you'll need to use some kind of
 central service for allocating numbers which could become a
 bottleneck.
 
 -Joey
 
 On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner markkerz...@gmail.com
 wrote:
 Hi, can I use a Counter to give each record in all reducers a consecutive
 number? Currently I am using a single Reducer, but it is an anti-pattern.
 But I need to assign consecutive numbers to all output records in all
 reducers, and it does not matter how, as long as each gets its own
 number.
 
 If it IS possible, then how are multiple processes accessing those
 counters
 without creating race conditions.
 
 Thank you,
 
 Mark
 
 
 
 
 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 
 
 
 
 
 -- 
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 

-- 
Kai Voigt
k...@123.org






Re: providing the same input to more than one Map task

2011-04-25 Thread Kai Voigt
Hi,

I'd use the distributed cache to store the vector on every mapper machine 
locally.
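
A rough sketch of that approach (the file name and the "row col value" matrix input 
format are assumptions; old DistributedCache API):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixVectorMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private List<Double> vector = new ArrayList<Double>();

    // In the driver: DistributedCache.addCacheFile(new URI("/data/vector.txt"), job.getConfiguration());

    @Override
    protected void setup(Context context) throws IOException {
        // the cached file is available as a local file on every mapper machine
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
            vector.add(Double.parseDouble(line.trim()));   // one vector element per line
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assumed matrix input format: "row col value"
        String[] parts = line.toString().trim().split("\\s+");
        int i = Integer.parseInt(parts[0]);
        int j = Integer.parseInt(parts[1]);
        double m = Double.parseDouble(parts[2]);
        context.write(new LongWritable(i), new Text(Double.toString(m * vector.get(j))));
    }
}

The reducer then just sums the values it receives for each row index i.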

Kai

On 22.04.2011 at 21:15, Alexandra Anghelescu wrote:

 Hi all,
 
 I am trying to perform matrix-vector multiplication using Hadoop.
 So I have matrix M in a file, and vector v in another file. How can I make
 it so that each Map task will get the whole vector v and a chunk of matrix
 M?
 Basically I want my map function to output key-value pairs (i,m[i,j]*v[j]),
 where i is the row number, and j the column number. And the reduce function
 will sum up all the values with the same key i, and that will be the ith
 element of my result vector.
 Or can you suggest another way to do it?
 
 
 Thanks,
 Alexandra Anghelescu

-- 
Kai Voigt
k...@123.org






Re: Best way to Merge small XML files

2011-02-03 Thread Kai Voigt
Did you look into Hadoop Archives?

http://hadoop.apache.org/mapreduce/docs/r0.21.0/hadoop_archives.html

Kai

On 03.02.2011 at 11:44, madhu phatak wrote:

 Hi
 You can write an InputFormat which create input splits from multiple files .
 It will solve your problem.
 
 On Wed, Feb 2, 2011 at 4:04 PM, Shuja Rehman shujamug...@gmail.com wrote:
 
 Hi Folks,
 
 I am having hundreds of small xml files coming each hour. The size varies
 from 5 Mb to 15 Mb. As Hadoop did not work well with small files so i want
 to merge these small files. So what is the best option to merge these xml
 files?
 
 
 
 --
 Regards
 Shuja-ur-Rehman Baig
 http://pk.linkedin.com/in/shujamughal
 

-- 
Kai Voigt
k...@123.org