Saumitra,

Two questions come to mind that could help you narrow down a solution:

1) How quickly do the downstream processes need the transformed data?
        Reason: If you can delay processing long enough to batch the data into 
a blob that is a multiple of your block size, then you will be playing to the 
strengths of vanilla MR.

2) What else will be running on the cluster?
        Reason: If the cluster is set up primarily for this use case, then how 
often the job runs and what resources it consumes only need to be optimized if 
it can't process the data fast enough. If it is not, you could always set up a 
separate pool for this job in the Fair Scheduler and allow it to use a certain 
amount of capacity on the cluster while these events are being generated.
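
To make point 1 concrete, here is a rough Python sketch of the batching idea: 
accumulate incoming chunks until a batch reaches at least a chosen number of 
blocks, then hand that batch to a single MR job. The name `plan_batches`, the 
64 MB block size, and the `blocks_per_batch` parameter are assumptions for 
illustration only; use whatever your cluster's dfs.block.size actually is.

```python
from typing import Iterable, List

# Assumed block size for illustration; check your cluster's configuration.
BLOCK_SIZE = 64 * 1024 * 1024


def plan_batches(chunk_sizes: Iterable[int],
                 block_size: int = BLOCK_SIZE,
                 blocks_per_batch: int = 1) -> List[List[int]]:
    """Group chunk sizes (bytes) into batches of at least
    blocks_per_batch * block_size bytes each."""
    target = block_size * blocks_per_batch
    batches: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in chunk_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= target:
            # Batch is large enough to justify one MR job.
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        # Trailing partial batch: the caller decides whether to wait or flush.
        batches.append(current)
    return batches
```

Each batch only reaches the target size approximately, since chunks are not 
split. How long you can afford to wait for a batch to fill depends on the 
answer to question 1.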

Apart from the fact that you would end up with a lot of small files on the 
cluster (which can be resolved by running a nightly job that combines them into 
larger blobs and then deletes the originals), I am not sure I would be too 
concerned about at least trying out this method. It would be helpful to know 
the size and type of data coming in, as well as what type of operation you are 
looking to do, if you would like a more concrete suggestion. Log data is a 
prime example of this type of workflow, and there are many suggestions out 
there as well as projects that attempt to address it (e.g. Chukwa).
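
As a sketch of what the nightly "blob and delete" job could look like, here is 
a small Python example. It uses the local filesystem as a stand-in for HDFS, 
and the function name `compact_small_files` is made up for illustration; a real 
job would go through the HDFS client or an MR job instead. It also assumes the 
destination file lives outside the source directory.

```python
from pathlib import Path


def compact_small_files(src_dir: str, dest_file: str) -> int:
    """Concatenate every file in src_dir into dest_file (in name order),
    then delete the originals. Returns the number of files merged."""
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dest_file, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    # Remove originals only after the blob has been fully written and closed.
    for part in parts:
        part.unlink()
    return len(parts)
```

Plain concatenation like this only makes sense for record-oriented data such 
as line-delimited logs, where the file boundaries carry no meaning.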

HTH,
Matt

-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Saumitra Shahapure
Sent: Friday, June 24, 2011 12:12 PM
To: [email protected]
Subject: Queue support from HDFS

Hi,

Is a queue-like structure supported in HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a data-independent
operation needs to be applied to it (so only a Map function; the reducer is
the identity).
I wish to distribute the data among nodes using HDFS and start processing it
as it arrives, preferably in a single MR job.

I agree that it can be done by starting a new MR job for each batch of data,
but is starting many MR jobs frequently for small data chunks a good idea?
(Consider that a new batch arrives every few seconds and processing one batch
takes a few minutes.)

Thanks,
-- 
Saumitra S. Shahapure