Re: Design question
Any suggestion or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: [...]
Design question
I just wanted to check how people design their storage directories for data that is sent to the system continuously. For example, for a given functionality we get a data feed continuously written to a SequenceFile, which is then converted to a more structured format using MapReduce and stored in tab-separated files. For such a continuous feed, what's the best way to organize directories and names? Should it be based just on timestamp, or is there something better that helps in organizing the data?

Second part of the question: is it better to store output in SequenceFiles so that we can take advantage of per-record compression? This seems to be required, since gzip/snappy compression of the entire file would launch only one map task.

And the last question: when compressing a flat file, should it first be split into multiple files so that we get multiple mappers if we need to run another job on this file? LZO is another alternative, but it requires additional configuration; is it preferred? Any articles or suggestions would be very helpful.
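One common convention for continuous feeds is to partition the ingest directory by date and hour, so each batch lands in its own partition and downstream jobs can select an exact time range of input. A minimal Python sketch of such a layout (the base path, feed name, and function are hypothetical illustrations, not an established standard):

```python
from datetime import datetime, timezone

def partition_path(base, feed, ts):
    """Build a date/hour-partitioned directory path for a feed.

    base : root directory, e.g. "/data"
    feed : logical feed name, e.g. "clicks"
    ts   : batch or event timestamp (datetime)
    """
    return "{}/{}/{:04d}/{:02d}/{:02d}/{:02d}".format(
        base, feed, ts.year, ts.month, ts.day, ts.hour)

# Each batch of the continuous feed gets its own hour directory, so a
# MapReduce job can take exactly one time range as input, and old
# partitions can be archived or deleted independently.
batch_ts = datetime(2012, 4, 23, 15, 27, tzinfo=timezone.utc)
print(partition_path("/data", "clicks", batch_ts))
# -> /data/clicks/2012/04/23/15
```

A layout like this also sidesteps part of the compression question: many small hourly files naturally give many map tasks, at the cost of pressure on the NameNode if files get too small.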
Re: Hbase + mapreduce -- operational design question
I believe HBase has a TTL (time-to-live expiry) for records and can clean them up on its own.

On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay chinm...@qualcomm.com wrote: [...]

-- Eugene Kirpichov Principal Engineer, Mirantis Inc. http://www.mirantis.com/ Editor, http://fprog.ru/
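HBase does let you set a TTL per column family, after which expired cells are removed during compaction. The expiry rule itself can be modeled in plain Python (the event dicts and function name are hypothetical; this sketches the semantics, not the HBase API):

```python
TTL_SECONDS = 24 * 60 * 60  # e.g. keep events for one day

def live_events(events, now, ttl=TTL_SECONDS):
    """Return only events whose timestamp is within the TTL window,
    mimicking how HBase drops cells older than the family's TTL."""
    return [e for e in events if now - e["ts"] <= ttl]

now = 1_000_000
events = [{"id": 1, "ts": now - 90_000},  # older than one day -> expired
          {"id": 2, "ts": now - 3_600}]   # one hour old -> kept
print([e["id"] for e in live_events(events, now)])
# -> [2]
```

Note that TTL only solves cleanup; it does not by itself guarantee the daily job has already seen an event before it expires, so the TTL would need to be comfortably longer than the job interval.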
Re: Hbase + mapreduce -- operational design question
Chinmay, how are you configuring your job? Have you checked using setScan and selecting the keys you care to run MR over? See http://ofps.oreilly.com/titles/9781449396107/mapreduce.html

As a shameless plug: for your reports, see if you want to leverage Crux: https://github.com/sonalgoyal/crux

Best Regards, Sonal
Crux: Reporting for HBase https://github.com/sonalgoyal/crux
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Sat, Sep 10, 2011 at 2:53 PM, Eugene Kirpichov ekirpic...@gmail.com wrote: [...]
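The idea behind configuring the Scan is that if row keys carry a time component, the job can restrict the scanned key range so already-processed rows never reach the mappers at all. The key-range selection can be sketched in Python over a sorted list of row keys (the key format and function are hypothetical illustrations):

```python
import bisect

def scan_range(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row), analogous to an HBase Scan
    with a start row and an exclusive stop row over a sorted table."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Row keys prefixed with a date bucket: only one day's events are scanned,
# so the daily report job never touches previously processed rows.
keys = ["20110908-e1", "20110909-e2", "20110909-e3", "20110910-e4"]
print(scan_range(keys, "20110909", "20110910"))
# -> ['20110909-e2', '20110909-e3']
```

The trade-off is that this requires designing row keys (or at least a scan boundary) around processing time up front.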
Hbase + mapreduce -- operational design question
Hello, I have a setup where a bunch of clients store 'events' in an HBase table. Also, periodically (once a day), I run a MapReduce job that goes over the table and computes some reports.

My issue is that on the next run I don't want the MapReduce job to process the 'events' that it has already processed previously. I know that I can mark processed events in the HBase table and the mapper can filter them out during the next run, but what I would really like is for previously processed events to not even hit the mapper.

One solution I can think of is to back up the HBase table after running the job and then clear the table, but this has a lot of problems:
1) Clients may have inserted events while the job was running.
2) I could disable and drop the table and then create it again, but then the clients would complain about this short window of unavailability.

What do people using HBase (live) + MapReduce typically do? Thanks! Chinmay
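One pattern that avoids clearing the table entirely is a high-water mark: persist the maximum event timestamp the last run processed, and have the next run take only strictly newer events. Events inserted while the job runs are simply picked up next time. A minimal sketch of the bookkeeping (event dicts and function name are hypothetical):

```python
def unprocessed(events, watermark):
    """Select events strictly newer than the last run's high-water mark.

    Returns (new_events, new_watermark). Persisting new_watermark only
    after a successful run means events inserted mid-job are not lost,
    unlike the backup-then-clear approach described above.
    """
    new_events = [e for e in events if e["ts"] > watermark]
    new_mark = max((e["ts"] for e in new_events), default=watermark)
    return new_events, new_mark

events = [{"id": "a", "ts": 100}, {"id": "b", "ts": 250}, {"id": "c", "ts": 300}]
batch, mark = unprocessed(events, watermark=100)
print([e["id"] for e in batch], mark)
# -> ['b', 'c'] 300
```

In HBase terms the same effect can come from restricting the Scan's time range or key range to (watermark, now], so old rows never reach the mappers.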
Job design question
Hi all, I'm trying to design an MR job for processing walks-on-graph data from a database. The idea is that I have a list of random walks on a graph (which is unknown). I have two tables (walk ids and hops):
- the first holds the list of random-walk ids, one row per walk, each with a unique id (increasing);
- the second holds, for each walk (identified by the uid), the list of hops (vertices) traversed in the walk, one hop per row.
These two tables are in a one-to-many structure, with the walk uid used as a foreign key in the hops table. This means walks can be split between nodes, but the hops of a single walk must not be. How would you suggest handling this structure? Is it even possible with DBInputFormat?

Second, assuming this split is possible in an MR job, I would like to have different reducers operate on the data during a single read (I want to avoid reading the data multiple times, since that can take a long time). For example, one reducer should create the actual graph: (Source Node, Dest Node) -> (num_walks). Another should create a length analysis: (Origin Node, Final Node) -> distance, etc.

Any comments and thoughts will help! Thanks.
--
View this message in context: http://www.nabble.com/Job-design-question-tp25076132p25076132.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
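The "hops per walk must stay together" constraint is essentially a group-by on walk id before any per-walk computation. A plain-Python sketch of that grouping and the (Source, Dest) -> num_walks output (table layout and names are hypothetical; in MapReduce the grouping would come from keying map output by walk id):

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def edge_counts(hop_rows):
    """hop_rows: (walk_id, seq, vertex) tuples, as read from the hops table.

    Grouping on walk_id guarantees each whole walk is handled as one unit;
    consecutive vertex pairs within a walk then yield the
    (source, dest) -> num_walks graph.
    """
    counts = Counter()
    rows = sorted(hop_rows)  # order by walk_id, then hop sequence
    for walk_id, hops in groupby(rows, key=itemgetter(0)):
        vertices = [v for _, _, v in hops]
        counts.update(zip(vertices, vertices[1:]))
    return counts

hops = [(1, 0, "A"), (1, 1, "B"), (1, 2, "C"),
        (2, 0, "A"), (2, 1, "B")]
print(edge_counts(hops))
# -> Counter({('A', 'B'): 2, ('B', 'C'): 1})
```

The length analysis would follow the same per-walk grouping, emitting (first vertex, last vertex) -> len(vertices) - 1 instead, so both reports can be produced from one pass over the grouped data.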