Hadoop with SELinux?
Hello list,

Can someone please tell me if it would be possible to run hadoop with SELinux enabled across the cluster? Are there any known issues or, better, how-tos I can be pointed at? Also interested in running iptables on the nodes - easy to do?

Many thanks in advance,
Robert

Robert Gibbon
Solutions Architect Integration Design Solution Engineering
Vodafone Group Service GmbH
Mannesmannufer 2, D-40213 Düsseldorf, Germany
Amtsgericht Düsseldorf, HRB 53554
Geschäftsführung: Helmut Hoffmann, Dr. Joachim Peters
Re: Hadoop with SELinux?
We have used SELinux on our large cluster with HDFS (we don't use MR). The only issue I've found is that the mount program does not have permission to execute java, which prohibits you from mounting the FUSE filesystem from /etc/fstab. This is fixed with the policy file below.

require {
        type mount_t;
        type shell_exec_t;
        type proc_net_t;
        type random_device_t;
        type java_exec_t;
        type fusefs_t;
        class process { execstack execmem getsched setrlimit };
        class tcp_socket { accept listen };
        class chr_file read;
        class file { execute read getattr execute_no_trans };
        class dir { read getattr search };
}

#============= mount_t ==============
allow mount_t fusefs_t:dir { read getattr };
allow mount_t java_exec_t:file { read getattr execute execute_no_trans };
allow mount_t proc_net_t:dir search;
allow mount_t proc_net_t:file { read getattr };
allow mount_t random_device_t:chr_file read;
allow mount_t self:process { execstack execmem getsched setrlimit };
allow mount_t self:tcp_socket { accept listen };
allow mount_t shell_exec_t:file { read execute getattr execute_no_trans };

--Mike

On 01/05/2010 08:05 AM, Gibbon, Robert, VF-Group wrote:
> Can someone please tell me if it would be possible to run hadoop with
> SELinux enabled across the cluster? Are there any known issues or, better,
> how-tos I can be pointed at? Also interested in running iptables on the
> nodes - easy to do?
Check if I'm subscribed to this list
Hi,

This is just to check whether I'm subscribed to this list, because I posted some questions and haven't yet received a response.

--
Pedro
0.20.2 HDFS incompatible with 0.20.1
Hey all,

In a recent discussion, we noticed that the 0.20.2 HDFS client will not be wire-compatible with 0.20.0 or 0.20.1 due to the inclusion of HDFS-793 (required for HDFS-101). This raises a few questions:

1) Although we certainly do not guarantee wire compatibility between minor versions (0.20 - 0.21), have we previously implied wire compatibility between bugfix releases?

2) Is the above something we *should* be guaranteeing already?

3) If we haven't guaranteed the above, how many users think we have? (i.e. how do we correctly call out this fact in the 0.20.2 release notes in such a way that no one gets surprised?) I can imagine plenty of organizations where a lockstep upgrade between client and server is difficult, and we should make sure that cluster operators know it will be necessary. Since it wasn't necessary between 0.20.0 and 0.20.1, or between various 0.18 releases, people may have grown used to non-lockstep upgrades.

4) If the above are problems, would it be worth considering a patch for branch-20 that provides a client that is compatible with either, based on the datanode protocol version number of the server? It seems like a bit of scary complexity, but I wanted to throw it out there.

Thanks,
-Todd
Re: 0.20.2 HDFS incompatible with 0.20.1
On 1/5/10 11:29 AM, Todd Lipcon t...@cloudera.com wrote:

> 1) Although we certainly do not guarantee wire compatibility between minor
> versions (0.20 - 0.21), have we previously implied wire compatibility
> between bugfix releases?

IIRC, it has been implied and was a goal, but not officially written anywhere public that I know of.

> 2) Is the above something we *should* be guaranteeing already?

A) From an ops perspective, the lack of compatibility between even minor releases is a pain.

B) Most folks with even slightly complex environments are likely patching Hadoop. A good chunk of those patches likely break compatibility. [For example, we're working on a TCP buffer patch for HDFS to fix what we suspect is a latency problem. Does it break compat? Maybe.]

> 3) If we haven't guaranteed the above, how many users think we have? (ie
> how do we correctly call out this fact in the 0.20.2 release notes in such
> a way that no one gets surprised).

I suspect most folks don't even know that micros are incompatible until they suddenly realize that distcp doesn't work.

> I can imagine plenty of organizations where a lockstep upgrade between
> client and server is difficult, and we should make sure that cluster
> operators know it will be necessary. Since it wasn't necessary between
> 0.20.0 and 0.20.1, or various 0.18 releases, people may have grown used to
> non-lockstep upgrades.

I can easily see many organizations only upgrading when something breaks, due to the Hadoop binaries being spread far and wide and under the control of many different departments without any sort of centralized management. Despite the development model, I suspect few ever do just a mapred or hdfs upgrade, so any change in the stack will likely trigger a full Hadoop upgrade.

> 4) If the above are problems, would it be worth considering a patch for
> branch-20 that provides a client that is compatible with either, based on
> the datanode protocol version number of the server? It seems like a bit of
> scary complexity, but wanted to throw it out there.

Everyone knows I don't mind playing devil's advocate :), so let me ask the obvious question: bugs are bad, etc., etc., but is it so critical that this has to be in the 0.20 branch at all? I'd rather see the community spend cycles on 0.21 than worrying about 0.20, given that we're fast approaching the 1yr birthday of 0.20.0.
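Todd's option 4 - a client that adapts its behaviour to the server's data-transfer protocol version - could be sketched roughly as below. This is not actual Hadoop code: the class name, the version constants, and the "v1"/"v2" packet tags are all hypothetical placeholders, purely to illustrate the version-dispatch idea.

```python
# Hypothetical sketch (NOT Hadoop source): a client that chooses its wire
# behaviour based on the protocol version negotiated with the datanode.
# Version numbers here are made-up placeholders, not the real HDFS ones.

OLD_PROTOCOL = 13   # stands in for the pre-HDFS-793 version
NEW_PROTOCOL = 14   # stands in for the post-HDFS-793 version

class DualVersionClient:
    """Speaks either wire format, picked once per connection."""

    def __init__(self, server_version):
        self.server_version = server_version

    def build_read_request(self, block_id):
        # Old servers expect the old packet layout, new servers the new one;
        # the tag strings stand in for the two serializations.
        if self.server_version >= NEW_PROTOCOL:
            return ("v2", block_id)
        return ("v1", block_id)

# Against an old (0.20.1-era) server the client falls back to the old format.
client = DualVersionClient(server_version=OLD_PROTOCOL)
assert client.build_read_request(42) == ("v1", 42)
```

The scary complexity Todd mentions is real: every read/write path needs a branch like this, and both branches need testing against both server generations.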
Hadoop User Group (Bay Area) - Jan 20th at Yahoo!
Hi all,

Happy new year! RSVP is now open for the first 2010 Bay Area Hadoop user group at the Yahoo! Sunnyvale campus, planned for Jan 20th. Registration is available here: http://www.meetup.com/hadoop/calendar/12229988/ The agenda will be posted soon.

Looking forward to seeing you there,
Dekel
Re: Which Hadoop product is more appropriate for a quick query on a large data set?
Hi Todd:

After finishing some tasks I finally got back to HDFS testing. One question about your last reply to this thread: are there any code examples close to your second and third recommendations? Or which APIs should I start with for my testing?

Thanks,
Xueling

On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon t...@cloudera.com wrote:
> Hi Xueling,
>
> In that case, I would recommend the following:
>
> 1) Put all of your data on HDFS
>
> 2) Write a MapReduce job that sorts the data by position of match
>
> 3) As a second output of this job, you can write a "sparse index" -
> basically a set of entries like this:
>
>     position of match | offset into file | number of entries following
>
> where you're basically giving offsets into every 10K records or so. If you
> index every 10K records, then 5 billion total will mean 100,000 index
> entries. Each index entry shouldn't be more than 20 bytes, so 100,000
> entries will be 2MB. This is super easy to fit into memory. (You could
> probably index every 100th record instead and end up with 200MB, still
> easy to fit in memory.)
>
> Then to satisfy your count-range query, you can simply scan your in-memory
> sparse index. Some of the indexed blocks will be completely included in
> the range, in which case you just add up the "number of entries following"
> column. The start and finish blocks will be partially covered, so you can
> use the file offset info to load that file off HDFS, start reading at that
> offset, and finish the count.
>
> Total time per query should be < 100ms, no problem.
>
> -Todd
>
> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu x...@systemsbiology.org wrote:
>> Hi Todd:
>>
>> Thank you for your reply. The datasets won't be updated often, but
>> queries against a data set are frequent. The quicker the query, the
>> better. For example, we have done testing on a MySQL database (5 billion
>> records randomly scattered into 24 tables) and the slowest query against
>> the biggest table (400,000,000 records) is around 12 mins. So if using
>> any Hadoop product can speed up the search, then that product is what we
>> are looking for.
>>
>> Cheers,
>> Xueling
>>
>> On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon t...@cloudera.com wrote:
>>> Hi Xueling,
>>>
>>> One important question that can really change the answer: how often
>>> does the dataset change? Can the changes be merged in in bulk every
>>> once in a while, or do you need to actually update them randomly very
>>> often?
>>>
>>> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,
>>> or 10ms?
>>>
>>> Thanks,
>>> -Todd
>>>
>>> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu x...@systemsbiology.org wrote:
>>>> Hi there:
>>>>
>>>> I am researching Hadoop to see which of its products suits our need
>>>> for quick queries against large data sets (billions of records per
>>>> set). The queries will be performed against chip sequencing data.
>>>> Each record is one line in a file. To be clear, below is a sample
>>>> record from the data set. One line (record) looks like:
>>>>
>>>> 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
>>>>
>>>> The highlighted field is called "position of match" and the query we
>>>> are interested in is the number of sequences in a certain range of
>>>> this "position of match". For instance, the range can be "position of
>>>> match" > 200 and "position of match" + 36 < 200,000.
>>>>
>>>> Any suggestions on the Hadoop product I should start with to
>>>> accomplish the task? HBase, Pig, Hive, or ...?
>>>>
>>>> Thanks!
>>>> Xueling
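The sparse-index scheme Todd describes above can be sketched in a few lines. This is a minimal stand-alone sketch (plain Python, no Hadoop or HDFS involved): the index interval and sample positions are tiny made-up values for illustration, and the "offset" here is just a list index standing in for a file offset on HDFS.

```python
# Sketch of Todd's sparse index over sorted "position of match" values.
# Each index entry is (first_position_in_block, offset, entry_count).
# In the real scheme the offset would point into the sorted file on HDFS.
INDEX_EVERY = 4  # Todd suggests ~10K records per entry; tiny here for demo

def build_sparse_index(positions):
    """positions: sorted list of 'position of match' values."""
    index = []
    for i in range(0, len(positions), INDEX_EVERY):
        block = positions[i:i + INDEX_EVERY]
        index.append((block[0], i, len(block)))
    return index

def count_in_range(index, positions, lo, hi):
    """Count records with lo <= position <= hi.

    Fully covered blocks contribute their stored counts; the partially
    covered edge blocks are scanned record by record (the step that, on
    HDFS, would read the file starting at the stored offset).
    """
    total = 0
    for start_pos, offset, n in index:
        block = positions[offset:offset + n]
        if start_pos >= lo and block[-1] <= hi:
            total += n  # block fully inside the range: use stored count
        else:
            total += sum(lo <= p <= hi for p in block)  # edge block: scan
    return total

positions = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
idx = build_sparse_index(positions)
assert count_in_range(idx, positions, 25, 95) == 7  # 30..90 inclusive
```

The point of the design is that only the two edge blocks ever touch disk; everything else is answered from the few-MB in-memory index, which is why sub-100ms per query is plausible.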
Re: Three questions about Hadoop
Hi Annie,

2010/1/5 qin.wang qin.w...@i-soft.com.cn

> Hi team,
>
> When I try to do some research on Hadoop, I have several high-level
> questions; any comments from you would be a great help to me:
>
> 1. Hadoop assumes the files are big files, but take Google as an example:
> the Google results for users seem to be small files, so how should I
> understand the "big files"? And what is the file content, for example?

I think "big files" means very large files (bigger than 64MB). Hadoop uses HDFS as its distributed filesystem; user logs, web logs, etc. are stored in HDFS, and engineers can use Hadoop to run analysis on those logs. Anyway, I don't know whether Google puts its web pages in a distributed filesystem like this.

> 2. Why are the files write-once and read-many times?

As mentioned in the last section, the logs are stored in HDFS; these logs are written once and then read many times by engineers.

> 3. How do we install other software on Hadoop; are there any special
> requirements for the software? Does it need to support the Map/Reduce
> model before it can be installed?

I'm not sure what you mean; maybe you would like to add additional jars used by your application. If so, the distributed cache in Hadoop will help you.

Good luck!

> Your help would be much appreciated.
>
> Wang Qin (Annie Wang)
> 6/F, Building 7, 418 Guilin Road, Xuhui District, Shanghai
> Zip code: 200233
> Tel: +86 21 5497 8666-8004
> Fax: +86 21 5497 7986
> Mobile: +86 137 6108 8369

--
http://anqiang1900.blog.163.com/
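To make the "big files" point above concrete: HDFS stores a file as fixed-size blocks (64MB by default in this era), and every file and block costs namenode memory, which is why one big file is much cheaper than many small ones. A quick back-of-the-envelope sketch (plain Python; the 10GB and 1MB file sizes are made-up examples):

```python
# Why HDFS favours big files: block/file metadata lives in namenode RAM,
# so many small files are expensive while one big file is cheap.
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default HDFS block size

def num_blocks(file_size):
    # Ceiling division: even a 1-byte file occupies one block entry.
    return -(-file_size // BLOCK_SIZE)

one_big_file = 10 * 1024**3             # a single 10GB log file
assert num_blocks(one_big_file) == 160  # 10GB / 64MB = 160 block entries

# The same 10GB split into 10,000 x 1MB files costs 10,000 block entries
# (plus 10,000 file entries) of namenode metadata - ~60x more.
assert 10_000 * num_blocks(1024**2) == 10_000
```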