Hadoop with SELinux?

2010-01-05 Thread Gibbon, Robert, VF-Group
Hello list

Can someone please tell me whether it is possible to run Hadoop with SELinux
enabled across the cluster? Are there any known issues or, better, how-tos I
can be pointed at? I am also interested in running iptables on the nodes - is
that easy to do?

Many thanks in advance
Robert

Robert Gibbon
Solutions Architect
Integration Design & Solution Engineering

Vodafone Group Service GmbH
Mannesmannufer 2, D-40213 Düsseldorf, Germany
Amtsgericht Düsseldorf, HRB 53554
Geschäftsführung: Helmut Hoffmann, Dr. Joachim Peters




Re: Hadoop with SELinux?

2010-01-05 Thread Michael Thomas
We have used SELinux on our large cluster with HDFS (we don't use MR).
The only issue I've found is that the mount program does not have
permission to execute Java, which prevents you from mounting the FUSE
filesystem from /etc/fstab. This is fixed with the policy file below.


require {
    type mount_t;
    type shell_exec_t;
    type proc_net_t;
    type random_device_t;
    type java_exec_t;
    type fusefs_t;
    class process { execstack execmem getsched setrlimit };
    class tcp_socket { accept listen };
    class chr_file read;
    class file { execute read getattr execute_no_trans };
    class dir { read getattr search };
}

#= mount_t ==
allow mount_t fusefs_t:dir { read getattr };
allow mount_t java_exec_t:file { read getattr execute execute_no_trans };
allow mount_t proc_net_t:dir search;
allow mount_t proc_net_t:file { read getattr };
allow mount_t random_device_t:chr_file read;
allow mount_t self:process { execstack execmem getsched setrlimit };
allow mount_t self:tcp_socket { accept listen };
allow mount_t shell_exec_t:file { read execute getattr execute_no_trans };
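
In case it helps anyone else, here is a sketch of one way to build and load
a local policy module like this. It assumes the rules above are saved as
hadoopfuse.te with a "module hadoopfuse 1.0;" line at the top; the module
name is arbitrary.

checkmodule -M -m -o hadoopfuse.mod hadoopfuse.te
semodule_package -o hadoopfuse.pp -m hadoopfuse.mod
semodule -i hadoopfuse.pp

Alternatively, audit2allow -M hadoopfuse can generate and package an
equivalent module directly from the AVC denials in the audit log.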


--Mike

On 01/05/2010 08:05 AM, Gibbon, Robert, VF-Group wrote:

Hello list

Can someone please tell me whether it is possible to run Hadoop with SELinux
enabled across the cluster? Are there any known issues or, better, how-tos I
can be pointed at? I am also interested in running iptables on the nodes - is
that easy to do?

Many thanks in advance
Robert



Check if I'm subscribed to this list

2010-01-05 Thread psdc1978
Hi,

It's just to check whether I'm subscribed to this list, because I posted
some questions and haven't gotten a response yet.

-- 
Pedro


0.20.2 HDFS incompatible with 0.20.1

2010-01-05 Thread Todd Lipcon
Hey all,

In a recent discussion, we noticed that the 0.20.2 HDFS client will not be
wire-compatible with 0.20.0 or 0.20.1 due to the inclusion of HDFS-793
(required for HDFS-101). This raises a few questions:

1) Although we certainly do not guarantee wire compatibility between minor
versions (0.20 -> 0.21), have we previously implied wire compatibility
between bugfix releases?

2) Is the above something we *should* be guaranteeing already?

3) If we haven't guaranteed the above, how many users think we have? (i.e., how
do we correctly call out this fact in the 0.20.2 release notes in such a way
that no one gets surprised?) I can imagine plenty of organizations where a
lockstep upgrade between client and server is difficult, and we should make
sure that cluster operators know it will be necessary. Since it wasn't
necessary between 0.20.0 and 0.20.1, or various 0.18 releases, people may
have grown used to non-lockstep upgrades.

4) If the above are problems, would it be worth considering a patch for
branch-20 that provides a client that is compatible with either, based on
the datanode protocol version number of the server? It seems like a bit of
scary complexity, but wanted to throw it out there.

Thanks
-Todd


Re: 0.20.2 HDFS incompatible with 0.20.1

2010-01-05 Thread Allen Wittenauer
On 1/5/10 11:29 AM, Todd Lipcon t...@cloudera.com wrote:
 1) Although we certainly do not guarantee wire compatibility between minor
 versions (0.20 -> 0.21), have we previously implied wire compatibility
 between bugfix releases?

IIRC, it has been implied and was a goal but not officially written anywhere
public that I know of.

 2) Is the above something we *should* be guaranteeing already?

A) From an ops perspective, the lack of compatibility between even minor
releases is a pain.

B) Most folks with even slightly complex environments are likely patching
Hadoop. A good chunk of those patches likely break compatibility.
[For example, we're working on a TCP buffer patch for HDFS to fix what we
suspect is a latency problem.  Does it break compat?  Maybe.]

 3) If we haven't guaranteed the above, how many users think we have? (i.e., how
 do we correctly call out this fact in the 0.20.2 release notes in such a way
 that no one gets surprised?)

I suspect most folks don't even know that micros are incompatible until they
suddenly realize that distcp doesn't work.

 I can imagine plenty of organizations where a
 lockstep upgrade between client and server is difficult, and we should make
 sure that cluster operators know it will be necessary. Since it wasn't
 necessary between 0.20.0 and 0.20.1, or various 0.18 releases, people may
 have grown used to non-lockstep upgrades.

I can easily see many organizations only upgrading when it breaks due to the
Hadoop binaries being spread far and wide and under the control of many
different departments without any sort of centralized management.

Despite the development model, I suspect few ever do just a mapred or hdfs
upgrade, so any change in the stack will likely trigger a full Hadoop
upgrade.

 4) If the above are problems, would it be worth considering a patch for
 branch-20 that provides a client that is compatible with either, based on
 the datanode protocol version number of the server? It seems like a bit of
 scary complexity, but wanted to throw it out there.

Everyone knows I don't mind playing devil's advocate :), so let me ask the
obvious question:

Bugs are bad, etc, etc, but is it so critical that it has to be in the 0.20
branch at all?

I'd rather see the community spend cycles on 0.21 than worry about 0.20,
given that we're fast approaching the one-year birthday of 0.20.0.



Hadoop User Group (Bay Area) - Jan 20th at Yahoo!

2010-01-05 Thread Dekel Tankel
Hi all,

Happy new year!

RSVP is now open for the first Bay Area Hadoop User Group of 2010 at the Yahoo!
Sunnyvale Campus, planned for Jan 20th.

Registration is available here
http://www.meetup.com/hadoop/calendar/12229988/

Agenda will be posted soon.

Looking forward to seeing you there


Dekel



Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-05 Thread Xueling Shu
Hi Todd:

After finishing some other tasks, I have finally gotten back to HDFS testing.

One question about your last reply to this thread: are there any code examples
close to your second and third recommendations? Or which APIs should I start
with for my testing?

Thanks.
Xueling

On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Xueling,

 In that case, I would recommend the following:

 1) Put all of your data on HDFS
 2) Write a MapReduce job that sorts the data by position of match
 3) As a second output of this job, you can write a sparse index -
 basically a set of entries like this:

 position of match | offset into file | number of entries following

 where you're basically giving offsets into every 10K records or so. If
 you index every 10K records, then 5 billion total will mean 500,000
 index entries. Each index entry shouldn't be more than 20 bytes, so
 500,000 entries will be about 10MB. This is super easy to fit into memory.
 (you could probably index every 100th record instead and end up with
 around 1GB, which still fits in memory on a reasonably sized server)

 Then to satisfy your count-range query, you can simply scan your
 in-memory sparse index. Some of the indexed blocks will be completely
 included in the range, in which case you just add up the "number of
 entries following" column. The start and finish blocks will be
 partially covered, so you can use the file offset info to load that
 file off HDFS, start reading at that offset, and finish the count.

 Total time per query should be under 100ms, no problem.

 -Todd
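
To make the lookup step concrete, here is a rough sketch (not code from this
thread) of the in-memory scan over the sparse index. The IndexEntry layout
and the names are made up, and the HDFS read for partially covered blocks is
left as a stub.

import java.util.List;

public class SparseIndexCounter {

  static class IndexEntry {
    final long position;    // first "position of match" covered by this block
    final long fileOffset;  // byte offset of the block in the sorted file on HDFS
    final long entryCount;  // "number of entries following", i.e. records in the block
    IndexEntry(long position, long fileOffset, long entryCount) {
      this.position = position;
      this.fileOffset = fileOffset;
      this.entryCount = entryCount;
    }
  }

  // Counts records whose position of match lies in [lo, hi), given a sparse
  // index sorted by position over a file that is itself sorted by position.
  static long countInRange(List<IndexEntry> index, long lo, long hi) {
    long total = 0;
    for (int i = 0; i < index.size(); i++) {
      long blockStart = index.get(i).position;
      long blockEnd = (i + 1 < index.size()) ? index.get(i + 1).position : Long.MAX_VALUE;
      if (blockEnd <= lo || blockStart >= hi) {
        continue;                          // block entirely outside the range
      } else if (blockStart >= lo && blockEnd <= hi) {
        total += index.get(i).entryCount;  // block entirely inside: just add its count
      } else {
        // Partially covered block: open the sorted file on HDFS, seek to
        // fileOffset, and count matching records by scanning the block.
        total += countPartialBlock(index.get(i).fileOffset, lo, hi);
      }
    }
    return total;
  }

  static long countPartialBlock(long fileOffset, long lo, long hi) {
    return 0; // stub: FileSystem.open(path), seek(fileOffset), scan and count
  }
}

With roughly 500,000 index entries a linear scan like this is already fast;
a binary search over the block start positions would make it faster still.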

 On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu x...@systemsbiology.org
 wrote:
  Hi Todd:
 
  Thank you for your reply.
 
  The datasets won't be updated often, but queries against a data set are
  frequent, and the quicker the query, the better. For example, we have done
  testing on a MySQL database (5 billion records randomly scattered into 24
  tables) and the slowest query against the biggest table (400,000,000
  records) is around 12 mins. So if using any Hadoop product can speed up
  the search, then that product is what we are looking for.
 
  Cheers,
  Xueling
 
  On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon t...@cloudera.com wrote:
 
  Hi Xueling,
 
  One important question that can really change the answer:
 
  How often does the dataset change? Can the changes be merged in in
  bulk every once in a while, or do you need to actually update them
  randomly very often?
 
  Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second, or
  10ms?
 
  Thanks
  -Todd
 
  On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu x...@systemsbiology.org
  wrote:
   Hi there:

   I am researching Hadoop to see which of its products suits our need for
   quick queries against large data sets (billions of records per set).

   The queries will be performed against chip sequencing data. Each record
   is one line in a file. To be clear, below is a sample record from the
   data set.

   One line (record) looks like:
   1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C

   The highlighted field is called "position of match", and the query we are
   interested in is the number of sequences in a certain range of this
   position of match. For instance, the range can be position of match > 200
   and position of match + 36 < 200,000.

   Any suggestions on the Hadoop product I should start with to accomplish
   the task? HBase, Pig, Hive, or ...?

   Thanks!

   Xueling
  
 
 



Re: Three questions about Hadoop

2010-01-05 Thread Andrew Wang
Hi Annie,

2010/1/5 qin.wang qin.w...@i-soft.com.cn

 Hi team,



 While doing some research on Hadoop, I have come up with several high-level
 questions; any comments from you would be a great help:

 1. Hadoop assumes the files are big files, but taking Google as an example,
 the results returned to users seem to be small files, so how should I
 understand the "big files"? And what is the file content, for example?

 I think "big files" means very large files (bigger than the 64MB default
block size). Hadoop uses HDFS as its distributed filesystem; user logs, web
logs, etc. are stored in HDFS, and engineers can use Hadoop to run analysis
on those logs. That said, I don't know whether Google puts its web pages in
a distributed filesystem like this.


 2. Why are the files write-once and read-many times?


 As mentioned above, the logs are stored in HDFS; they are written once and
then read many times by the engineers who analyze them.


 3. How do I install other software on Hadoop? Are there any special
 requirements for the software? Does it need to support the Map/Reduce model
 before it can be installed?

 I'm not sure exactly what you mean. If you would like to add additional jars
used by your application, the distributed cache in Hadoop will help you.
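
For example, here is a rough sketch against the 0.20-era DistributedCache
API; the paths and class name below are just placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheJarExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The jar must already be in HDFS; it is then added to every task's classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/my-extra-lib.jar"), conf);
    // Plain data files can also be cached and read locally by each task.
    DistributedCache.addCacheFile(new URI("/data/lookup.txt#lookup.txt"), conf);
    // ... then configure and submit the job with this conf as usual ...
  }
}

The same effect is available from the command line via the -libjars option
when the job's driver goes through ToolRunner/GenericOptionsParser.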

Good Luck!


 Your help would be very much appreciated.



 王 琴 (Annie Wang)



 Floor 6, Building 7, No. 418 Guilin Road, Xuhui District, Shanghai
 Zip code: 200 233
 Tel:  +86 21 5497 8666-8004
 Fax: +86 21 5497 7986
 Mobile:  +86 137 6108 8369






-- 
http://anqiang1900.blog.163.com/