Re: Creating Lucene index in Hadoop

2009-03-17 Thread Doug Cutting

Ning Li wrote:

1 is good. But for 2:
  - Won't it have a security concern as well? Or is this not a general
local cache?


A client-side RAM cache would be filled through the same security 
mechanisms as all other filesystem accesses.



  - You are referring to caching in RAM, not caching in local FS,
right? In general, a Lucene index size could be quite large. We may
have to cache a lot of data to reach a reasonable hit ratio...


Lucene on a local disk benefits significantly from the local 
filesystem's RAM cache (aka the kernel's buffer cache).  HDFS has no 
such local RAM cache outside of the stream's buffer.  The cache would 
need to be no larger than the kernel's buffer cache to get an equivalent 
hit ratio.  And if you're accessing a remote index then you shouldn't 
also need a large buffer cache.


Doug


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Doug Cutting

Ning Li wrote:

With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.


I don't think HADOOP-4801 is required.  It would help, certainly, but 
it's so fraught with security and other issues that I doubt it will be 
committed anytime soon.


What would probably help HDFS random access performance for Lucene 
significantly would be:
 1. A cache of connections to datanodes, so that each seek() does not 
require an open().  If we move HDFS data transfer to be RPC-based (see, 
e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will 
come for free, since RPC already caches connections.  We hope to do this 
for Hadoop 1.0, so that we use a single transport for all Hadoop's core 
operations, to simplify security.
 2. A local cache of read-only HDFS data, equivalent to kernel's buffer 
cache.  This might be implemented as a Lucene Directory that keeps an 
LRU cache of buffers from a wrapped filesystem, perhaps a subclass of 
RAMDirectory.


With these, performance would still be slower than a local drive, but 
perhaps not so dramatically.
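To make (2) concrete, here is a minimal sketch of a block-level LRU cache using only the standard library; the class name and key scheme are purely illustrative, and a real version would wrap a Lucene Directory and fill misses by reading the corresponding range from the underlying FileSystem:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an LRU cache of fixed-size buffers keyed by (file, block index).
public class BlockCache extends LinkedHashMap<String, byte[]> {
  private final int maxBlocks;

  public BlockCache(int maxBlocks) {
    super(16, 0.75f, true);            // access-order iteration => LRU eviction
    this.maxBlocks = maxBlocks;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
    return size() > maxBlocks;         // evict the least-recently-used block
  }

  public byte[] get(String file, long blockIndex) {
    return get(file + "@" + blockIndex);
  }

  public void put(String file, long blockIndex, byte[] block) {
    put(file + "@" + blockIndex, block);
  }
}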


Doug


Re: about block size

2009-03-12 Thread Doug Cutting
One factor is that block size should minimize the impact of disk seeks. 
 For example, if a disk seeks in 10ms and transfers at 100MB/s, then a 
good block size will be substantially larger than 1MB.  With 100MB 
blocks, seeks would only slow things by 1%.
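Spelling out that arithmetic (same assumed 10ms seek and 100MB/s transfer):

double seekMs = 10.0;                                   // one disk seek
double transferMBps = 100.0;                            // sustained transfer rate
double blockMB = 100.0;                                 // candidate block size

double transferMs = blockMB / transferMBps * 1000.0;    // 1000 ms to read the block
double seekOverhead = seekMs / (seekMs + transferMs);   // ~0.0099, i.e. about 1%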


Another factor is that, unless files are smaller than the block size, 
larger blocks means fewer blocks, and fewer blocks make for a more 
efficient namenode.


The primary harm of too large blocks is that you will end up with fewer 
map tasks than nodes, and not use your cluster optimally.


Doug

ChihChun Chu wrote:

Hi,

I have a question about how to decide the block size.
As I understand it, the block size is related to the namenode's heap size (how
many blocks can be handled), the total storage capacity of the cluster, the file
sizes (which depend on the application, e.g. a 1TB log file), the number of
replicas, and MapReduce performance.
In Google's paper they used 64MB as their block size. Yahoo and Facebook seem to
set the block size to 128MB. Hadoop's default value is 64MB. I don't know why
64MB or 128MB. Is that the result of the tradeoffs I mentioned above? How do I
decide the block size if I want to build my application upon Hadoop? Is there
any criterion or formula?

Any opinions or comments will be appreciated.


stchu



Re: Not a host:port pair when running balancer

2009-03-11 Thread Doug Cutting

Konstantin Shvachko wrote:

The port was not specified at all in the original configuration.


Since 0.18, the port is optional.  If no port is specified, then 8020 is 
used.  8020 is the default port for namenodes.


https://issues.apache.org/jira/browse/HADOOP-3317

Doug


Re: Not a host:port pair when running balancer

2009-03-11 Thread Doug Cutting

Konstantin Shvachko wrote:

Clarifying: port # is missing in your configuration, should be

  fs.default.name
  hdfs://hvcwydev0601:8020


where 8020 is your port number.


That's the work-around, but it's a bug.  One should not need to specify 
the default port number (8020).  Please file an issue in Jira.


Doug


Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Doug Cutting

Owen O'Malley wrote:
It depends on the property whether they come from the job's 
configuration or the system's. Some  like io.sort.mb and 
mapred.map.tasks come from the job, while others like 
mapred.tasktracker.map.tasks.maximum come from the system.


There is some method to the madness.

Things that are only set programmatically, like most job parameters, 
e.g., the mapper, reducer, etc., are not listed in hadoop-default.xml, 
since they don't make sense to configure cluster-wide.


Defaults are overridden by hadoop-site.xml, but a job can then override 
hadoop-site.xml unless hadoop-site.xml declares a property to be final, in 
which case any value specified by the job is ignored.


There are a few odd cases of things that jobs might want to override but 
they cannot.  For example, a job might wish to override 
mapred.tasktracker.map.tasks.maximum, but, if you think a bit more, this 
is read by the tasktracker at startup and cannot be reasonably changed 
per job, since a tasktracker can run tasks from different jobs 
simultaneously.


So things that make sense per-job and are not declared final in your 
hadoop-site.xml can generally be overridden by the job.
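For example, an admin who wanted to pin io.sort.mb cluster-wide could mark it final in hadoop-site.xml; something like the following (the value is chosen only for illustration):

<property>
  <name>io.sort.mb</name>
  <value>100</value>
  <final>true</final>   <!-- job configurations can no longer override this -->
</property>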


Doug


Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Doug Cutting

Ian Swett wrote:

We've used Jackson(http://jackson.codehaus.org/), which we've found to be easy 
to use and faster than any other option.


I also use Jackson and recommend it.

Doug


Re: How does NVidia GPU compare to Hadoop/MapReduce

2009-02-27 Thread Doug Cutting

I think they're complementary.

Hadoop's MapReduce lets you run computations on up to thousands of 
computers potentially processing petabytes of data.  It gets data from 
the grid to your computation, reliably stores output back to the grid, 
and supports grid-global computations (e.g., sorting).


CUDA can make computations on a single computer run faster by using its 
GPU.  It does not handle co-ordination of multiple computers, e.g., the 
flow of data in and out of a distributed filesystem, distributed 
reliability, global computations, etc.


So you might use CUDA within mapreduce to more efficiently run 
compute-intensive tasks over petabytes of data.


Doug

Mark Kerzner wrote:

Hi, this from Dr. Dobbs caught my attention, 240 CPU for $1,700

http://www.ddj.com/focal/NVIDIA-CUDA

What are your thoughts?

Thank you,
Mark



Re: FileInputFormat directory traversal

2009-02-03 Thread Doug Cutting

Hi, Ian.

One reason is that a MapFile is represented by a directory containing 
two files named "index" and "data".  SequenceFileInputFormat handles 
MapFiles too by, if an input file is a directory containing a data file, 
using that file.


Another reason is that's what reduces generate.

Neither reason implies that this is the best or only way of doing 
things.  It would probably be better if FileInputFormat optionally 
supported recursive file enumeration.  (It would be incompatible and 
thus cannot be the default mode.)
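A rough sketch of the kind of recursive enumeration being discussed, built on FileSystem#listStatus (method names from the API of that era, no PathFilter handling; purely illustrative):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLister {
  // Depth-first expansion of an input path into the files beneath it.
  public static void addInputsRecursively(FileSystem fs, Path path, List<Path> files)
      throws IOException {
    for (FileStatus status : fs.listStatus(path)) {
      if (status.isDir()) {
        addInputsRecursively(fs, status.getPath(), files);  // descend into subdirectory
      } else {
        files.add(status.getPath());                        // a regular file: use it
      }
    }
  }
}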


Please file an issue in Jira for this and attach your patch.

Thanks,

Doug

Ian Soboroff wrote:
Is there a reason FileInputFormat only traverses the first level of 
directories in its InputPaths?  (i.e., given an InputPath of 'foo', it 
will get foo/* but not foo/bar/*).


I wrote a full depth-first traversal in my custom InputFormat which I 
can offer as a patch.  But to do it I had to duplicate the PathFilter 
classes in FileInputFormat which are marked private, so a mainline patch 
would also touch FileInputFormat.


Ian



Re: Question about HDFS capacity and remaining

2009-01-30 Thread Doug Cutting

Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose of the 
reservation? Just to give root preference or leave wiggle room?


I think it's so that, when the disk is full, root processes don't fail, 
only user processes.  So you don't lose, e.g., syslog.  With modern 
disks, 5% is too much, especially for volumes that are only used for 
user data.  You can safely set this to 1%.


Doug


Re: Question about HDFS capacity and remaining

2009-01-29 Thread Doug Cutting
Ext2 by default reserves 5% of the drive for use by root only.  That'd 
be about 45GB of your 907GB capacity, which would account for most of the 
discrepancy.  You can adjust this with tune2fs.
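For example, something like the following (device name is only an example) drops the reservation to 1%:

tune2fs -m 1 /dev/sdb1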


Doug

Bryan Duxbury wrote:

There are no non-dfs files on the partitions in question.

df -h indicates that there is 907GB capacity, but only 853GB remaining, 
with 200M used. The only thing I can think of is the filesystem overhead.


-Bryan

On Jan 29, 2009, at 4:06 PM, Hairong Kuang wrote:


It's taken by non-dfs files.

Hairong


On 1/29/09 3:23 PM, "Bryan Duxbury"  wrote:


Hey all,

I'm currently installing a new cluster, and noticed something a
little confusing. My DFS is *completely* empty - 0 files in DFS.
However, in the namenode web interface, the reported "capacity" is
3.49 TB, but the "remaining" is 3.25TB. Where'd that .24TB go? There
are literally zero other files on the partitions hosting the DFS data
directories. Where am I losing 240GB?

-Bryan






Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting

Mark Kerzner wrote:

Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes.


I think perhaps you misunderstood my remarks.  My point was that, if you 
looked to Nutch's Content class for an example, it is, for historical 
reasons, somewhat more complicated than it needs to be and is thus a 
less than perfect example.  But using SequenceFile to store web content 
is certainly a best practice and I did not mean to imply otherwise.


Doug


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting

Philip (flip) Kromer wrote:

Heritrix, Nutch, and others use the ARC file format:
  http://www.archive.org/web/researcher/ArcFileFormat.php
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml


Nutch does not use ARC format but rather uses Hadoop's SequenceFile to 
store crawled pages.  The keys of crawl content files are URLs and the 
values are:


http://lucene.apache.org/nutch/apidocs/org/apache/nutch/protocol/Content.html

I believe that the implementation of this class pre-dates SequenceFile's 
support for compressed values, so the values are decompressed on demand, 
which needlessly complicates its implementation and API.  It's basically 
a Writable that stores binary content plus headers, typically an HTTP 
response.
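Stripped of the Nutch specifics, the pattern is just a SequenceFile keyed by URL; a minimal sketch, with a hypothetical output path and a placeholder value type (the real Nutch value is the richer Content Writable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ContentWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/crawl/segments/20090126/content");   // hypothetical path

    // Key = URL, value = fetched bytes.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      byte[] page = "<html>hello</html>".getBytes();            // stand-in for fetched content
      writer.append(new Text("http://example.com/"), new BytesWritable(page));
    } finally {
      writer.close();
    }
  }
}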


Doug


Re: using distcp for http source files

2009-01-23 Thread Doug Cutting
Can you please attach your latest version of this to 
https://issues.apache.org/jira/browse/HADOOP-496?


Thanks,

Doug

Boris Musykantski wrote:

we have fixed  some patches in JIRA for support of webdav server on
top of HDFS, updated to work with newer version (0.18.0 IIRC) and
added support for permissions.  See code and description here:

http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

Hope it is useful,

Regards,
Boris, IPonWeb

On Thu, Jan 22, 2009 at 2:30 PM, Doug Cutting  wrote:

Aaron Kimball wrote:

Is anyone aware of an OSS web dav library that
could be wrapped in a FileSystem implementation?

We'd need a Java WebDAV client to talk to foreign filesystems.  But to
expose HDFS to foreign filesystems (i.e., to better support mounting HDFS)
we'd need a Java WebDAV server, like http://milton.ettrema.com/.

Doug



Re: using distcp for http source files

2009-01-22 Thread Doug Cutting

Aaron Kimball wrote:

Is anyone aware of an OSS web dav library that
could be wrapped in a FileSystem implementation?


We'd need a Java WebDAV client to talk to foreign filesystems.  But to 
expose HDFS to foreign filesystems (i.e., to better support mounting 
HDFS) we'd need a Java WebDAV server, like http://milton.ettrema.com/.


Doug


Re: using distcp for http source files

2009-01-22 Thread Doug Cutting

Aaron Kimball wrote:

Doesn't the WebDAV protocol use http for file transfer, and support reads /
writes / listings / etc?


Yes.  Getting a WebDAV-based FileSystem in Hadoop has long been a goal. 
 It could replace libhdfs, since there are already WebDAV-based FUSE 
filesystems for Linux (wdfs, davfs2).  WebDAV is also mountable from 
Windows, etc.



Is anyone aware of an OSS web dav library that
could be wrapped in a FileSystem implementation?


Yes, Apache Slide does but it's dead.  Apache Jackrabbit also does and 
it is alive (http://jackrabbit.apache.org/).


Doug


Re: using distcp for http source files

2009-01-21 Thread Doug Cutting

Derek Young wrote:
Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like 
this should be supported, but the http URLs are not working for me.  Are 
http source URLs still supported?


No.  They used to be supported, but when distcp was converted to accept 
any Path this stopped working, since there is no FileSystem 
implementation mapped to http: paths.  Implementing an HttpFileSystem 
that supports read-only access to files and no directory listings is 
fairly trivial, but without directory listings, distcp would not work well.


https://issues.apache.org/jira/browse/HADOOP-1563 includes a now 
long-stale patch that implements an HTTP filesystem, where directory 
listings are implemented, assuming that:

  - directories are represented by slash-terminated URLs;
  - a GET of a directory returns the URLs of its children.

This works for the directory listings returned by many HTTP servers.
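A sketch of what such a listing can reduce to in practice: fetch the directory URL and pull out its child links (plain java.net, crude href matching, illustrative only):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpListingSketch {
  // Fetch a slash-terminated directory URL and extract the child links.
  public static List<String> list(String dirUrl) throws Exception {
    List<String> children = new ArrayList<String>();
    Pattern href = Pattern.compile("href=\"([^\"?#]+)\"");
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(dirUrl).openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = href.matcher(line);
        while (m.find()) {
          children.add(new URL(new URL(dirUrl), m.group(1)).toString());
        }
      }
    } finally {
      in.close();
    }
    return children;  // entries ending in "/" would be treated as subdirectories
  }
}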

Perhaps someone can update this patch, and, if folks find it useful, we 
can include it.


Doug


Re: Auditing and accounting with Hadoop

2009-01-07 Thread Doug Cutting
The notion of a client/task ID, independent of IP or username, seems 
useful for log analysis.  DFS's client ID is probably currently your 
best bet, but we might improve its implementation, and make the notion 
more generic.


It is currently implemented as:

String taskId = conf.get("mapred.task.id");
if (taskId != null) {
  this.clientName = "DFSClient_" + taskId;
} else {
  this.clientName = "DFSClient_" + r.nextInt();
}

This hardwires a mapred dependency, which is fragile, and it's fairly 
useless outside of mapreduce, degenerating to a random number.


Rather we should probably have a configuration property that's 
explicitly used to indicate the user-level task, that's different than 
the username, IP, etc.  For MapReduce jobs this could default to the 
job's ID, but applications might override it.  So perhaps we could add 
static methods FileSystem.{get,set}TaskId(Configuration), then change 
the logging code to use this?
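Roughly what the proposed FileSystem.{get,set}TaskId methods might look like, sketched here as a standalone helper with an illustrative property name (this is the proposal, not existing API):

import org.apache.hadoop.conf.Configuration;

public class TaskIdConfig {
  // Proposed, not yet implemented: a generic task/client identifier for log analysis.
  public static final String TASK_ID_PROPERTY = "hadoop.task.id";   // name is illustrative

  public static void setTaskId(Configuration conf, String taskId) {
    conf.set(TASK_ID_PROPERTY, taskId);
  }

  public static String getTaskId(Configuration conf) {
    // Fall back to the MapReduce task id when nothing was set explicitly.
    String id = conf.get(TASK_ID_PROPERTY);
    return id != null ? id : conf.get("mapred.task.id");
  }
}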


What do others think?

Doug

Brian Bockelman wrote:

Hey,

One of our charges is to do auditing and accounting with our file 
systems (we use the simplifying assumption that the users are 
non-malicious).


Auditing can be done by going through the namenode logs and utilizing 
the UGI information to track opens/reads/writes back to the users.  
Accounting can be done by adding up the byte counts from the datanode 
traces (or via the lovely metrics interfaces).  However, joining them 
together appears to be impossible!  The namenode audits record 
originating IP and UGI; the datanode audits contain the originating IP 
and DFSClient ID.  With 8 clients (and possibly 8 users) opening 
multiple files all from the same IP, it becomes a mess to untangle.


For example, in other filesystems, we've been able to construct a 
database with one row representing a file access from open-to-close.  We 
record the username, amount of time the file was open, number of bytes 
read, the remote IP, and the server which served the file (previous 
filesystem saved an entire file on server, not blocks).  Already, that 
model quickly is problematic as several servers take part in serving the 
file to the client.  The depressing, horrible file access pattern (Worse 
than random!  To read a 1MB record entirely with a read-buffer size of 
10MB, you can possibly read up to 2GB) of some jobs means that recording 
each read is not practical.


I'd like to record audit records and transfer accounting (at some level) 
into the DB.  Does anyone have any experience in doing this?  It seems 
that, if I can add the DFSClient ID into the namenode logs, I can record:
1) Each open (but miss the corresponding close) of a file at the 
namenode, along with the UGI, timestamp, IP
2) Each read/write on a datanode records the datanode, remote IP, 
DFSClient, bytes written/read, (but I miss the overall transaction 
time!  Possibly could be logged).  Don't record the block ID, as I can't 
map block ID -> file name in a cheap/easy manner (I'd have to either do 
this synchronously, causing a massive performance hit -- or do this 
asynchronously, and trip up over any files which were deleted after they 
were read).


This would allow me to see who is accessing what files, and how much 
that client is reading - but not necessarily which files they read from, 
if the same client ID is used for multiple files.  This also will allow 
me to trace reads back to specific users (so I can tell who has the 
worst access patterns and beat them).


So, my questions are:
a) Is anyone doing anything remotely similar which I can reuse?
b) Is there some hole in my logic which would render the approach useless?
c) Is my approach reasonable?  I.e., should I really be looking at 
inserting hooks into the DFSClient, as that's the only thing which can 
tell me information like "when did the client close the file?"?


Advice is welcome.

Brian


Re: Map input records(on JobTracker website) increasing and decreasing

2009-01-05 Thread Doug Cutting

Values can drop if tasks die and must be re-run.

Doug

Aaron Kimball wrote:

The actual number of input records is most likely steadily increasing. The
counters on the web site are inaccurate until the job is complete; their
values will fluctuate wildly. I'm not sure why this is.

- Aaron

On Mon, Jan 5, 2009 at 8:34 AM, Saptarshi Guha wrote:


Hello,
When I check the job tracker web page, and look at the Map Input
records read,the map input records goes up to say 1.4MN and then drops
to 410K and then goes up again.
The same happens with input/output bytes and output records.

Why is this? Is there something wrong with the mapper code? In my map
function, i assume I have received one line of input.
The oscillatory behavior does not occur for tiny datasets, but for 1GB
of data (tiny for others) i see this happening.

Thanks
Saptarshi
--
Saptarshi Guha - saptarshi.g...@gmail.com





Re: ssh problem

2009-01-05 Thread Doug Cutting
Ubuntu does not include the ssh server in client installations, so you 
need to install it yourself.


sudo apt-get install openssh-server

Doug

vinayak katkar wrote:

Hey
When I tried to install hadoop in ubuntu 8.04 I got an error ssh connection
refused to localhost at port 22.
Please any one can tell me the solution.
Thanks



Re: Hadoop corrupting files if file block size is 4GB and file size is 2GB

2008-12-22 Thread Doug Cutting
Why are you using such a big block size?  I suspect this problem will go 
away if you decrease your blocksize to less than 2GB.


This sounds like a bug, probably related to integer overflow: some part 
of Hadoop is using an 'int' where it should be using a 'long'.  Please 
file an issue in Jira, ideally with a test case.


A short-term fix might be to simply prohibit block sizes greater than 
2GB: that's what most folks are using, and that's what's tested, so 
that's effectively all that's supported.  If we incorporate tests for 
larger block sizes and fix this bug, then we might remove such a 
restriction.


Doug

Juho Mäkinen wrote:

I have been storing log data into hdfs cluster (just one datanode at
this moment) with 4GB as block size. It worked fine at the beginning
but now my individual file sizes have grown over 2GB and I cannot
access the files from HDFS cluster anymore. This seems to be happening
if the file size is over 2GB. All files which are under 2GB work fine.
There has been always enough disk space and time doesn't seem to be a
factor (for example 2008-11-24 doesn't work, but 2008-12-05 works)

"hadoop dfs -lsr /events/eventlog"
-rw-r--r--   1 garo supergroup 2177143062 2008-11-25 04:04
/events/eventlog/eventlog-2008-11-24 (doesn't work)
-rw-r--r--   1 garo supergroup 2121109956 2008-12-06 04:04
/events/eventlog/eventlog-2008-12-05 (works)

Note that 2008-12-05 filesize is less than 2^31 but 2008-11-24 is
larger than 2^31 (2 GB)


Example:
[g...@postmetal tmp]$ hadoop dfs -get /events/eventlog/eventlog-2008-11-24 .
get: null

Error log:
==> hadoop-garo-datanode-postmetal.pri.log <==
2008-12-22 10:52:12,325 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(127.0.0.1:50010,
storageID=DS-1049869337-10.157.67.82-50010-1221647796455,
infoPort=50075, ipcPort=50020):DataXceiver:
java.lang.IndexOutOfBoundsException
at java.io.DataInputStream.readFully(DataInputStream.java:175)
at 
org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1821)
at 
org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
at 
org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
at java.lang.Thread.run(Thread.java:619)

Datanode web interface for url:
http://postmetal.pri:50075/browseBlock.jsp?blockId=-7907060692488773710&blockSize=2177143062&genstamp=6286&filename=/events/eventlog/eventlog-2008-11-24&datanodePort=50010&namenodeInfoPort=50070
displays this:

Total number of blocks: 1
-7907060692488773710:   127.0.0.1:50010

Is this a known problem? Has hadoop ever been tested with block sizes
over 2GB? Are my files corrupted (I do have working backups in
non-hadoop system). If this is the case and hadoop doesn't support
such big block sizes then there should be a clear error message when
trying to add files with big block sizes. Or is the problem not in
block size but in some other place?

 - Juho Mäkinen


Re: [video] visualization of the hadoop code history

2008-12-16 Thread Doug Cutting

Owen O'Malley wrote:
It is interesting, but it would be more interesting to track the authors 
of the patch rather than the committer. The two are rarely the same. 


Indeed.  There was a period of over a year where I wrote hardly anything 
but committed almost everything.  So I am vastly overrepresented in commits.


Doug


Re: File loss at Nebraska

2008-12-09 Thread Doug Cutting

Steve Loughran wrote:
Alternatively, "why we should be exploring the configuration space more 
widely"


Are you volunteering?

Doug


Re: File loss at Nebraska

2008-12-08 Thread Doug Cutting

Brian Bockelman wrote:
To some extent, this whole issue is caused because we only have enough 
space for 2 replicas; I'd imagine that at 3 replicas, the issue would be 
much harder to trigger.


The unfortunate reality is that if you run a configuration that's 
different than most you'll likely run into bugs that others do not.  (As 
a side note, this is why we should try to minimize configuration 
options, so that everyone is running much the same system.)  Hopefully 
squashing the two bugs you've filed will substantially help things.


Doug


Re: ${user.name}, ${user.host}?

2008-12-03 Thread Doug Cutting
Variables in configuration files may be Java system properties or other 
configuration parameters.  The list of pre-defined Java system 
properties is at:


http://java.sun.com/javase/6/docs/api/java/lang/System.html#getProperties()

Unfortunately the host name is not in that list.  You could define it by 
adding something like the following to conf/hadoop-env.sh:


HADOOP_DATANODE_OPTS="-Dhost.name=`hostname`"
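With that in place you could reference ${host.name} from hadoop-site.xml just like ${user.name}, assuming the property is also set for whichever daemons read it; for example (path is illustrative):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}-${host.name}</value>
</property>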

Doug

sejong king wrote:

Hello,

In hadoop-site.xml, hadoop.tmp.dir uses ${user.name} to retrieve the username 
for its directory.  Is there a similar way to retrieve the name of the host 
that this particular datanode is starting up on? (in order to create unique 
directory names for each node.  something along the lines of: user.host, 
user.local.name, machine.name, etc...)

Thanks!
-Sej



  


Re: Which replica?

2008-12-01 Thread Doug Cutting
A task may read from more than one block.  For example, in line-oriented 
input, lines frequently cross block boundaries.  And a block may be read 
from more than one host.  For example, if a datanode dies midway through 
providing a block, the client will switch to using a different datanode. 
 So the mapping is not simple.  This information is also not, as you 
inferred, available to applications.  Why do you need this?  Do you have 
a compelling reason?


Doug

James Cipar wrote:
Is there any way to determine which replica of each chunk is read by a 
map-reduce program?  I've been looking through the hadoop code, and it 
seems like it tries to hide those kinds of details from the higher level 
API.  Ideally, I'd like the host the task was running on, the file name 
and chunk number, and the host the chunk was read from.


Amazon Web Services (AWS) Hosted Public Data Sets

2008-12-01 Thread Doug Cutting

This looks like it could be a great feature for EC2-based Hadoop users:

http://aws.amazon.com/publicdatasets/

Has anyone tried it yet?  Any datasets to share?

Doug


Re: Namenode BlocksMap on Disk

2008-12-01 Thread Doug Cutting

Billy Pearson wrote:
We are looking for a way to support smaller clusters also that might 
over run there heap size causing the cluster to crash.


Support for namespaces larger than RAM would indeed be a good feature to 
have.  Implementing this without impacting large cluster in-memory 
namenode performance should be possible, but may or may not be easy. 
You are welcome to tackle this task if it is a priority for you.


Doug


Re: Namenode BlocksMap on Disk

2008-11-26 Thread Doug Cutting

Brian Bockelman wrote:
Do you have any graphs you can share showing 50k opens / second (could 
be publicly or privately)?  The more external benchmarking data I have, 
the more I can encourage adoption amongst my university...


The 50k opens/second is from some internal benchmarks run at Y! nearly a 
year ago.  (It doesn't look like Y! runs that benchmark regularly 
anymore, as far as I can tell.)  I copied the graph to:


http://people.apache.org/~cutting/nn500.png

Note that all of the operations that modify the namespace top out at 
around 5k/second, since these are logged & flushed to disk.


I found some more recent micro namenode benchmarks at:

http://tinyurl.com/6bxoxz

These indicate that actual use doesn't hit these levels, but would 
still, on large clusters, be adversely affected by moving to a 
disk-based namespace.


Doug



Re: Namenode BlocksMap on Disk

2008-11-26 Thread Doug Cutting

Dennis Kubes wrote:
2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?


I think the assumption is that it would be considerably more than slight 
degradation.  I've seen the namenode benchmarked at over 50,000 opens 
per second.  If file data is on disk, and the namespace is considerably 
bigger than RAM, then a seek would be required per access.  At 
10ms/seek, that would give only 100 opens per second, or 500x slower. 
Flash storage today peaks at around 5k seeks/second.


For smaller clusters the namenode might not need to be able to perform 
50k opens/second, but for larger clusters we do not want the namenode to 
become a bottleneck.


Doug


Re: "Lookup" HashMap available within the Map

2008-11-25 Thread Doug Cutting

tim robertson wrote:

Thanks Alex - this will allow me to share the shapefile, but I need to
"one time only per job per jvm" read it, parse it and store the
objects in the index.
Is the Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?


In 0.19, with HADOOP-249, all tasks from a job can be run in a single 
JVM.  So, yes, you could access a static cache from Mapper.configure().
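A minimal sketch of that pattern against the old mapred API; the property name and load() step are stand-ins for reading and indexing the shapefile:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CachingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static Object index;                 // shared across tasks in a reused JVM

  public void configure(JobConf job) {
    synchronized (CachingMapper.class) {
      if (index == null) {                     // first task in this JVM pays the cost
        index = load(job.get("shapefile.path"));   // hypothetical property and loader
      }
    }
  }

  private static Object load(String path) {
    return new Object();                       // stand-in for parsing/indexing the file
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // ... use the shared index ...
  }
}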


Doug



Re: SecondaryNameNode on separate machine

2008-10-31 Thread Doug Cutting

Otis Gospodnetic wrote:

Konstantin & Co, please correct me if I'm wrong, but looking at 
hadoop-default.xml makes me think that dfs.http.address is only the URL for the NN 
*Web UI*.  In other words, this is where we people go look at the NN.

The secondary NN must then be using only the Primary NN URL specified in fs.default.name. 
 This URL looks like hdfs://name-node-hostname-here/.  Something in Hadoop then knows the 
exact port for the Primary NN based on the URI scheme (e.g. "hdfs://") in this 
URL.

Is this correct?


Yes.  The default port for an HDFS URI is 8020 (NameNode.DEFAULT_PORT). 
 The value of fs.default.name is used by HDFS.  When starting the 
namenode or datanodes, this must be an HDFS URI.  If this names an 
explicit port, then that will be used, otherwise the default, 8020 will 
be used.


The default port for HTTP URIs is 80, but the namenode typically runs 
its web UI on 50070 (the default for dfs.http.address).


Doug


Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Doug Cutting

Alex Loddengaard wrote:

That's the best I can do I think.  Can others chime in?


Another complicating factor is that, if a node dies, reduce tasks can be 
stalled waiting for map data to be re-generated.  So if all tasks were 
scheduled out of a single pool, one would need to be careful to never 
assign reduce tasks to all slots, since they could deadlock waiting for 
map data that would never be generated.  A few slots should thus be held 
back for map re-execution.  Alternately you could kill a reduce task to 
free a slot, but that could increase job latency considerably.


But I see no compelling reason that the scheduler couldn't handle such 
things.  Having separate pools for map and reduce tasks has some 
advantages (as we've both described).  And if you moved to a single pool 
then the task scheduler would need to take on these and perhaps other 
issues.


Doug


Re: Understanding file splits

2008-10-28 Thread Doug Cutting
This is hard to diagnose without knowing your InputFormat.  Each split 
returned by your #getSplits() implementation is passed to your 
#getRecordReader() implementation.  If your RecordReader is not stopping 
when you expect it to, then that's a problem in your RecordReader, no? 
Have you written a RecordReader from scratch?  If not, which have you 
modified?  And how have you modified it?


Doug

Malcolm Matalka wrote:

I am trying to write an InputFormat and I am having some trouble
understanding how my data is being broken up.  My input is a previous
hadoop job and I have added code to my record reader to print out the
FileSplit's start and end position, as well as where the last record I
read was located.  My record are all about 100 bytes so fairly small.
For one file I am seeing the following output:

 

start: 0 end: 45101881 pos: 67108800
start: 45101880 end: 90203762 pos: 67108810
start: 90203761 end: 135305643 pos: 134217621
start: 135305642 end: 180170980 pos: 180170902

 


Note, I have also specified in my InputFormat that isSplittable return
false.

 


I do not understand why there is overlap.  Note that on the second one,
I never appear to reach the end position.

 


Any suggestions?

 


Thanks




Re: Using hadoop as storage cluster?

2008-10-27 Thread Doug Cutting

David C. Kerber wrote:

There would be quite a few files in the 100kB to 2MB range, which are received 
and processed daily, with smaller numbers ranging up to ~600MB or so which are 
summarizations of many of the daily data files, and maybe a handful in the 1GB 
-  6GB range (disk images and database backups, mostly).  There would also be a 
few (comparatively few, that is) configuration files of a few kB each.


File size isn't so much the issue as is the total number of files.  If 
the total number of files that will be in the filesystem at a time is 
less than a few million, then you'll probably be fine.  If you need ten 
million files then you may still be fine if your namenode has a lot of 
memory (e.g., 16GB).  If you need a lot more than that, then HDFS is 
probably not well suited to your task.


Doug


Re: Distributed cache Design

2008-10-16 Thread Doug Cutting

Bhupesh Bansal wrote:

Minor correction: the graph size is about 6G, not 8G.


Ah, that's better.

With the jvm reuse feature in 0.19 you should be able to load it once 
per job into a static, since all tasks of that job can share a JVM. 
Things will get tight if you try to run two such jobs at once, since 
JVMs are only shared by a single job.


https://issues.apache.org/jira/browse/HADOOP-249

Doug




Re: Hadoop chokes on file names with ":" in them

2008-10-10 Thread Doug Cutting
The safest thing is to restrict your Hadoop file names to a 
common-denominator set of characters that are well supported by Unix, 
Windows, and URIs.  Colon is a special character on both Windows and in 
URIs.  Quoting is in theory possible, but it's hard to get it right 
everywhere in practice.  One can devise heuristics that determine 
whether a colon is intended to be part of a name in a relative path 
rather than indicating a URI scheme or a Windows device, but making sure 
that all components observe that heuristic (Java's URI handler, Windows 
FS, etc.) is impossible and this leads to inconsistent behavior.  HDFS 
prohibits colons in filenames for this reason.


Doug

Brian Bockelman wrote:

Hey all,

Hadoop tries to parse file names with ":" in them as a relative URL:

[EMAIL PROTECTED] ~]$ hadoop fs -put /tmp/test 
/user/brian/StageOutTest-24328-Fri-Oct-10-07:58:44-2008
put: Pathname /user/brian/StageOutTest-24328-Fri-Oct-10-07:58:44-2008 
from /user/brian/StageOutTest-24328-Fri-Oct-10-07:58:44-2008 is not a 
valid DFS filename.

Usage: java FsShell [-put <localsrc> ... <dst>]

Our users do timestamps like that *a lot*.  It appears that Hadoop tries 
to interpret the ":" as a sign that you are trying to use a relative URL.


Is there any reason to not support the ":" character in file names?

Brian


Re: LZO and native hadoop libraries

2008-09-30 Thread Doug Cutting

Arun C Murthy wrote:
You need to add libhadoop.so to your java.library.path. libhadoop.so 
is available in the corresponding release in the lib/native directory.


I think he needs to first build libhadoop.so, since he appears to be 
running on OS X and we only provide Linux builds of this in releases.


Doug


Re: "Could not get block locations. Aborting..." exception

2008-09-29 Thread Doug Cutting

Raghu Angadi wrote:
For the current implementation, you need around 3x fds. 1024 is too low 
for Hadoop. The Hadoop requirement will come down, but 1024 would be too 
low anyway.


1024 is the default on many systems.  Shouldn't we try to make the 
default configuration work well there?  If not, shouldn't we at least 
document this prominently in the "Cluster Setup" documentation?


Doug


Re: Monthly Hadoop User Group Meeting (Bay Area)

2008-09-09 Thread Doug Cutting

Chris K Wensel wrote:
doh, conveniently collides with the GridGain and GridDynamics 
presentations:


http://web.meetup.com/66/calendar/8561664/


Bay Area Hadoop User Group meetings are held on the third Wednesday 
every month.  This has been on the calendar for quite a while.


Doug


Re: About the name of second-namenode!!

2008-09-08 Thread Doug Cutting
Changing it will unfortunately cause confusion too.  Sigh.  This is why 
we should take time to name things well the first time.


Doug

叶双明 wrote:

Because the name of the secondary namenode causes so much confusion, does the
Hadoop team consider changing it?



Re: Hadoop + Elastic Block Stores

2008-09-08 Thread Doug Cutting

Ryan LeCompte wrote:
I'd really love to one day 
see some scripts under src/contrib/ec2/bin that can setup/mount the EBS 
volumes automatically. :-)


The fastest way might be to write & contribute such scripts!

Doug


Re: JVM Spawning

2008-09-05 Thread Doug Cutting
LocalJobRunner allows you to test your code with everything running in a 
single JVM.  Just set mapred.job.tracker=local.
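For example (the job class and the local-filesystem setting here are illustrative):

JobConf conf = new JobConf(MyJob.class);      // MyJob is a hypothetical job class
conf.set("mapred.job.tracker", "local");      // LocalJobRunner: everything in this JVM
conf.set("fs.default.name", "file:///");      // typically paired with the local filesystem
// ... set input/output paths, mapper, reducer ...
JobClient.runJob(conf);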


Doug

Ryan LeCompte wrote:

I see... so there really isn't a way for me to test a map/reduce
program using a single node without incurring the overhead of
upping/downing JVMs... My input is broken up into 5 text files; is
there a way I could start the job such that it only uses 1 map to
process the whole thing? I guess I'd have to concatenate the files
into 1 file and somehow turn off splitting?

Ryan


On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote:


Beginner's question:

If I have a cluster with a single node that has a max of 1 map/1
reduce, and the job submitted has 50 maps... Then it will process only
1 map at a time. Does that mean that it's spawning 1 new JVM for each
map processed? Or re-using the same JVM when a new map can be
processed?

It creates a new JVM for each task. Devaraj is working on
https://issues.apache.org/jira/browse/HADOOP-249
which will allow the jvms to run multiple tasks sequentially.

-- Owen



Re: Timeouts at reduce stage

2008-09-04 Thread Doug Cutting

Jason Venner wrote:
We have modified the /main/ that launches the children of the task 
tracker to explicitly exit in its finally block. That helps substantially.


Have you submitted this as a patch?

Doug


Re: Are lines broken in dfs and/or in InputSplit

2008-08-07 Thread Doug Cutting

Kevin wrote:

Yes, I have looked at the block files and it matches what you said. I
am just wondering if there is some property or flag that would turn
this feature on, if it exists.


No.  If you required this then you'd need to pad your data, but I'm not 
sure why you'd ever require it.  Running off the end of a block in 
mapreduce makes for a small amount of non-local i/o, but it's generally 
insignificant.


Doug


Re: Confusing NameNodeFailover page in Hadoop Wiki

2008-08-06 Thread Doug Cutting

Konstantin Shvachko wrote:

Imho we either need to correct it or remove.


+1

Doug


Re: map reduce map reduce?

2008-07-30 Thread Doug Cutting

Elia Mazzawi wrote:

is it possible to run a map then reduce then a map then a reduce.
its really 2 jobs, but i don't want to store the intermediate results. 
so can a hadoop job do more than one map/reduce?


This has been discussed several times before.  The problem is that 
temporary data is not replicated, and hence does not survive node 
failure.  So the system would need to determine which tasks must be 
re-executed when a node fails, which can get complicated.  Back-tracking 
re-execution would also affect performance significantly.  In the worst 
case, a long-running chained job might never complete if the rate of 
node failure is higher than the length of the job.  Running things as 
multiple jobs thus both simplifies the system's failure management and 
provides periodic, durable checkpoints.


Related useful features are:
 - One can run map without reduce by setting num.reduce.tasks=0. 
Outputs are then written directly by map tasks.
 - One can reduce the replication level of one's intermediate output, 
by setting dfs.replication in a job.
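In code, on the job's JobConf (a sketch; the replication value is arbitrary):

conf.setNumReduceTasks(0);              // map-only: map output goes straight to the output format
conf.setInt("dfs.replication", 2);      // lower replication for this job's output files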


Doug


Re: Namenode Exceptions with S3

2008-07-17 Thread Doug Cutting

Tom White wrote:

You can allow S3 as the default FS, it's just that then you can't run
HDFS at all in this case. You would only do this if you don't want to
use HDFS at all, for example, if you were running a MapReduce job
which read from S3 and wrote to S3.


Can't one work around this by using a different configuration on the 
client than on the namenodes and datanodes?  The client should be able 
to set fs.default.name to an s3: uri, while the namenode and datanode 
must have it set to an hdfs: uri, no?


Would it be useful to add command-line options to namenode and datanode 
that override the configuration, so that one could start non-default 
HDFS daemons?



It might be less confusing if the HDFS daemons didn't use
fs.default.name to define the namenode host and port. Just like
mapred.job.tracker defines the host and port for the jobtracker,
dfs.namenode.address (or similar) could define the namenode. Would
this be a good change to make?


Probably.  For back-compatibility we could leave it empty by default, 
deferring to fs.default.name; only if folks specify a non-empty 
dfs.namenode.address would it be used.


Doug


Re: Getting stats of running job from within job

2008-07-03 Thread Doug Cutting

Nathan Marz wrote:
Is there a way to get stats of the currently running job 
programmatically?


This should probably be an FAQ.  In your Mapper or Reducer's configure 
implementation, you can get a handle on the running job with:


RunningJob running =
  new JobClient(job).getJob(job.get("mapred.job.id"));

Doug


Re: When is Hadoop 0.18 release scheduled ?

2008-06-27 Thread Doug Cutting

Tarandeep Singh wrote:

When is Hadoop 0.18 release scheduled ? This link has a date of 6 June :-/
http://issues.apache.org/jira/browse/HADOOP/fixforversion/12312972


The release date is initially set to the feature freeze date.  It's 
updated when all of the blockers are fixed and an actual release is 
made.  0.18.0 was branched a few weeks ago, but still has a few open 
blockers.  We'll release it when those are all resolved and no new 
blockers have arisen, which should be in the next few weeks.


Doug


Re: How Mappers function and solultion for my input file problem?

2008-06-26 Thread Doug Cutting

Ted Dunning wrote:

The map task is not multi-threaded  [ ... ]


Unless you specify a multi-threaded MapRunnable...

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.html

Doug


Re: client connect as different username?

2008-06-12 Thread Doug Cutting

Chris Collins wrote:
For instance, all it requires for, say, a Mac user with a login of bob 
to access things under /bob is for me to go in as the superuser and do 
something like:


hadoop dfs -mkdir /bob
hadoop dfs -chown bob /bob

where bob literally doesn't exist on the HDFS box and was not mentioned 
prior to those two commands.


This should actually be "/user/bob", since that's the home directory for 
"bob" in HDFS.  I agree that mention of this this would make a good 
addition to the documentation.


Doug


Re: client connect as different username?

2008-06-11 Thread Doug Cutting

Chris Collins wrote:
You are referring to creating a directory in HDFS?  Because if I am user 
chris and HDFS only has user foo, then I can't create a directory 
because I don't have perms; in fact I can't even connect.


Today, users and groups are declared by the client.  The namenode only 
records and checks against user and group names provided by the client. 
 So if someone named "foo" writes a file, then that file is owned by 
someone named "foo" and anyone named "foo" is the owner of that file. 
No "foo" account need exist on the namenode.


The one (important) exception is the "superuser".  Whatever user name 
starts the namenode is the superuser for that filesystem.  And if "/" is 
not world writable, a new filesystem will not contain a home directory 
(or anywhere else) writable by other users.  So, in a multiuser Hadoop 
installation, the superuser needs to create home directories and project 
directories for other users and set their protections accordingly before 
other users can do anything.  Perhaps this is what you've run into?


Doug


Re: Stackoverflow

2008-06-04 Thread Doug Cutting

Andreas Kostyrka wrote:

java.lang.StackOverflowError
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:82)


Quicksort is known to cause stack overflows when sorting large, 
already-sorted data.  Could that be the issue here?


http://safari.oreilly.com/0201361205/ch07lev1sec3
http://www.finesse.demon.co.uk/steven/sorting.html#quicksort
http://www.seeingwithc.org/topic2html.html

Doug



Re: FileSystem.listStatus() on S3

2008-05-29 Thread Doug Cutting

This is discussed in:

https://issues.apache.org/jira/browse/HADOOP-3095

If this gets fixed in the next week it will make it into 0.18.

Doug

Kyle Sampson wrote:
We're using Hadoop 0.17 with S3 as the filesystem.  We've created a 
custom InputFormat for our data.  One of the things it needs to do is on 
InputFormat.getSplits() list all of the files and directories under a 
certain path, and there may be thousands of entries in there.  It's 
using FileSystem.listStatus() to get these paths.  With S3, this is 
turning out to be extraordinarily slow with directories that contain on 
the order of thousands of subdirectories and files.


Looking into it a bit, it seems listStatus() is making a call to S3 for 
every subdirectory or file found to get extra file status information.  
It seems there used to be a listPaths() method that would just get the 
paths, but that's been deprecated and removed.  Is there any way 
currently to get just a list of paths without status information?


Kyle Sampson
[EMAIL PROTECTED]







Re: splitting of big files?

2008-05-29 Thread Doug Cutting

Erik Paulson wrote:

When reading from HDFS, how big are the network read requests, and what
controls that? Or, more concretely, if I store files using 64Meg blocks
in HDFS and run the simple word count example, and I get the default of
one FileSplit/Map task per 64 meg block, how many bytes into the second 64meg
block will a mapper read before it first passes a buffer up to the record
reader to see if it has found an end-of-line?


This is controlled by io.file.buffer.size, which is 4k by default.

Doug


Re: Remote Job Submission

2008-05-23 Thread Doug Cutting

Ted Dunning wrote:

- in order to submit the job, I think you only need to see the job-tracker.
Somebody should correct me if I am wrong.


No, you also need to be able to write the job.xml, job.jar, and 
job.split into HDFS.  Someday perhaps we'll pass these via RPC to the 
jobtracker and have it store them in HDFS, but currently JobClient 
assumes that the submitter has access to HDFS.


Doug


Re: Block re-balancing speed/slowness

2008-05-12 Thread Doug Cutting

Otis Gospodnetic wrote:

10 GB in 3 h doesn't that seem slow?


Have you played with dfs.balance.bandwidthPerSec?  It defaults to 
1MB/sec per datanode.  That would be about 10GB in 3 hours.


Doug



Re: newbie how to get url paths of files in HDFS

2008-05-08 Thread Doug Cutting

Ted Dunning wrote:

Take the fully qualified HDFS path that looks like this:

hdfs://namenode-host-name:port/file-path

And transform it into this:

http://namenode-host-name:web-interface-port/data/file-path

The web-interface-port is 50070 by default.  This will allow you to read HDFS 
files via HTTP.


Also, starting in release 0.18.0, Java programs can use "hdfs:" URLs. 
For example, one can create a URLClassLoader for a jar stored in HDFS.
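A sketch of that, using the URL stream handler factory that Hadoop provides for this (the jar path and class name are hypothetical):

import java.net.URL;
import java.net.URLClassLoader;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

public class HdfsClassLoaderSketch {
  public static void main(String[] args) throws Exception {
    // Teach java.net.URL about hdfs: (may only be done once per JVM).
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());

    URLClassLoader loader = new URLClassLoader(
        new URL[] { new URL("hdfs://namenode:8020/lib/my-udfs.jar") });  // hypothetical jar
    Class<?> c = loader.loadClass("com.example.MyUdf");                  // hypothetical class
  }
}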


Doug


Re: where is the documentation for MiniDFSCluster

2008-05-05 Thread Doug Cutting

Maneesha Jain wrote:

I'm looking for any documentation or javadoc for MiniDFSCluster and have not
been able to find it anywhere.

Can someone please point me to it.


http://svn.apache.org/repos/asf/hadoop/core/trunk/src/test/org/apache/hadoop/dfs/MiniDFSCluster.java

This is part of the test code, not a public, end-user API, so we do not 
publish documentation for it.


Doug


Re: Master Heap Size and Master Startup Time vs. Number of Blocks

2008-05-02 Thread Doug Cutting

Cagdas Gerede wrote:

We will have 5 million files each having 20 blocks of 2MB. With the minimum
replication of 3, we would have 300 million blocks.
300 million blocks would store 600TB. At ~10TB/node, this means a 60 node
system.

Do you think these numbers are suitable for Hadoop DFS?


Why are you using such small blocks?  A larger block size will decrease 
the strain on Hadoop, but perhaps you have reasons?


Doug


Re: HDFS: Good practices for Number of Blocks per Datanode

2008-05-02 Thread Doug Cutting

Cagdas Gerede wrote:

For a system with 60 million blocks, we can have 3 datanodes with 20 million
blocks each, or we can have 60 datanodes with 1 million blocks each. In
either case, would there be performance implications or would they behave
the same way?


If you're using mapreduce, then you want your computations to run on 
nodes where the data is local.  The most cost-effective way to buy CPUs 
is generally in 2-8 core boxes that hold 2-4 hard drives, and this also 
generally gives good i/o performance.  In theory, boxes with 64 CPUs and 
64 drives each will perform similarly to 16 times as many boxes, each 
with 4 CPUs and 4 drives, but the former is both more expensive, and, 
when a box fails, you take a bigger hit.  Also, with more boxes, you 
generally get more network interfaces and hence more aggregate 
bandwidth, assuming you have a good switch.


Doug


Re: Master Heap Size and Master Startup Time vs. Number of Blocks

2008-05-02 Thread Doug Cutting

Cagdas Gerede wrote:

In the system I am working, we have 6 million blocks total and the namenode
heap size is about 600 MB and it takes about 5 minutes for namenode to leave
the safemode.


How big are your files?  Are they several blocks on average?  Hadoop 
is not designed for small files, but rather for larger files.  An 
Archive system is currently being designed to help with this.


https://issues.apache.org/jira/browse/HADOOP-3307


I try to estimate what would be the heap size if we have 100 - 150 million
blocks, and what would be the amount of time for namenode to leave the
safemode.


At ~100M per block, 100M blocks would store 10PB.  At ~1TB/node, this 
means a ~10,000 node system, larger than Hadoop currently supports well 
(for this and other reasons).


If your files are generally large, you can increase your block size to 
250MB to decrease the number of blocks in the system.


Doug



Re: Block reports: memory vs. file system, and Dividing offerService into 2 threads

2008-04-30 Thread Doug Cutting

dhruba Borthakur wrote:

My current thinking is that "block report processing" should compare the
blkxxx files on disk with the data structure in the Datanode memory. If
and only if there is some discrepancy between these two, then a block
report be sent to the Namenode. If we do this, then we will practically
get rid of 99% of block reports.


Doesn't this assume that the namenode and datanode are 100% in sync? 
Another purpose of block reports is to make sure that the namenode and 
datanode agree, since failed RPCs, etc. might have permitted them to 
slip out of sync.  Or are we now confident that these are never out of 
sync?  Perhaps we should start logging whenever a block report surprises?


Long ago we talked of implementing partial, incremental block reports. 
We'd divide blockid space into 64 sections.  The datanode would ask the 
namenode for the hash of its block ids in a section.  Full block lists 
would then only be sent when the hash differs.  Both sides would 
maintain hashes of all sections in memory.  Then, instead of making a 
block report every hour, we'd make a 1/64 block id check every minute.


Doug


Re: Best practices for handling many small files

2008-04-28 Thread Doug Cutting

Joydeep Sen Sarma wrote:

There seems to be two problems with small files:
1. namenode overhead. (3307 seems like _a_ solution)
2. map-reduce processing overhead and locality 


It's not clear from 3307 description, how the archives interface with
map-reduce. How are the splits done? Will they solve problem #2?


Yes, I think 3307 will address (2).  Many small files will be packed 
into fewer larger files, each file typically substantially larger than a 
block.  A splitter can read the index files and then use 
MultiFileInputFormat, so that each split could contain files that are 
contained almost entirely in a single block.


Good MapReduce performance is a requirement for the design of 3307.

Doug


Re: hadoop and deprecation

2008-04-24 Thread Doug Cutting

Karl Wettin wrote:

When are deprecated methods removed from the API? At every new minor release?


http://wiki.apache.org/hadoop/Roadmap

Note the remark: "Prior to 1.0, minor releases follow the rules for 
major releases, except they are still made every few months."


So, since we're still pre-1.0, we try to remove deprecated features in 
the next minor release after they're deprecated.  We don't always manage 
to do this.  Some deprecated features have survived a long time, 
especially when they're widely used internally.


When deprecated code is removed the change is described in the 
"incompatible" section of the release notes.


Doug


Re: submitting map-reduce jobs without creating jar file ?

2008-04-23 Thread Doug Cutting

Ted Dunning wrote:

I haven't distributed it formally yet.

If you would like a tarball, I would be happy to send it.


Can you attach it to a Jira issue?  Then we can target it for a contrib 
module or somesuch.


Doug


Re: Using ArrayWritable of type IntWritable

2008-04-21 Thread Doug Cutting

CloudyEye wrote:

What else do I have to override in "ArrayWritable" to get the IntWritable
values written to the output files by the reducers?


public String toString();
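That is, something along these lines; a minimal subclass where the join character is up to you:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayWritable extends ArrayWritable {
  public IntArrayWritable() {
    super(IntWritable.class);                 // required no-arg constructor for Writables
  }

  @Override
  public String toString() {
    // Text output formats call toString(); ArrayWritable does not override it by default.
    StringBuilder sb = new StringBuilder();
    for (Writable w : get()) {
      if (sb.length() > 0) sb.append('\t');
      sb.append(((IntWritable) w).get());
    }
    return sb.toString();
  }
}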

Doug


Re: jar files on NFS instead of DistributedCache

2008-04-18 Thread Doug Cutting

Mikhail Bautin wrote:

Specifically, I just need a way to alter the child JVM's classpath via
JobConf, without having the framework copy anything in and out of HDFS,
because all my files are already accessible from all nodes.  I see how to do
that by adding a couple of lines to TaskRunner's run() method, e.g.:

  classPath.append(sep);
  classPath.append(conf.get("mapred.additional.classpath"));

or something similar.  Is there already such a feature or should I just go
ahead and implement it?


The child inherits the parent's classpath too.  So you could instead add 
 a directory to the classpath of your tasktrackers when you start them 
by adding it to HADOOP_CLASSPATH in conf/hadoop-env.sh.  Then it's 
visible to all jobs, which may or may not be of concern.


Doug


Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Doug Cutting

Ian Tegebo wrote:

The wiki has been down for more than a day, any ETA?


http://monitoring.apache.org/status/ is the place to look.  It currently 
confirms the outage, but does not give an ETA.



I was going to search the
archives for the status, but I'm getting 403's for each of the Archive links on
the mailing list page:

http://hadoop.apache.org/core/mailing_lists.html


Those archive links do seem broken.  I think that's perhaps due to the 
same outage.  However these are available from:


http://mail-archives.apache.org/mod_mbox/
http://www.mail-archive.com/
http://www.nabble.com/Hadoop-f17066.html
http://markmail.org/search/?q=hadoop

Etc.


My original question was about specifying MaxMapTaskFailuresPercent as a job
conf parameter on the command line for streaming jobs.  Is there a conf setting
like the following?

mapred.taskfailure.percent

My use case is parsing log files in DFS whose replication is
necessarily set to 1; I'm
seeing block retrieval failures that kill my job (I don't care if ~10% fail).


http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.max.tracker.failures



<property>
  <name>mapred.max.tracker.failures</name>
  <value>4</value>
  <description>The number of task-failures on a tasktracker of a given
   job after which new tasks of that job aren't assigned to it.
  </description>
</property>


Doug


Re: Performance / cluster scaling question

2008-03-28 Thread Doug Cutting

Doug Cutting wrote:
Seems like we should force things onto the same availability zone by 
default, now that this is available.  Patch, anyone?


It's already there!  I just hadn't noticed.

https://issues.apache.org/jira/browse/HADOOP-2410

Sorry for missing this, Chris!

Doug


Re: Performance / cluster scaling question

2008-03-28 Thread Doug Cutting

Chris K Wensel wrote:
FYI, Just ran a 50 node cluster using one of the new kernels for Fedora 
with all nodes forced onto the same 'availability zone' and there were 
no timeouts or failed writes.


Seems like we should force things onto the same availability zone by 
default, now that this is available.  Patch, anyone?


Doug


Re: Do multiple small files share one block?

2008-03-27 Thread Doug Cutting

Robert Krüger wrote:
this seems like an FAQ but I didn't explicitly see it in the docs: Is 
the minimum size a file occupies on HDFS controlled by the block size, 
i.e. would using the default block size of 64 MB result in consumption of 
64 MB if I stored a file of 1 byte?


No.  The last block in a file is only as long as it needs to be.

Doug


Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Doug Cutting

Chang Hu wrote:
Code below, also attached.  I put this together from the word count 
example. 


The problem is with your combiner.  When a combiner is specified, it 
generates the final map output, since combination is a map-side 
operation.  Your combiner takes the key/value pairs generated by your 
mapper and generates output of a different value class, which would not be 
acceptable to your reducer, which accepts the declared map output classes as 
input.  Does that make sense?


Doug
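
To illustrate the constraint (the classes here are made up, not taken from the
original code), the job setup has to line up like this:

  // Old mapred API; conf is the job's JobConf.  Whatever the combiner emits
  // is what the reduce phase receives, so the combiner's output classes must
  // match the declared map output classes.
  conf.setMapOutputKeyClass(Text.class);
  conf.setMapOutputValueClass(IntWritable.class);
  conf.setCombinerClass(Sum.class);  // a Reducer<Text, IntWritable, Text, IntWritable>
  conf.setReducerClass(Sum.class);
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);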


Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Doug Cutting
Can you produce a simple, standalone example program that fails in this 
way, and post it to the list?  Thanks!


Doug


Re: using a set of MapFiles - getting the right partition

2008-03-20 Thread Doug Cutting

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V)

MapFileOutputFormat#getEntry() does this.

Use MapFileOutputFormat#getReaders() to create the readers parameter.

Doug
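
In code, the lookup is roughly as follows (the output path, key/value classes
and partitioner are placeholders; the partitioner must be the same class the
job used):

  // Classes from org.apache.hadoop.fs, .io, .mapred and .mapred.lib.
  FileSystem fs = FileSystem.get(conf);
  MapFile.Reader[] readers =
      MapFileOutputFormat.getReaders(fs, new Path("outdir"), conf);
  Partitioner partitioner = new HashPartitioner();  // same class the job used
  Text key = new Text("someKey");
  Text value = new Text();
  Writable entry = MapFileOutputFormat.getEntry(readers, partitioner, key, value);
  // entry (and value) now hold the record for "someKey"; entry is null if absent.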

Chris Dyer wrote:

Hi all--
I would like to have a reducer generate a MapFile so that in later
processes I can look up the values associated with a few keys without
processing an entire sequence file.  However, if I have N reducers, I
will generate N different map files, so to pick the right map file I
will need to use the same partitioner as was used when partitioning
the keys to reducers (the reducer I have running emits one value for
each key it receives and no others).  Should this be done manually, ie
something like readers[partioner.getPartition(...)] or is there
another recommended method?

Eventually, I'm going to migrate to using HBase to store the key/value
pairs (since I'd to take advantage of HBase's ability to cache common
pairs in memory for faster retrieval), but I'm interested in seeing
what the performance is like just using MapFiles.

Thanks,
Chris




Re: MapFile and MapFileOutputFormat

2008-03-20 Thread Doug Cutting

Rong-en Fan wrote:

I have two questions regarding the mapfile in hadoop/hdfs. First, when using
MapFileOutputFormat as reducer's output, is there any way to change
the index interval (i.e., able to call setIndexInterval() on the
output MapFile)?


Not at present.  It would probably be good to change MapFile to get this 
value from the Configuration.  A static method could be added, 
MapFile#setIndexInterval(Configuration conf, int interval), that sets 
"io.mapfile.index.interval", and the MapFile constructor could read this 
property from the Configuration.  One could then use the static method 
to set this on jobs.
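
A sketch of what that could look like (hypothetical; this helper did not exist
at the time):

  // Static helper on MapFile:
  public static void setIndexInterval(Configuration conf, int interval) {
    conf.setInt("io.mapfile.index.interval", interval);
  }

  // ...and the writer's constructor would read it back:
  this.indexInterval = conf.getInt("io.mapfile.index.interval", 128);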


If you need this, please file an issue in Jira.  If possible, include a 
patch too.


http://wiki.apache.org/hadoop/HowToContribute


Second, is it possible to tell what is the position in data file for a given
key, assuming index interval is 1 and # of keys are small?


One could read the "index" file explicitly.  It's just a SequenceFile, 
listing keys and positions in the "data" file.  But why would you set 
the index interval to 1?  And why do you need to know the position?


Doug
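
For the second question, a rough sketch of reading the index file directly
(the path and key class are guesses):

  // Classes from org.apache.hadoop.io and org.apache.hadoop.fs.
  SequenceFile.Reader index =
      new SequenceFile.Reader(fs, new Path("mymapfile/index"), conf);
  Text key = new Text();                       // the MapFile's key class
  LongWritable position = new LongWritable();  // byte offset into the "data" file
  while (index.next(key, position)) {
    System.out.println(key + "\t" + position.get());
  }
  index.close();

Note that with the default index interval only every 128th key appears in the
index file, which is why the question assumes an interval of 1.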


Re: Partitioning reduce output by date

2008-03-19 Thread Doug Cutting

Otis Gospodnetic wrote:

That "numPartitions" corresponds to the number of reduce tasks.  What I need is 
partitioning that corresponds to the number of unique dates (yyyy-mm-dd) processed by the 
Mapper and not the number of reduce tasks.  I don't know the number of distinct dates in 
the input ahead of time, though, so I cannot just specify the same number of reduces.

I *can* get the number of unique dates by keeping track of dates in map().  I 
was going to take this approach and use this number in the getPartition() 
method, but apparently getPartition(...) is called as each input row is 
processed by map() call.  This causes a problem for me, as I know the total 
number of unique dates only after *all* of the input is processed by map().


The number of partitions is indeed the number of reduces.  If you were 
to compute it during map, then each map might generate a different 
number.  Each map must partition into the same space, so that all 
partition 0 data can go to one reduce, partition 1 to another, and so on.


I think Ted pointed you in the right direction: your Partitioner should 
partition by the hash of the date, then your OutputFormat should start 
writing a new file each time the date changes.  That will give you a 
unique file per date.


Doug
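
A sketch of such a partitioner (old mapred API; the key layout, with the date
as the first ten characters, is an assumption):

  public static class DatePartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text key, Text value, int numPartitions) {
      // Assumes keys begin with a yyyy-mm-dd date.
      String date = key.toString().substring(0, 10);
      return (date.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

The custom OutputFormat that rolls to a new file whenever the date changes is
not shown here.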


Re: Multiple Output Value Classes

2008-03-17 Thread Doug Cutting

Stu Hood wrote:

But I'm trying to _output_ multiple different value classes from a Mapper, and 
not having any luck.


You can wrap things in ObjectWritable.  When writing, this records the 
class name with each instance, then, when reading, constructs an 
appropriate instance and reads it.  It can wrap Writable, String, 
primitive types, and arrays of these.


Doug
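
A sketch of how that looks in a mapper and reducer (the wrapped types and
values are invented):

  // In the mapper, with conf.setMapOutputValueClass(ObjectWritable.class):
  output.collect(key, new ObjectWritable(new IntWritable(42)));
  output.collect(key, new ObjectWritable(new Text("forty-two")));

  // In the reducer, unwrap each value and dispatch on its actual type:
  while (values.hasNext()) {
    Object inner = values.next().get();
    if (inner instanceof IntWritable) {
      // handle the numeric case
    } else if (inner instanceof Text) {
      // handle the text case
    }
  }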


Re: Separate data-nodes from worker-nodes

2008-03-14 Thread Doug Cutting

Andrey Pankov wrote:
It's a little bit expensive to have a big cluster running for a long 
period, especially if you use EC2. So, as possible solution, we can 
start additional nodes and include them into cluster before running job, 
and then, after finishing, kill unused nodes.


As Ted has indicated, that should work.  It won't be as fast as if you 
keep the entire cluster running the whole time, but it will be much cheaper.


An alternative is to store your persistent data in S3.  Then you can 
shut down your cluster altogether when you're not computing.  Your 
startup time each day will be slower, since reading from S3 is slower 
than reading from HDFS, so this may or may not be practical for you.


Doug
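
A sketch of that setup (the bucket name, credentials and job class are
placeholders):

  # hadoop-site.xml needs fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey
  # set; job input and output paths can then point straight at the bucket:
  bin/hadoop jar myjob.jar MyJob s3://mybucket/input s3://mybucket/output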


Re: What's the best way to get to a single key?

2008-03-10 Thread Doug Cutting

Xavier Stevens wrote:

Thanks for everything so far.  It has been really helpful.  I have one
more question.  Is there a way to merge MapFile index/data files?


No.

To append text files you can use 'bin/hadoop fs -getmerge'.

To merge sorted SequenceFiles (like MapFile/index files) you can use:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.Sorter.html#merge(org.apache.hadoop.fs.Path[],%20org.apache.hadoop.fs.Path,%20boolean)

But this doesn't generate a MapFile.

Why is a single file preferable?

Doug
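
For example, to concatenate text outputs into one local file (paths are
placeholders):

  bin/hadoop fs -getmerge outdir merged.txt

and a rough sketch of the SequenceFile merge in Java, with classes and paths
again as placeholders and the trailing boolean passed as false on the
assumption that it is the delete-inputs flag of the overload linked above:

  SequenceFile.Sorter sorter =
      new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
  sorter.merge(new Path[] { new Path("a/part-00000"), new Path("b/part-00000") },
               new Path("merged"), false);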


Re: What's the best way to get to a single key?

2008-03-04 Thread Doug Cutting

Xavier Stevens wrote:

Is there a way to do this when your input data is using SequenceFile
compression?


Yes.  A MapFile is simply a directory containing two SequenceFiles named 
"data" and "index".  MapFileOutputFormat uses the same compression 
parameters as SequenceFileOutputFormat.  SequenceFileInputFormat 
recognizes MapFiles and reads the "data" file.  So you should be able to 
just switch from specifying SequenceFileOutputFormat to 
MapFileOutputFormat in your jobs and everything should work the same 
except you'll have index files that permit random access.


Doug
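
Concretely, the switch is just the output format class in the job setup (a
sketch):

  // Was: conf.setOutputFormat(SequenceFileOutputFormat.class);
  conf.setOutputFormat(MapFileOutputFormat.class);
  // Existing SequenceFile compression settings continue to apply; the reducers
  // now write MapFile directories ("data" + "index") instead of bare files.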


Re: What's the best way to get to a single key?

2008-03-03 Thread Doug Cutting

Use MapFileOutputFormat to write your data, then call:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V)

The documentation is pretty sparse, but the intent is that you open a 
MapFile.Reader for each mapreduce output, pass the partitioner used, the 
key, and the value to be read into.


A MapFile maintains an index of keys, so the entire file need not be 
scanned.  If you really only need the value of a single key then you 
might avoid opening all of the output files.  In that case you might use 
the Partitioner and the MapFile API directly.


Doug


Xavier Stevens wrote:

I am curious how others might be solving this problem.  I want to
retrieve a record from HDFS based on its key.  Are there any methods
that can shortcut this type of search to avoid parsing all data until
you find it?  Obviously Hbase would do this as well, but I wanted to
know if there is a way to do it using just Map/Reduce and HDFS.

Thanks,

-Xavier





Re: Add your project or company to the powered by page?

2008-02-28 Thread Doug Cutting

I have added this to the wiki.  Thanks!

Doug

C G wrote:

Here is my contribution to the Hadoop Powered-by page:
   
  Visible Measures Corporation (www.visiblemeasures.com) uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products.  We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences.   Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
   
  Thanks,

  C G
   
  ---

  Christopher Gillett
  Chief Software Architect
  Visible Measures Corporation
  25 Kingston Street, 5th Floor
  Boston, MA  02111
  http://www.visiblemeasures.com



Eric Baldeschwieler <[EMAIL PROTECTED]> wrote:

  Hi Folks,

Let's get the word out that Hadoop is being used and is useful in 
your organizations, ok? Please add yourselves to the Hadoop powered 
by page, or reply to this email with what details you would like to 
add and I'll do it.


http://wiki.apache.org/hadoop/PoweredBy

Thanks!

E14

---
eric14 a.k.a. Eric Baldeschwieler
senior director, grid computing
Yahoo! Inc.




   




Re: Sorting output data on value

2008-02-22 Thread Doug Cutting

Tarandeep Singh wrote:

but isn't the output of reduce step sorted ?


No, the input of reduce is sorted by key.  The output of reduce is 
generally produced as the input arrives, so is generally also sorted by 
key, but reducers can output whatever they like.


Doug


Re: Add your project or company to the powered by page?

2008-02-22 Thread Doug Cutting

I added this to the wiki.

Doug

Jimmy Lin wrote:


University of Maryland
http://www.umiacs.umd.edu/~jimmylin/cloud-computing/index.html

We are one of six universities participating in IBM/Google's academic
cloud computing initiative.  Ongoing research and teaching efforts
include projects in machine translation, language modeling,
bioinformatics, email analysis, and image processing.


Eric Baldeschwieler wrote:

Hi Folks,

Let's get the word out that Hadoop is being used and is useful in your 
organizations, ok?  Please add yourselves to the Hadoop powered by 
page, or reply to this email with what details you would like to add 
and I'll do it.


http://wiki.apache.org/hadoop/PoweredBy

Thanks!

E14

---
eric14 a.k.a. Eric Baldeschwieler
senior director, grid computing
Yahoo!  Inc.









Re: define backwards compatibility

2008-02-21 Thread Doug Cutting

Joydeep Sen Sarma wrote:

i find the confusion over what backwards compatibility means scary - and i am 
really hoping that the outcome of this thread is a clear definition from the 
committers/hadoop-board of what to reasonably expect (or not!) going forward.


The goal is clear: code that compiles and runs warning-free in one 
release should not have to be altered to try the next release.  It 
may generate warnings, and these should be addressed before another 
upgrade is attempted.


Sometimes it is not possible to achieve this.  In these cases 
applications should fail with a clear error message, either at 
compilation or runtime.


In both cases, incompatible changes should be well documented in the 
release notes.


This is described (in part) in http://wiki.apache.org/hadoop/Roadmap

That's the goal.  Implementing and enforcing it is another story.  For 
that we depend on developer and user vigilance.  The current issue seems 
a case of failure to implement the policy rather than a lack of policy.


Doug


Re: Add your project or company to the powered by page?

2008-02-21 Thread Doug Cutting

Dennis Kubes wrote:

 * [http://alpha.search.wikia.com Search Wikia]
  * A project to help develop open source social search tools.  We run a 
125 node hadoop cluster.


Done.

Doug


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Doug Cutting

Peter W. wrote:
one trillion links=(10k million links/10 links per page)=1000 million 
pages=one billion.


In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting

Jason Venner wrote:
Is disk arm contention (seek) a problem in a 6 disk configuration as 
most likely all of the disks would be serving /local/ and /dfs/?


It should not be.  MapReduce I/O is sequential, in chunks large 
enough that seeks should not dominate.


Doug


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting

Jason Venner wrote:
We have 3 types of machines we can get, 2  disk, 6 disk  and 16 disk 
machines. They all have 4 dual core cpus.


The 2 disk machines have about 1 TB, the 6 disks about 3TB and the 16 
disk about 8TB. The 16 disk machines have about 25% slower CPU's than 
the 2/6 disk machines.


We handle a lot of bulky data, and don't think we can fit it all on the 
3TB machines if those are our sole compute/dfs nodes.


Your performance will be better if you buy enough of the 6 disk nodes to 
hold all your data than if you intermix 16 disk nodes.  Are the 16 disk 
nodes considerably cheaper per byte stored than the 6 disk boxes?


From my reading, I conjecture that an ideal configuration would be 1 
local disk per cpu for local data/reducing, and some number of separate 
disks for dfs.

Is this an accurate assessment?


DFS storage is typically local on compute nodes.

Doug


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting
If you're building a cluster from scratch, why not put a medium number 
of disk on all nodes, rather than some with more and some with less? 
That's the optimal configuration for Hadoop, since it best distributes 
data among computing nodes.


Doug

Jason Venner wrote:
We are starting to build larger clusters, and want to better understand 
how to configure the network topology.
Up to now we have just been setting up a private vlan for the small 
clusters.


We have been thinking about the following machine configurations
Compute nodes with a number of spindles and medium disk, that also serve 
DFS
For every 4-8 of the above, one compute node with a large number of 
spindles with a large number of disks, to bulk out the DFS capacity.


We are wondering what the best practices are for network topology in 
clusters that are built out of the above building blocks.

We can readily have 2 or 4 network cards in each node.




Re: Hadoop upgrade wiki page

2008-02-05 Thread Doug Cutting

Marc Harris wrote:

The hadoop upgrade wiki page contains a small typo
http://wiki.apache.org/hadoop/Hadoop_Upgrade .
[ ... ]  I don't have access
to modify it, but someone else might like to.


Anyone can create themselves an account on the wiki and modify any page.

Doug


Re: hadoop: how to find top N frequently occurring words

2008-02-04 Thread Doug Cutting

Ted Dunning wrote:

The question I would like to pose to the community is this:

  What is the best way to proceed with code like this that is not ready for
prime time, but is ready for others to contribute and possibly also use?
Should I follow the Jaql and Cascading course and build a separate
repository and web site or should I try to add this as a contrib package
like streaming?  Or should I just hand out source by hand for a little while
to get feedback?


In part it depends on how actively you think it will be developed and by 
how many developers.


If you think it'll mostly just be you, plus an occasional patch from 
others, then attaching it to a Jira issue with the intent of maintaining 
it in contrib would be reasonable.  After a few patches, we could make 
you a contrib committer so that you can maintain it there.


However if you think it will be actively maintained by a more sizable 
group of developers, then a separate repo with independent release 
cycles could be more workable.  This could be a Hadoop subproject, or a 
project hosted outside Apache.  Apache imposes more overhead on 
projects, especially  on startup, than places like Sourceforge or Google 
Code, but correspondingly provides more guarantees about IP and 
community.  You could start a project outside of Apache and move it in, 
but then it has to go through the Incubator.  Or you could start 
something as a contrib module within Hadoop Core, and, once it reaches a 
critical mass, try to promote it to a separate subproject.


Lots of possibilities...

Doug


Re: Hadoop future?

2008-02-01 Thread Doug Cutting

Lukas Vlcek wrote:

I think you have already heard rumours that Microsoft could buy Yahoo. Does
anybody have any idea how this could specifically impact Hadoop's future?


First, Hadoop is an Apache project.  Y! contributes to it, along with 
others.  Apache projects are designed to be able to survive the 
departure of any single contributor.


Second, while I know no more than you do about the acquisition, my guess 
is that, it if goes through, it will not significantly increase the odds 
that Y! will cease contributing to Hadoop anytime soon.  That's always 
possible, and Hadoop would survive it, but I don't think it is likely. 
Y! has deeply embraced Hadoop, and separating the two would be 
counterproductive.  Y! is more valuable to MSFT with Hadoop than without it.


I do not speak for Y!.

Doug


Re: Low complexity way to write a file to hdfs?

2008-01-31 Thread Doug Cutting

Ted Dunning wrote:

Don't know.

I just built a simple patch against the trunk and cloned the UGI stuff from
the doGet method.

Will that work?


It should.  You'll have to specify the user & groups in the query 
string.  Looking at the code, it looks like this should be something 
like "&ugi=user,group1,group2;".


Doug


Re: rand-sort example

2008-01-30 Thread Doug Cutting

Arun C Murthy wrote:
I guess we need to bump up the default from 200 to 512, what do others 
think?


Perhaps instead we should try to figure out why it now takes more than 
200MB to run IdentityMapper?


Doug


Re: Does the input ALWAYS need to be a FILE type ?

2008-01-25 Thread Doug Cutting

rajarsi wrote:

Is it possible to just pass in a String [InputStream] into hadoop and if so
how ? 


Yes, this is possible.  You will need to implement your own InputFormat.

Doug
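
A very rough sketch of the shape this takes (written against the later,
genericized form of the old mapred API; the class names, the conf key and the
line splitting are all invented for illustration):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.InputFormat;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  /** Feeds a string from the job configuration to the mappers, one line per record. */
  public class StringInputFormat implements InputFormat<LongWritable, Text> {

    /** A split that carries the string itself; needs a no-arg constructor. */
    public static class StringSplit implements InputSplit {
      String data = "";
      public StringSplit() {}
      public StringSplit(String data) { this.data = data; }
      public long getLength() { return data.length(); }
      public String[] getLocations() { return new String[0]; }
      public void write(DataOutput out) throws IOException { Text.writeString(out, data); }
      public void readFields(DataInput in) throws IOException { data = Text.readString(in); }
    }

    public InputSplit[] getSplits(JobConf job, int numSplits) {
      // One split holding the whole string, taken from an invented conf key.
      return new InputSplit[] { new StringSplit(job.get("my.input.string", "")) };
    }

    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      final String[] lines = ((StringSplit) split).data.split("\n");
      return new RecordReader<LongWritable, Text>() {
        private int pos = 0;
        public boolean next(LongWritable key, Text value) {
          if (pos >= lines.length) return false;
          key.set(pos);
          value.set(lines[pos++]);
          return true;
        }
        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return pos; }
        public float getProgress() {
          return lines.length == 0 ? 1.0f : (float) pos / lines.length;
        }
        public void close() {}
      };
    }

    // Present in some releases of the InputFormat interface; harmless otherwise.
    public void validateInput(JobConf job) {}
  }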



Re: hadoop file system browser

2008-01-23 Thread Doug Cutting

Ted Dunning wrote:

Wouldn't it work to just load balance a small farm of webdav servers?  That
would be better from a security point of view as well.


I think one should just run the webdav proxy locally wherever one needs 
webdav-based access, then always connect to localhost.  The namenode 
host and port could be the first component of the path that's exposed to 
webdav clients.


Doug