Bryan, could you describe in more detail how your files "rot"?
Do you lose entire files or blocks? How do you detect missing files?
How regular is "regular basis"?
Should we file a JIRA issue for that?

I think Jagadeesh's q#3 was about running a spare namenode on the same
cluster, rather than running overlapping clusters.

On the logging issue: I think we should change the default logging level,
which is INFO at the moment.
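For example, something along these lines in conf/log4j.properties would
raise the threshold (a sketch only; the exact appender setup in the shipped
conf may differ):

    # conf/log4j.properties: raise the default level from INFO to WARN
    log4j.rootLogger=WARN, stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d %p %c: %m%n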

Thanks,
Konstantin


Bryan A. Pendleton wrote:

I'm not sure I agree with Konstantin's assurances on reliability just yet.
I've been running a persistent Hadoop cluster since early this year, and
have repeatedly had data "rot". Things have gotten better lately, but it's
not clear that all of the necessary testing and hardening has been done to
put this into a live production environment, especially without additional
backups in other storage systems. A small portion of my files still "rot"
on a regular basis.

To 1: Yes, with a replication of 3, any 2 nodes can fail with no loss of
data. If the failures happen slowly enough, even more failures can be
tolerated, provided there is time and space for the re-replications to
take place. I once had 5 of my ~30 nodes fail over a weekend, with no data
becoming unavailable to ongoing processes. Note, though, that there is
currently no active verification that blocks reported as present can
actually be read from disk; corruption only surfaces when something tries
to read the data.
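For reference, the cluster-wide default replication is a config setting.
This is a sketch against hadoop-site.xml, assuming the dfs.replication
property name used by current builds; individual files can still be
created with their own replication:

    <!-- hadoop-site.xml: default replication for newly created files -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>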

To 2: Not at this time. However, if you temporarily turn replication *up*
for a large subset of your data, wait for things to become quiescent, and
then turn replication back down, you will effectively redistribute your
free space. Be aware that things might be "wonky" during heavy
up-replication: I've had map jobs fail because the replication load made
normal block accesses slow enough that tasks started being lost.
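The up/down switch can be done per file from Java. This is a sketch only,
assuming your build's FileSystem API exposes setReplication; the path and
factor are placeholders:

    // Sketch: set a file's replication, e.g. up to 5, then later back to 3.
    // Assumes FileSystem.setReplication(Path, short) is in your build.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetRep {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // args[0] = file path, args[1] = new replication factor
        boolean ok = fs.setReplication(new Path(args[0]),
                                       Short.parseShort(args[1]));
        System.out.println(args[0] + " -> " + args[1] + ": " + ok);
        fs.close();
      }
    }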

To 3: Using the current state of the code, you could run overlapping DFS
clusters. For each namenode you run, you'd want to use different storage
directories on the cluster nodes.
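Concretely, each overlapping cluster would get its own config. A sketch,
with a hypothetical hostname and directories, using the dfs.* property
names from the current conf:

    <!-- hadoop-site.xml for a *second* DFS cluster on the same machines -->
    <property>
      <name>fs.default.name</name>
      <value>namenode2.example.com:9000</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/disk1/dfs2/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/dfs2/data</value>
    </property>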

I don't know whether you're likely to run into namenode performance
problems just yet, though. It is worth noting that, for a very busy
cluster, you'll want to periodically restart the namenode to incorporate
the change log (the edits file) into the disk-stored image; otherwise the
edits log can grow forever. That will probably take a lot longer to become
a problem than running out of space from your over-full logs did, though.
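The restart doesn't need to touch the datanodes. A sketch, assuming the
bin/hadoop-daemon.sh helper that ships with the distribution:

    # Bounce only the namenode; on startup it folds the edits log into
    # the checkpointed image.
    bin/hadoop-daemon.sh stop namenode
    bin/hadoop-daemon.sh start namenode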

On 8/10/06, Jagadeesh <[EMAIL PROTECTED]> wrote:


Dear Konstantin Shvachko,

Thanks for your reply.

In any case, I have decided to try Hadoop for our application, and I
successfully connected 22 nodes; it's working fine so far. I did hit one
issue where the master node generated 5 log files of 125GB each in 5 days,
which crashed the server. I have since changed the properties in log4j and
fixed it.
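In case it helps others, the change was along these lines. This is a
sketch of my log4j.properties; the appender name, file location, and sizes
are just the values I picked:

    # Cap each log file and keep a few rotations instead of growing forever.
    # ${hadoop.log.dir} assumes the startup scripts set that property;
    # substitute a literal path otherwise.
    log4j.rootLogger=INFO, RFA
    log4j.appender.RFA=org.apache.log4j.RollingFileAppender
    log4j.appender.RFA.File=${hadoop.log.dir}/hadoop.log
    log4j.appender.RFA.MaxFileSize=100MB
    log4j.appender.RFA.MaxBackupIndex=5
    log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFA.layout.ConversionPattern=%d %p %c: %m%n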

I have a few more queries and would really appreciate it if you could
answer them.

1. If I set the replication level to 3 or 4, will I be able to access the
files even if one or two slave nodes go down?

2. If one of the slave nodes runs out of disk space, will Hadoop perform
any defragmentation, moving some data blocks onto other nodes?

3. Can I run multiple master nodes with the same set of slaves and then,
by modifying the code, have the master nodes communicate with each other
about the data chunks stored on the slave nodes? That way we could run
multiple master nodes and provide load balancing and clustering. You would
be the right person to suggest an approach, and I can work on it.

Please let me know your thoughts...

Thanks and Best
Jugs

-----Original Message-----
    From: "Konstantin Shvachko"<[EMAIL PROTECTED]>
    Sent: 8/10/06 11:27:13 PM
    To: "[email protected]"<[email protected]>
    Subject: Re: Some queries about stability and reliability

    Hi Jagadeesh,

    >I am very much new to Hadoop and would like to know some details about
    >the reliability and stability. I am developing a flickr kind of
    >application for storing and sharing movies and would like to use Hadoop
    >as my storage backend. I am planning to put in at least 100 nodes and
    >would like to know more about the product. I will appreciate if you
    >could answer some of my queries.
    >
    This is a very interesting application for Hadoop.
    Have you made any progress with the system?

    >1. Is the product matured enough for using in an application like this?
    >
    >
    Yes.

    >2. Has somebody tested it using at least 100 nodes?
    >
    >
    Yes, there are even larger installations.

    >3. Can I have multiple master nodes in Hadoop to do load balancing and
    >fail-overs?
    >
    >
    Not yet.

    >4. What is the maximum number of simultaneous connections possible in
    >Hadoop?
    >
    >
    Hadoop is designed to support, and in practice does support, a high
    volume of simultaneous connections. E.g., on a 100-node cluster an
    extensive map-reduce job can generate 400 concurrent connections.

    Creation time and date are not implemented for DFS files.
    Do you have a good application for ctime?

    Thank you,

    --Konstantin




