[
https://issues.apache.org/jira/browse/OOZIE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055213#comment-14055213
]
Robert Kanter commented on OOZIE-1561:
--------------------------------------
Thanks for uploading a [design document|^OozielogHAtechnicaldesigndoc.pdf]
[~bowenzhangusa]. The overall idea sounds fine, but there are a lot of areas
where I think we could run into problems that need to be addressed. In
particular, I'm concerned about the HA stuff, which needs to be more fleshed
out. Here are some specific comments:
# It would be ideal if Oozie's log4j configuration was more standardized (we've
seen complaints about this when users want to change something and it doesn't
work or they aren't "allowed" to); I think that if we're going to search the
HDFS logs for streaming, then we don't have to force 1-hour rollovers, etc. on
the local log, right? In fact, we can probably get rid of the
OozieRollingPolicy and just use standard log4j appenders and configs.
# You say that the information in the writing queue would be replicated to
ZooKeeper in case Oozie goes down. What happens if HDFS goes down? Will Oozie
keep building up this queue until HDFS comes back? This could be a problem
because (a) Oozie will eventually use too much memory for this queue, so we'd
need a max queue size beyond which we'd have to start dropping logs, and (b)
I'm not sure how much info ZK is happy with, but I could see that becoming a
problem too.
# I like the idea of having a separate log file per job (I've wanted us to do
that for a while) to make the log streaming much more efficient and finding
logs for a job easier, but HDFS doesn't like having lots of small files. And
if someone has, say, a couple thousand Oozie jobs per day, the number of small
log files Oozie puts into HDFS can easily become a problem. We may need to use
some kind of archiving file format to combine them or something (I think Hadoop
has a file format for that already?). Perhaps we can have Oozie periodically
group older logs together?
# You say that ZKXLogStreamingService will still continue to stream logs from
servers directly, and only fall back to HDFS when there's a bad Oozie server.
If the logs are in HDFS, shouldn't it always look there for the logs without
talking to the other servers? Also, couldn't there still be missing log
messages if we queue up messages and write them in batches? Also, jobs can be
run on multiple Oozie servers, so they'll all need to be able to write to the
same job log files at different times, and sometimes even at the same time;
we'll need to have (ZK) locks for writing to and reading from the log files,
right?
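
To make the max-queue-size concern in the second comment concrete, here is a minimal sketch of a bounded in-memory buffer that starts dropping the oldest log lines once a cap is reached. The class name BoundedLogQueue, its methods, and the drop-oldest policy are all hypothetical; none of them come from the design doc or from Oozie itself, and a real implementation might prefer to drop the newest lines or to block instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bounded buffer for log lines waiting on a batch write to HDFS. */
class BoundedLogQueue {
    private final ArrayDeque<String> buffer = new ArrayDeque<>();
    private final int maxSize;
    private long dropped = 0;

    BoundedLogQueue(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Adds a line; when the buffer is full, evicts the oldest line and counts it as dropped. */
    synchronized void offer(String line) {
        if (buffer.size() == maxSize) {
            buffer.pollFirst();
            dropped++;
        }
        buffer.addLast(line);
    }

    /** Removes and returns up to maxLines lines, oldest first, for one batch write. */
    synchronized List<String> drainBatch(int maxLines) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxLines && !buffer.isEmpty()) {
            batch.add(buffer.pollFirst());
        }
        return batch;
    }

    /** Number of lines discarded so far because the buffer was full. */
    synchronized long droppedCount() {
        return dropped;
    }

    synchronized int size() {
        return buffer.size();
    }
}
```

With something like this, the writer thread would call drainBatch whenever HDFS is reachable, and droppedCount would give us a way to tell the user that some log messages are missing rather than silently losing them.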
> When using Oozie HA, the logs should also be HA
> -----------------------------------------------
>
> Key: OOZIE-1561
> URL: https://issues.apache.org/jira/browse/OOZIE-1561
> Project: Oozie
> Issue Type: Improvement
> Components: HA
> Affects Versions: trunk
> Reporter: Robert Kanter
> Assignee: Bowen Zhang
> Priority: Critical
> Attachments: OozielogHAtechnicaldesigndoc.pdf
>
>
> Currently, if an Oozie server goes down, the logs from that server become
> unavailable until the server comes back up. In the meantime, the user may or
> may not be aware that log messages could be missing when Oozie streams logs
> to the user.
> We should come up with a way to make the logs HA.
> Some ideas:
> # When rolling the logs, copy them into HDFS; Oozie servers can then read the
> log files directly from HDFS instead of each other
> #- The downside to this is that there will be a window where logs could still
> be missing as they only show up in HDFS after rolling over (default = 1hr)
> and Oozie servers would still have to contact each other for the last hour of
> logs
> #- The upside is that it minimizes the amount of logs that could be missing
> and would be fairly straightforward to implement
> # Log directly to HDFS
> #- The downside is that this may be complicated or tricky to get working
> properly
> #-- This also introduces a strict dependency on HDFS
> #- The upside is that this would completely solve the issue and Oozie servers
> would simply get all logs directly from HDFS
> # Log to ZooKeeper or a database
> #- I think the log files will be too big to do this
> I've assigned this to myself, but if someone wants to tackle this, feel free
> to reassign it. I think idea 2 is the most practical, but I'm also open to
> other ideas on how to do this.