> On Jul 23, 2018, at 4:35 PM, Gav <[email protected]> wrote:
>
> Thanks Allen,
>
> Some of our nodes are only 364GB in total size, so you can see that this is
> an issue.
Ugh.
> For the H0-H12 nodes we are pretty fine currently with 2.4/2.6TB disks -
> therefore the urgency is on the Hadoop nodes H13-H18 and the non-Hadoop
> nodes.
>
> I propose therefore that H0-H12 be trimmed on a monthly basis for mtime +31
> in the workspace, and that H13-H18 plus the remaining nodes with 500GB disks
> or less be trimmed weekly.
>
> Sounds reasonable?
        Disclosure: I’m not really doing much with the Hadoop project anymore,
so someone from that community would need to step forward.
But If I Were King:
        For the small nodes in the Hadoop queue, I’d request they either get
pulled out or put under a ‘Hadoop-small’ label or some similar name. From a
quick pass over the directory structure via Jenkins, everything there is
‘reasonable’ with only one or two outliers; i.e., 400G drives are simply
under-spec’ed for the full workload that the ‘Hadoop’ nodes are expected to
handle these days, and a 7-day purge isn’t going to fix that. Putting JUST the
nightly jobs on them (hadoop qbt, hbase nightly, maybe a handful of other jobs)
would eat plenty of disk space.
        A 7-day limit before the workspace dir goes away is probably reasonable
for the other nodes, though. But it looks to me like there are jobs running on
the non-Hadoop nodes that probably should be in the Hadoop queue (Ambari, HBase,
Ranger, Zookeeper, probably others); vice versa is probably also true. It
might also be worthwhile to bug some of the vendors involved to see if they can
pony up some machines/cash for build server upgrades like Y!/Oath did/does.
        That said, I potentially see some changes that the Apache Yetus project
could make to lessen the disk space load for the projects that use it. I’ll
need to experiment a bit first to be sure, but we’re looking at tens of GB
freed up if my hypotheses are correct. That might be enough to avoid moving
nodes around in the Hadoop queue, but I can’t see that lasting long.
        Jenkins allegedly has the ability to show compressed log files. It
might be worthwhile investigating something along those lines at a global
level: just gzip up every foo.log in the workspace dirs after 24 hours or
something.
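        For the sake of discussion, a minimal sketch of that kind of sweep in
Python; the workspace root path and the 24-hour cutoff are assumptions on my
part, not anything Jenkins or INFRA already defines, and a real job would run
from cron or a Jenkins system task:

#!/usr/bin/env python3
# Sketch only: gzip any *.log under the Jenkins workspace root that has not
# been touched in 24 hours.  Root path and cutoff are assumptions.
import gzip
import os
import shutil
import time

WORKSPACE_ROOT = "/home/jenkins/workspace"   # assumed location
CUTOFF_SECS = 24 * 60 * 60                   # "after 24 hours or something"

now = time.time()
for dirpath, _dirnames, filenames in os.walk(WORKSPACE_ROOT):
    for name in filenames:
        if not name.endswith(".log"):
            continue
        path = os.path.join(dirpath, name)
        try:
            if now - os.path.getmtime(path) < CUTOFF_SECS:
                continue
            # Compress next to the original, then drop the original.
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)
        except OSError:
            # File vanished or is still being written by a job; skip it.
            pass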
        One other thing to keep in mind: the modification time on a directory
only changes when a direct child of that directory changes. There are likely
many jobs whose layout means the top-level workspace directory’s mtime never
gets updated even while the job is writing deeper in the tree. Any sort of
purge job is going to need to be careful not to nuke a directory tree like
that while it is still in use. :)
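        To make that concrete, here is a hedged sketch of a purge pass that
looks at the newest mtime anywhere under each workspace rather than trusting
the top-level directory’s mtime. The root path is an assumption, and the
31-day cutoff just mirrors the mtime +31 proposal above:

#!/usr/bin/env python3
# Sketch only: purge workspaces whose entire tree is older than the cutoff,
# so an active job with a stale top-level mtime does not get nuked.
import os
import shutil
import time

WORKSPACE_ROOT = "/home/jenkins/workspace"   # assumed location
CUTOFF_SECS = 31 * 24 * 60 * 60              # mtime +31, per the proposal


def newest_mtime(root):
    """Return the most recent mtime of anything under root, including root."""
    newest = os.path.getmtime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                newest = max(newest,
                             os.path.getmtime(os.path.join(dirpath, name)))
            except OSError:
                pass  # entry disappeared mid-scan; ignore it
    return newest


now = time.time()
for entry in os.listdir(WORKSPACE_ROOT):
    workspace = os.path.join(WORKSPACE_ROOT, entry)
    if not os.path.isdir(workspace):
        continue
    if now - newest_mtime(workspace) > CUTOFF_SECS:
        # Nothing anywhere under this workspace has changed in 31+ days.
        shutil.rmtree(workspace, ignore_errors=True)

Even then, a workspace could look idle between file writes, so pairing this
with a check that no build is currently running on the node would be safer.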
HTH.