Hello EasyBuilders,
I have been considering proposing these topics for some weeks now;
it seems the planets have now aligned, partly thanks to the wave of v1.6 :)
v1.6 is a really good release that meets many expectations.
1) jumbo-toolchain for bio* category now working nicely
First, v1.6 has a couple of interesting little easyconfigs, biodeps{,-extended},
which are meant to be a common dependency list for life science applications;
that in turn makes it possible to chain modules together in a pipeline of
multiple components.
Overall it works well, and you can entertain yourself by co-loading 70+ modules
in one go:
https://github.com/fgeorgatos/easybuild.experimental/blob/master/users/fgeorgatos/HPCBIOS/HPCBIOS_Bioinfo-goolf-1.4.10.eb
Some packages still need a tweak before they can be merged in (ABySS, BLAT,
BLAST, NCBI*, Trinity, TopHat).
IMHO, using it just for building the software in one go is probably of more
practical use.
2) standardization, yeah even more
That said, trivialities like the colliding Boost<->Boost-Python or
biodeps<->biodeps-extended quickly point out that the biodeps business
is really more of a "standardization" effort than an implementation one;
that's why this brainstorm exists in the first place:
http://hpcbios.readthedocs.org/en/latest/HPCBIOS_2013-01.html
Comments on how to update/improve it and make it more generic are very welcome.
This week, pfo made a comment about refreshing the SAMtools version;
I think at some point we would all benefit from upgrading these versions:
http://hpcbios.readthedocs.org/en/latest/HPCBIOS_2012-94.html
and you will find plenty more that we will eventually want to add, such as:
http://www.eaglegenomics.com/2012/04/the-elements-of-bioinformatics/
As you can see, we already implement a good chunk of the latter via EasyBuild!
So, I'd suggest *bioinfo* stakeholders get their act together in some way,
e.g. by running an e-sprint by the end of the year, to update towards some
common targets (i.e. define a standard and the implementation, say, v20131201).
3) obtaining existing and new application sources
As life has it, developers don't always understand what it means to run HPC
operations, and they play tricks with software distribution channels,
ranging from not tagging a git repo (benign) up to flatly swapping tarballs
(annoying):
see https://github.com/hpcugent/easybuild-easyconfigs/pull/374/files
Modifying the release channel mid-flight is an especially big hassle; I spent
a few extra hours understanding why the Rgputools PR arrived broken on
github: :-(
https://github.com/hpcugent/easybuild-easyconfigs/pull/282
In short, let's pool our tarballs together in some way, define preferred
release streams and keep these issues from recurring.
The bioinfo toolchain is especially prone to this, due to its many small tools.
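
One way a tarball pool could notice a swapped tarball is to record a checksum
when a source is first added and compare it on every later fetch. A minimal
sketch in Python, assuming SHA-256 as the digest; the file name and the helper
names (sha256sum, tarball_unchanged) are made up for illustration and are not
part of EasyBuild:

```python
import hashlib
import os
import tempfile


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tarball_unchanged(path, recorded_digest):
    """True if the file still matches the digest recorded when it was pooled."""
    return sha256sum(path) == recorded_digest


if __name__ == "__main__":
    # Hypothetical demo: a stand-in "tarball" that gets re-rolled upstream.
    workdir = tempfile.mkdtemp()
    tarball = os.path.join(workdir, "example-1.0.tar.gz")  # made-up name
    with open(tarball, "wb") as handle:
        handle.write(b"original release")
    recorded = sha256sum(tarball)  # digest stored next to the pooled source
    print(tarball_unchanged(tarball, recorded))

    with open(tarball, "wb") as handle:
        handle.write(b"silently re-rolled release")
    print(tarball_unchanged(tarball, recorded))
```

The recorded digest would live next to the pooled source, so a mismatch shows
up at fetch time rather than as a broken PR later on.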
4) arch-aware/distro-aware organization
It would be nice to poll opinions on the different implementation directions
and come up with some common codebase that is functional for different needs
(easybuild has its own logic for deciding the OS, e.g. to set up CUDA);
e.g. compare $PRACE_ARCH with $BC_CPUTYPE below:
http://www.prace-project.eu/PRACE-Common-Production
https://github.com/Gregor-Mendel-Institute/env_init/blob/master/components.d/99-environment
Recent Lmod work on "preloaded", "immutable" modules may factor this out nicely.
Thanks to pfo/azet for this! The question here is: can we improve it so that
it can be shared, and how?
I really applaud the fact that they made a start and hope more people will
build on this concept!
5) better organize build environment variables
Interactive builds should be a little better confined (or documented)
for the unsuspecting newcomer. This relates both to satisfying PRACE mandates
(or being able to; see the link above) and to resolving issues like:
https://github.com/hpcugent/easybuild-easyblocks/issues/100
The end result is to have some mechanism to prevent the case of loading ictce
and finding yourself building with gcc, as in:
http://my.cdash.org/testDetails.php?test=11714674&build=504277
Currently, getting $MPICC right on the shell is a bit of a surprise exercise
(try `$MPICC --version` etc.), whereas EasyBuild appears to do the right thing
when working in batch mode.
I guess I am missing something; perhaps we just need documentation here.
(OK, does the answer here relate to the stacked variables discussion via Lmod?)
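
To make the "loaded ictce, built with gcc" case concrete, here is a minimal
Python sketch of a sanity check one could run in an interactive shell. It is
only an illustration under assumptions: the toolchain-to-compiler mapping, the
helper names and the string matching on `--version` output are mine, not
something EasyBuild provides, and real banners vary between compiler releases:

```python
import os
import shlex
import subprocess

# Assumed mapping from toolchain name to the compiler family it should use.
EXPECTED_FAMILY = {"ictce": "intel", "goolf": "gcc"}


def compiler_family(version_output):
    """Guess the compiler family from the text printed by `--version`."""
    text = version_output.lower()
    if "intel" in text:
        return "intel"
    if "free software foundation" in text:
        return "gcc"
    return "unknown"


def check_wrapper(toolchain, version_output):
    """Return (ok, family): does the wrapper match the loaded toolchain?"""
    family = compiler_family(version_output)
    return family == EXPECTED_FAMILY.get(toolchain), family


if __name__ == "__main__":
    # Use $MPICC if it is set and non-empty, else fall back to plain mpicc.
    wrapper = os.environ.get("MPICC") or "mpicc"
    try:
        result = subprocess.run(shlex.split(wrapper) + ["--version"],
                                capture_output=True, text=True)
        # "ictce" stands in for whatever toolchain module is loaded.
        print(check_wrapper("ictce", result.stdout + result.stderr))
    except OSError as err:
        print("could not run", wrapper, "->", err)
```

A check along these lines could run right after loading a toolchain module, so
a mismatch is reported before the build starts rather than in a test dashboard.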
6) Implement common HPC policies
This has more or less been going into production this summer, i.e. the idea of
organizing your buildable application sets into groups that correspond to
so-called policies:
https://github.com/fgeorgatos/easybuild.experimental/tree/master/users/fgeorgatos/HPCBIOS
If you go through the URL chain, you will eventually realize that these
implement parts of (see FY06-01, FY06-05, later FY06-19/FY06-04):
http://www.ccac.hpc.mil/consolidated/bc/policy.php
The idea is to be able to go to a given site, try `eb HPCBIOS_*` and get the
world built.
Does anybody here care to give the existing ones a spin and provide feedback?
Would you like to see them included in the default EasyBuild repos?
7) Common list of (bioinformatics?) applications of interest
Last but not least:
So far, I have been somewhat reluctant to create an online spreadsheet
with all the bioinfo applications that people may care about,
yet it would be such a pity to find ourselves (again) in the
funny position of 2-3 centers doing the exact same work in parallel;
this has already happened a few times (e.g. MUSCLE, Rosetta, scons, UDUNITS...).
So, how would you like to handle this to avoid duplication of work?
How many of you care about this at all?
thanks for reading this,
Fotis
--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
| sed 's/better bash/bash better/' # Yelling in a CERN forum