[easybuild] Request for Comments: Bioinformatics packages and not only

Fotis Georgatos Sat, 3 Aug 2013 12:46:31 +0200 (CEST)

Hello EasyBuilders,

I have been considering proposing these topics for some weeks now;
it seems the planets have now aligned, partly thanks to the wave of v1.6 :)
v1.6 is a really good release that meets many expectations.



1) jumbo-toolchain for bio* category now working nicely

First, v1.6 has a couple of interesting little easyconfigs, biodeps{,-extended},
which are meant to be a common dependency list for life science applications;
that in turn makes it possible to chain modules together in a pipeline of
multiple components.

Overall it works well, and you can entertain yourself by co-loading 70+ modules
in one go:
https://github.com/fgeorgatos/easybuild.experimental/blob/master/users/fgeorgatos/HPCBIOS/HPCBIOS_Bioinfo-goolf-1.4.10.eb
Some packages still need a tweak before they can be merged in (ABySS, BLAT,
BLAST, NCBI*, Trinity, TopHat).
IMHO, using it just to build the software in one go is probably of more
practical use.


2) standardization, yeah even more

That said, trivialities like the colliding Boost<->Boost-Python or
biodeps<->biodeps-extended quickly show that the biodeps business
is really more "standardization" work than implementation;
that's why this brainstorm exists in the first place:
http://hpcbios.readthedocs.org/en/latest/HPCBIOS_2013-01.html
Comments on how to update/improve it and make it more generic are very welcome.

This week, pfo made a comment about refreshing the SAMtools version;
I think at some point we would all benefit by upgrading these versions:
http://hpcbios.readthedocs.org/en/latest/HPCBIOS_2012-94.html
and you will find plenty more that we may eventually want to add, such as:
http://www.eaglegenomics.com/2012/04/the-elements-of-bioinformatics/
As you can see, we already implement a good chunk of the latter via EasyBuild!
So, I'd suggest *bioinfo* stakeholders get their act together in some way,
e.g. by running an e-sprint by the end of the year, to update towards some
common targets (i.e. define a standard and the implementation, say, v20131201).


3) obtaining existing and new application sources

As life has it, developers don't always understand what it means to run
HPC operations, and they play tricks with software distribution channels,
ranging from not tagging a git repo (benign) to flatly swapping tarballs
(annoying):
see https://github.com/hpcugent/easybuild-easyconfigs/pull/374/files

Modifying the release channel mid-flight is an especially big hassle; I spent
a few extra hours working out why the Rgputools PR arrived broken on GitHub :-(
https://github.com/hpcugent/easybuild-easyconfigs/pull/282

In short, let's pool our tarballs together in some way,
define preferred release streams and keep these issues from recurring.
The bioinfo toolchain is especially prone to this, due to the many small tools.
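
One cheap safeguard against silently swapped tarballs, sketched here under the
assumption that a known-good SHA-256 digest was recorded when the source was
first fetched. The helper names `sha256_of` and `tarball_unchanged` are
hypothetical and not part of EasyBuild; a shared tarball pool could keep such
digests next to the files.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tarball_unchanged(path, expected_sha256):
    """True if the tarball at `path` still matches the recorded digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

If `tarball_unchanged` returns False for a file that carries the same name and
version as before, upstream has most likely replaced the tarball in place.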


4) arch-aware/distro-aware organization

It would be nice to poll opinions on different implementation directions
and come up with some common codebase, functional for different needs
(EasyBuild has its own logic for determining the OS, e.g. to set up CUDA).

For instance, compare $PRACE_ARCH with $BC_CPUTYPE below:
http://www.prace-project.eu/PRACE-Common-Production
https://github.com/Gregor-Mendel-Institute/env_init/blob/master/components.d/99-environment
Recent Lmod work on "preloaded", "immutable" modules may factor this out nicely.
Thanks to pfo/azet for this! The question here is: can we improve it in order
to share it, and how?
I really applaud the fact that they made a start and hope more will ride on
this concept!
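
One possible shape for such common code, as a minimal sketch: look up whichever
site-specific variable is defined and fall back to what Python itself reports.
The function name `site_architecture` and the lookup order are assumptions made
for illustration only; neither PRACE nor the env_init repository prescribes
them.

```python
import os
import platform

# Site-specific variables named in this mail; the order is an arbitrary choice.
ARCH_VARIABLES = ("PRACE_ARCH", "BC_CPUTYPE")


def site_architecture(env=None):
    """Return (source, value) for the first architecture hint that is set."""
    env = os.environ if env is None else env
    for name in ARCH_VARIABLES:
        value = env.get(name)
        if value:
            return name, value
    # No site variable is set: fall back to the machine type Python reports.
    return "platform.machine", platform.machine()
```

Returning the source alongside the value makes it visible which convention a
given site actually follows.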


5) better organize build environment variables

Interactive builds should be a little better confined (or documented)
for the unsuspecting newcomer. This relates both to satisfying PRACE mandates
(or being able to; see the link above) and to resolving issues like:
https://github.com/hpcugent/easybuild-easyblocks/issues/100
The end goal is a mechanism that prevents the case of loading ictce
and then finding yourself building with gcc, as in:
http://my.cdash.org/testDetails.php?test=11714674&build=504277

Currently, getting $MPICC right in the shell is an exercise in surprises
(try `$MPICC --version` etc.), while EasyBuild appears to do the right thing
when working in batch mode.
I guess I am missing something; perhaps we just need documentation here.
(OK, does the answer here relate to the stacked variables discussion via Lmod?)
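
The `$MPICC --version` check can be scripted as a quick sanity test before an
interactive build. This is a sketch, assuming the MPI wrapper passes
`--version` through to the underlying compiler and that matching on the words
"Intel" or "Free Software Foundation" in the output is good enough; both
function names are hypothetical.

```python
import os
import shlex
import subprocess


def compiler_family(version_text):
    """Guess the compiler family from `--version` output (a heuristic)."""
    text = version_text.lower()
    if "intel" in text:
        return "intel"
    if "free software foundation" in text or "gcc" in text:
        return "gcc"
    return "unknown"


def mpicc_family(env=None):
    """Run `$MPICC --version` and classify whatever compiler answers."""
    env = os.environ if env is None else env
    mpicc = env.get("MPICC", "")
    if not mpicc.strip():
        return "unset"
    # $MPICC may carry extra flags, so split it the way a shell would.
    command = shlex.split(mpicc) + ["--version"]
    output = subprocess.check_output(command, stderr=subprocess.STDOUT)
    return compiler_family(output.decode("utf-8", "replace"))
```

With ictce loaded one would expect "intel"; getting "gcc" instead reproduces
the surprise described above.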


6) Implement common HPC policies

This is more or less going into production this summer, i.e. the idea of
organizing your sets of buildable applications into groups corresponding to
so-called policies:
https://github.com/fgeorgatos/easybuild.experimental/tree/master/users/fgeorgatos/HPCBIOS
If you go through the URL chain, you will eventually realize that these 
implement parts of:
http://www.ccac.hpc.mil/consolidated/bc/policy.php # see FY06-01, FY06-05, 
later FY06-19/FY06-04
The idea is to be able to go to a given site, run `eb HPCBIOS_*` and get the
world built.
Would anybody here care to give the existing ones a spin and provide feedback?
Would you like to see them included in the default EasyBuild repos?
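
As a rough sketch of the idea, assuming the policy easyconfigs sit in one
directory and follow the `HPCBIOS_*.eb` naming seen in the repository above, a
script could expand the pattern and hand each file to `eb`. The helper
`policy_build_commands` is hypothetical; it only assembles the command lines
and leaves running them to the caller.

```python
import glob
import os


def policy_build_commands(directory, pattern="HPCBIOS_*.eb"):
    """Return one `eb <easyconfig>` command per policy file, sorted by name."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    # Each command is an argument list, ready for subprocess.call().
    return [["eb", path] for path in paths]
```

Keeping one command per policy file means that a failing policy can be
reported on its own instead of aborting the whole run.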


7) Common list of (bioinformatics?) applications of interest

Last but not least:
So far, I have been somewhat reluctant to create an online spreadsheet
with all the bioinfo applications that people may care about,
yet it would be such a pity to find ourselves (again) in the
funny position of 2-3 centers doing the exact same work in parallel;
this has already happened a few times (e.g. MUSCLE, Rosetta, scons, UDUNITS...)
So, how would you like to handle it to avoid duplication of work?
How many of you care about this at all?


thanks for reading this,

Fotis


-- 
echo "sysadmin know better bash than english" | sed s/min/mins/ \
        | sed 's/better bash/bash better/' # Yelling in a CERN forum
