Author: plessy
Date: 2009-02-19 10:16:14 +0000 (Thu, 19 Feb 2009)
New Revision: 3130

Added:
   trunk/community/infrastructure/getData/ChangeLog
   trunk/community/infrastructure/getData/debian/compat
   trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData
   trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk
Removed:
   trunk/community/infrastructure/getData/getData.txt
Modified:
   trunk/community/infrastructure/getData/debian/control
   trunk/community/infrastructure/getData/getData
   trunk/community/infrastructure/getData/getData.conf.d/dog.getData
   trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk
   trunk/community/infrastructure/getData/getData.conf.d/pdb.getData
   trunk/community/infrastructure/getData/getData.conf.d/rfam.getData
Log:
Added limited support for mouse RefSeq, and a changelog.

2009-02-19  Charles Plessy  <[email protected]>

        * ChangeLog: Added. Let's follow the GNU coding standards.
        http://www.gnu.org/prep/standards/html_node/Change-Logs.html

        * getData.conf.d: Added support for mouse and human RefSeq.
        * getData.pl: Removed human RefSeq.

        * getData.conf.d/dog.getData, getData.conf.d/rfam.getData,
        getData.conf.d/pdb.getData: Print on STDERR only if verbose.
        * getData.conf.d/dog.getData.mk: added missing parenthesis around a
        make variable.

Added: trunk/community/infrastructure/getData/ChangeLog
===================================================================
--- trunk/community/infrastructure/getData/ChangeLog (rev 0)
+++ trunk/community/infrastructure/getData/ChangeLog 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,12 @@
+2009-02-19  Charles Plessy  <[email protected]>
+
+        * ChangeLog: Added. Let's follow the GNU coding standards.
+        http://www.gnu.org/prep/standards/html_node/Change-Logs.html
+
+        * getData.conf.d: Added support for mouse and human RefSeq.
+        * getData.pl: Removed human RefSeq.
+
+        * getData.conf.d/dog.getData, getData.conf.d/rfam.getData,
+        getData.conf.d/pdb.getData: Print on STDERR only if verbose.
+        * getData.conf.d/dog.getData.mk: added missing parenthesis around a
+        make variable.

Added: trunk/community/infrastructure/getData/debian/compat
===================================================================
--- trunk/community/infrastructure/getData/debian/compat (rev 0)
+++ trunk/community/infrastructure/getData/debian/compat 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1 @@
+7

Modified: trunk/community/infrastructure/getData/debian/control
===================================================================
--- trunk/community/infrastructure/getData/debian/control 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/debian/control 2009-02-19 10:16:14 UTC (rev 3130)
@@ -2,7 +2,8 @@
 Section: science
 Priority: optional
 Maintainer: Steffen Moeller <[email protected]>
-Build-Depends: cdbs, debhelper (>= 5)
+Uploaders: Charles Plessy <[email protected]>
+Build-Depends: cdbs, debhelper (>= 7)
 Standards-Version: 3.8.0
 Homepage: http://debian-med.alioth.debian.org

Modified: trunk/community/infrastructure/getData/getData
===================================================================
--- trunk/community/infrastructure/getData/getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -405,12 +405,6 @@
         source => "wget $sharedWgetOptions http://www.reactome.org/download/interactions.README.txt http://www.reactome.org/download/current/homo_sapiens.interactions.txt.gz"
     },

-# Proof-of-principle for RefSeq. Does not include everything.
-    "refseq.hsa" => {
-        name => "The NCBI Reference Sequence project - Homo sapiens",
-        source => "wget $sharedWgetOptions ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/*ff.gz"
-    },
-
     "pfam-a" => {
         name => "Pfam-A : Manually curated protein families and domains, only the seed is presented.",
         source => "wget $sharedWgetOptions ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A-seed.gz"

Added: trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData (rev 0)
+++ trunk/community/infrastructure/getData/getData.conf.d/RefSeq.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,18 @@
+# Proof-of-principle for RefSeq. Does not include everything.
+print STDERR "Reading Canis lupus familiaris configuration file\n" if $verbose;
+
+$toBeMirrored{"refseq.hsa"} = {
+    "name" => "The NCBI Reference Sequence project – Homo sapiens",
+    "tags" => ["human", "proteome", "transcriptome"],
+    "source" => "ln -s /etc/getData.conf.d/RefSeq.mk Makefile ; SPECIES=H_Sapiens make get unpack",
+    "post-download" => "make emboss blast"
+  },
+
+$toBeMirrored{"refseq.mmu"} = {
+    "name" => "The NCBI Reference Sequence project – Mus musculus",
+    "tags" => ["mouse", "proteome", "transcriptome"],
+    "source" => "ln -s /etc/getData.conf.d/RefSeq.mk Makefile ; SPECIES=M_musculus make get unpack",
+    "post-download" => "make emboss blast"
+  },
+
+1;

Added: trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk (rev 0)
+++ trunk/community/infrastructure/getData/getData.conf.d/RefSeq.mk 2009-02-19 10:16:14 UTC (rev 3130)
@@ -0,0 +1,13 @@
+SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)
+
+# $SPECIES is provided to make in the call from /etc/getData.conf.d/RefSeq.getData.
+
+get:
+	wget $(SHARED_WGET_OPTIONS) ftp://ftp.ncbi.nih.gov/refseq/$(SPECIES)/mRNA_Prot/*ff.gz
+
+unpack:
+	for file in *ff.gz ; do zcat $$file > `basename $$file .gz` ; done
+
+blast:
+
+emboss:

Modified: trunk/community/infrastructure/getData/getData.conf.d/dog.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/dog.getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/dog.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,4 +1,4 @@
-print STDERR "Reading Canis lupus familiaris configuration file\n";
+print STDERR "Reading Canis lupus familiaris configuration file\n" if $verbose;

 $toBeMirrored{"dog.genome"}={
     "name" => "CanFam2.0 - Dog Genome Sequencing Project",

Modified: trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/dog.getData.mk 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,7 +1,7 @@
 SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)

 get:
-	wget $SHARED_WGET_OPTIONS ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.BROADD2.50.dna.chromosome.*.fa.gz
+	wget $(SHARED_WGET_OPTIONS) ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.BROADD2.50.dna.chromosome.*.fa.gz

 unpack:
 	for file in *chromosome.*.fa.gz ; do zcat $$file > `basename $$file .gz` ; done

Modified: trunk/community/infrastructure/getData/getData.conf.d/pdb.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/pdb.getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/pdb.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,5 +1,5 @@

-print STDERR "Reading PDB configuration file\n";
+print STDERR "Reading PDB configuration file\n" if $verbose;

 $toBeMirrored{"pdb"}={
     "name" => "PDB - protein structure database",

Modified: trunk/community/infrastructure/getData/getData.conf.d/rfam.getData
===================================================================
--- trunk/community/infrastructure/getData/getData.conf.d/rfam.getData 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.conf.d/rfam.getData 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,4 +1,4 @@
-print STDERR "Reading Canis lupus familiaris configuration file\n";
+print STDERR "Reading Rfam configuration file\n" if $verbose;

 $toBeMirrored{"dog.genome"}={
     "name" => "Rfam9.1 - Multiple alignments and covariance models of non-coding RNA families",

Deleted: trunk/community/infrastructure/getData/getData.txt
===================================================================
--- trunk/community/infrastructure/getData/getData.txt 2009-02-18 20:18:44 UTC (rev 3129)
+++ trunk/community/infrastructure/getData/getData.txt 2009-02-19 10:16:14 UTC (rev 3130)
@@ -1,207 +0,0 @@
-NAME
-    getData - retrieves databases from the Internet
-
-SYNOPSIS
-    getData [ --mirrordir <path> ] <list of db names>
-
-    getData --list
-
-DESCRIPTION
-    Bioinformatics has the intrinsic problem to bring the biological data to
-    the end user. Astronomers have the equivalent problem and particle
-    physicists, well, they haven come up with (first) the web and (second)
-    the computational grids to address their problems. Debian helps with the
-    programs but will not provide such huge datasets that are even
-    frequently updated - not even in volatile.debian.org. Most
-    bioinformatics researchers will not need too many of such databases. And
-    even more so will gladly continue in using public services remotely.
-
-    For those who need a set of databases on a regular basis, this script
-    shall be a start to automate the burden to download the data and update
-    indices and the like. The world has seen such magic before with the Lion
-    Biosciences Prisma tool
-    (http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf) but how about
-    something simpler (as a start) that at least gets close to what we
-    desire and is Free. The aim must be to address the needs of all (most)
-    communities, not only of the bioinformatics world. The seed was hence
-    made with databases from astronomy.
-
-    Please contact the Debian-Med community if you consider this program to
-    be almost ready for your needs and explain what still needs to be added.
-    Public databases that you managed to integrate with this system are also
-    very warmly welcomed as feedback.
-
-OPTIONS
-    --help
-        this help
-
-    --man
-        Present a more detailed description in form of a man page.
-
-    --verbose
-        Say one or two words more than required.
-
-    --mirrordir <path>
-        Specifies destination directory. The data will be mirrored to the
-        folder $mirrordir/$dbname/. Please be aware that this mirrordir is
-        nowhere stored. The directory can consequently be moved to arbitrary
-        locations at any time, if the users of the data are only informed
-        about that moving.
-
-    --list
-        Lists all databases that may be requested to be installed.
-
-    <list of db names>
-        Only those databases that are explicitly requested to be downloaded
-        will be downloaded. Such databases may require considerable
-        bandwidth, so please make sure you know you are doing the right
-        thing.
-
-    --post
-        Perform only the unpacking/indexing, but do not retrieve/update the
-        databases. This option is considered useful when adding a new
-        database management system to the system, e.g. after installing
-        EMBOSS.
-
-    --source
-        Perform only the unpacking/indexing, but do not retrieve/update the
-        databases. This option may be beneficial when the site administator
-        is aware of current analyses that should not be disturbed by the
-        indexing process but the downloading from the net can already be
-        started.
-
-    --confd <directory>
-        Allows for the specification of a directory in which multiple files
-        can be stored that will be read by getData upon its invocation.
-        These may add values to the global variable %toBeMirrored that
-        specifies the databases and their download scripts.
-
-    --config <system>
-        Preparation of the configuration file that would be reuired for a
-        particular system that deals with the database. The configuration is
-        printed to stdout and is expected to be copied manually to the
-        proper file or folder. One could imagine this process to be
-        automated, though this is not yet implemented. Currently available
-        is support for two systems:
-
-        emboss  This specifies the EMBOSS suite of tools for bioinformatics
-                (www.emboss.org) that is also available as a Debian package.
-                The configuration for the Uniprot databases will allow the
-                sequence retrieval with the seqret tool.
-
-        dre - ARC Grid Runtime Environment
-                Runtime environments (REs) are a concept of the ARC grid
-                middleware of which more can be learned on
-                http://www.nordugrid.org. A script is needed to indicate the
-                presence of a runtime environment. Here, the name of the
-                script is important, which is not definable by getData
-                though since it only writes to stdout.
-
-        Unfortunately, the configuration was not yet be found to be
-        modularised. It all needs to happen within the getData script
-        itself.
-
-    --remove <list of dbnames>
-        This command removes folders that store the data. In principle this
-        could be perfomed manually, though some databases may have special
-        requirements pre- or post-removal, which can be specified
-        individually for every database.
-
-SPECIFICATION OF DATABASES
-    Databases for download and their post-processing are specified at two
-    different locations. One is the getData script itself, the other are
-    files stored in /etc/getData.d. Either will define elements of a
-    considerably large hash. The key is the identifier which is also shown
-    by the 'getData --list' directive. The value is a reference to another
-    hash, which assigns values to all the properties that a database has for
-    its download and post-processing:
-
-    name - a human-readable pretty-printed name or short description that
-    makes clear to the world what this database is about.
-        A bad example is the mere assignment of "DE405", which few people
-        understand. A better example is "Pfam-A : Manually curated protein
-        families and domains, only the seed is presented.". One could argue
-        that one should have that field renamed to "description".
-
-    source - shell commands to perform the initial download and subsequent
-    updates
-        Commonly the wget tool is used for download. The such presented
-        little script is executed underneath the mirrordir directory. One
-        simple example is "wget --mirror
-        ftp://ssd.jpl.nasa.gov/pub/eph/export/unix/unxp2[01]*.405". With
-        increasing proficiency in using wget, one is tempted to substitute
-        "--mirror" with "--recursive --no-host-directories --no-directories
-        --level 1 --no-parent".
-
-    post-download - shell commands to perform after the data has been
-    downloaded.
-        A simple (and unnecessary when used the right flags to wget) example
-        is the mere setting of a symbolic link:
-
-         "post-download" => "ln -s ssd.jpl.nasa.gov/pub/eph/export/unix/unxp*.405 ."
-
-        Some more effort has been put into TrEMBL for the merging of
-        releases with subsequent updates and the indexing for EMBOSS:
-
-          "d=uncompressed; if [ ! -d \$d ]; then mkdir \$d; fi; "
-          ."rm -rf \$d/trembl.dat; "
-          ."(find ftp.ebi.ac.uk -name '*.dat.gz' | xargs -r zcat ) > \$d/trembl.dat; "
-          ."[ -x /usr/bin/dbxflat ] "
-          . "&& cd \$d && "
-          . "dbxflat -dbresource embl -dbname trembllocal -idformat swiss -filenames=trembl.dat -fields id,acc -auto",
-
-        The dots are connecting strings in Perl. This helps the readability
-        of the code. When writing these scripts, please be aware the
-        newlines don't separate the individual commands here. Semicolon are
-        required.
-
-    recommends - suggests a series of packages to be present for the use of
-    the database or the performance of the indexing.
-        This information is not used at the moment, also to render this
-        script more useful for other Linux distributions than Debian.
-
-EXAMPLES
-    The following will list the identifiers and the descriptions of the
-    first 4 databases that area available via getData on your system.
-
-      ./getData --mirrordir=/local/databases/mirrored --list | head 4
-
-    To install any particular database, only give its name as an argument.
-    If the installation is performed at another directory than the default,
-    then the --mirrordir needs again to be set.
-
-      ./getData swiss.dat
-
-    To remove the database again, give the script a hint with the --remove
-    flag
-
-      ./getData --remove swiss.dat
-
-    To perform the indexing only and circumvent the download (attention,
-    this is dangerous since the index files will look newer than the
-    database is), do
-
-      ./getData --post swiss.dat
-
-    A special exception to these extra scripts is the --config flag in that
-    it takes a list of extra arguments. Each shall denote a particular
-    system that this database may be of interest for. There are today two
-    systems supported:
-
-TODO
-    We now need a mechanism with which packages can specify hooks that shall
-    be called upon an update of a database. But we cannot assume that every
-    indexing that can be performed because of the installation of some
-    package is also desired by the user. How to configure this properly is
-    left to be decided.
-
-SEE ALSO
-    http://debian-med.alioth.debian.org, http://wiki.debian.org/DebianMed,
-    /etc/getData.conf
-
-AUTHORS
-    This script was prepared by Steffen Moeller <[email protected]> and
-    Charles Plessy <[email protected]> and is distributed under the
-    terms of the GNU Public License (GPL). On Debian systems, this license
-    can be found under /usr/share/common-licenses/GPL.
-

_______________________________________________
debian-med-commit mailing list
[email protected]
http://lists.alioth.debian.org/mailman/listinfo/debian-med-commit
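
Note on the "unpack" rules in RefSeq.mk and dog.getData.mk: the loop they run can be tried outside make. The following is only a minimal sketch, not part of the commit. The function name unpack_ff_gz and its directory argument are hypothetical, and gzip -dc is assumed to be an acceptable stand-in for zcat. In the Makefile the variable is written $$file because make consumes one dollar sign before handing the line to the shell; in plain shell a single $ is used.

```shell
#!/bin/sh
# Sketch of the "unpack" rule as a plain shell function.
# unpack_ff_gz and its optional directory argument are hypothetical names
# introduced for illustration; the Makefile rule works in the current directory.
unpack_ff_gz() {
    dir=${1:-.}
    for file in "$dir"/*ff.gz; do
        # When nothing matches, the glob stays a literal string; skip it.
        [ -e "$file" ] || continue
        # Decompress next to the archive and keep the .gz file, as the
        # Makefile rule does with zcat and basename.
        gzip -dc "$file" > "${file%.gz}"
    done
}
```

Called as "unpack_ff_gz /path/to/mirror/refseq.mmu" (an illustrative path), it leaves one uncompressed file per archive. Quoting "$file" also protects against file names containing spaces, which the unquoted Makefile loop does not.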
