Dear Hiram,

Many thanks for the detailed information. We will test your recommended commands and give you feedback.

Best regards

Jacques van Helden
Université d'Aix-Marseille (AMU). Lab. Technological Advances for Genomics and Clinics (TAGC)
INSERM Unit U1090, 163, Avenue de Luminy, 13288 MARSEILLE cedex 09. France
Fax: +33 4 91 82 87 01
Web: http://jacques.van-helden.perso.luminy.univmed.fr/
Email: [email protected]

On 20 Mar 2012, at 19:17, Hiram Clawson wrote:

> Good Morning Jacques:
>
> The data files for all multiple alignments are currently just
> under 1 Tb in size (uncompressed). The best way for you to access that data
> in an efficient manner is to actually have the .maf files at
> your site and use the maf selection tools from the kent source
> code to extract information from those files. It would be
> very difficult to access this information via the DAS or
> table browser interface due to the immense amount of data in the
> answer sets and the processing time to extract an answer.
>
> There are several mechanisms you can use to obtain the maf
> files for local use. The rsync server at hgdownload can be
> used to obtain a list of files.
> For example, to obtain
> a list of the uncompressed maf files used by the genome browser:
>
>   rsync -navP --exclude 'genbank/' rsync://hgdownload.cse.ucsc.edu/gbdb/ 2>&1 \
>     | grep multiz | grep -v "^d" | egrep 'maf$' > /tmp/gbdb.maf.file.list
>
> Alternatively, the gzip-compressed maf files from the goldenPath downloads:
>
>   rsync -navP rsync://hgdownload.cse.ucsc.edu/goldenPath/ 2>&1 \
>     | grep "multiz" | grep "maf.gz" | grep -v upstream \
>     > /tmp/goldenPath.maf.gz.file.list
>
> To select the file names from those listings:
>
>   awk '{print $NF}' /tmp/goldenPath.maf.gz.file.list > /tmp/fetch.maf.list
>
> And then to transfer just those files:
>
>   rsync -avP --files-from=/tmp/fetch.maf.list \
>     rsync://hgdownload.cse.ucsc.edu/goldenPath/ ./
>
> The hierarchy of those files will be constructed in ./
>
> You can now work directly with the maf files to answer all questions about
> the alignment,
> for example, extract a list of species in the alignment:
>
>   mafSpeciesList file.maf.gz stdout
>
> Note the maf utilities in the kent source tree:
>
>   mafAddIRows mafAddQRows mafCoverage mafFetch mafFilter
>   mafFrag mafFrags mafGene mafMeFirst mafOrder mafRanges
>   mafSpeciesList mafSpeciesSubset mafSplit mafSplitPos
>   mafToAxt mafToPsl mafsInRegion
>
> --Hiram
>
> Jacques van Helden wrote:
>> Dear UCSC team,
>> First of all, thank you very much for developing and maintaining the UCSC
>> Genome Browser, which is a great resource for all the community. We
>> have been developing, since 1997, a software suite called Regulatory Sequence
>> Analysis Tools (RSAT, http://rsat.ulb.ac.be/rsat/). For a list of supported
>> functionalities, see http://www.ncbi.nlm.nih.gov/pubmed/18495751
>> and the 2011 update
>> http://www.ncbi.nlm.nih.gov/pubmed/21715389
>> We recently developed a new tool called peak-motifs, which detects
>> transcription factor binding motifs in full collections of ChIP-seq peaks.
>> http://www.ncbi.nlm.nih.gov/pubmed/22156162
>> We are now extending the approach to analyze conserved motifs under the
>> peaks. We are currently using the MAF files produced by multiz, but this
>> requires us to maintain a local copy of all the multiz alignments, which
>> poses problems of consistency with updates of supported genomes.
>> We would thus like to establish a programmatic connection to UCSC Genome
>> Browser, in order to dynamically retrieve multi-genome alignments of the
>> conserved regions covered by a set of peaks (more generally, we would like
>> to obtain the MAFs under a set of genomic coordinates specified as a bed
>> file). We already saw how to use your DAS interface for retrieving
>> single-organism sequences under the peaks, but we did not find the
>> equivalent for retrieving the MAFs and the related taxonomic information.
>> Could you tell us whether there is programmatic access to UCSC (DAS,
>> SOAP/WSDL, Perl modules, Python modules or anything else) that would allow
>> us to run the following queries?
>> 1) Return the list of organisms for which a multi-z alignment is available.
>> Currently, we must first get (with DAS) the list of all supported organisms,
>> and then send one request for each organism in order to know whether it
>> contains one or several multizNway attributes.
>> 2) Given the name of a reference organism, obtain the list of other
>> organisms aligned with its genome in the multizNway alignments (the list
>> varies from organism to organism).
>> 3) Given a clade, obtain the list of included organisms.
>> 4) Given a set of genomic coordinates (bed file), retrieve the subset of
>> MAFs intersecting these coordinates.
>> It would be even better if the method allowed the client to specify a
>> subset of organisms for which the aligned sequences would be returned (much
>> in the same way as the UCSC viewer allows selecting a subset of organisms to
>> be displayed in the multiz track).
>> Many thanks for your help
>> Pr. Jacques van Helden
>> Université d'Aix-Marseille (AMU). Lab. Technological Advances for Genomics
>> and Clinics (TAGC)
>> INSERM Unit U1090, 163, Avenue de Luminy, 13288 MARSEILLE cedex 09. France
>> Fax: +33 4 91 82 87 01
>> Web: http://jacques.van-helden.perso.luminy.univmed.fr/
>> Email: [email protected]

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
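
Note on the file-selection step in Hiram's reply: since the full set of alignments is close to 1 Tb, it may be useful to narrow the rsync listing to a single assembly before building the fetch list. The following is only a sketch under stated assumptions, not part of the original thread: the function name select_maf_paths is hypothetical, the sample listing lines (sizes, dates, paths) are illustrative and not real server output, and the listing is assumed to have the file path as its last whitespace-separated field, as in the awk command quoted in the thread.

```shell
#!/bin/sh
# Sketch: keep only the .maf.gz paths of one assembly from an rsync listing.
# select_maf_paths is a hypothetical helper, not a kent or rsync tool.
# Usage: select_maf_paths ASSEMBLY < listing > fetch.list
select_maf_paths() {
    # $NF is the last whitespace-separated field of each listing line,
    # assumed to be the path relative to the rsync module.
    awk -v asm="$1" '
        index($NF, asm "/") == 1 && $NF ~ /\.maf\.gz$/ { print $NF }
    '
}

# Illustrative listing lines only (sizes and dates are made up).
sample_listing() {
    cat <<'EOF'
-rw-rw-r--     1000 2009/11/10 10:00:00 hg19/multiz46way/maf/chr1.maf.gz
-rw-rw-r--     2000 2009/11/10 10:00:00 hg19/multiz46way/maf/chr2.maf.gz
-rw-rw-r--     3000 2008/01/15 09:30:00 mm9/multiz30way/maf/chr1.maf.gz
EOF
}

# Prints the two hg19 paths from the sample listing.
sample_listing | select_maf_paths hg19
```

With a real listing, the output could be redirected to /tmp/fetch.maf.list and passed to rsync with --files-from, exactly as in the transfer command quoted in the thread.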
