Good Morning Asta:

Please note the three different versions of D. virilis sequences available:
http://hgdownload.cse.ucsc.edu/downloads.html#droVir

The sequence for droVir2 is indeed in the file scaffoldFa.gz, as mentioned in the README file:
http://hgdownload.cse.ucsc.edu/goldenPath/droVir2/bigZips/README.txt

    scaffoldFa.gz - The working draft sequence in one FASTA record per scaffold. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are in lower case while non-repeating sequence is in upper case.

For potential gene tracks, take a look at the genome browser display of these genomes:
http://genome.ucsc.edu/cgi-bin/hgTracks?db=droVir2
and note the different types of gene tracks.

Any file in the database dump directory will be for that genome assembly. If you need information about the structure of a file, look at the corresponding .sql file. Some of the formats will correspond to our standard file formats:
http://genome.ucsc.edu/FAQ/FAQformat.html

There are many file format converter programs in the kent source tree:
http://genome.ucsc.edu/admin/git.html
http://genome.ucsc.edu/admin/jk-install.html

--Hiram

----- Original Message -----
From: "Asta Laiho" <[email protected]>
To: [email protected]
Sent: Wednesday, March 23, 2011 6:37:52 AM
Subject: [Genome] D.virilis genome & gene annotation files for ngs mapping

Hi,

I am working with D.virilis SOLiD next-generation sequencing transcriptomics data. To begin, I need to map the data to the D.virilis reference genome and count the read tags for annotated genes. I looked at the D.virilis files available at:
http://hgdownload.cse.ucsc.edu/goldenPath/droVir2/

I need the genome sequence in fasta format and the gene information in gtf (or gff) format, with coordinates corresponding to the genome fasta file.
When I looked at the files available at the UCSC download repository, it unfortunately wasn't completely clear to me which of those files would be the best to use, especially since I can't find any documentation clearly explaining the content of the different files under the 'database' folder.

For the genome, the SOLiD support person advised that 'scaffoldFa.gz' under the 'bigZips' folder would probably be the best file to use. I was wondering whether the 'xenoRefGene.txt' file would provide a good gene annotation? But as there isn't any documentation, how can I be sure that the gene coordinates given in this file match the coordinates as indexed in the 'scaffoldFa.gz' genome fasta file? This would be quite hard to check, so it would be really helpful if someone could clarify this for me. I will of course still need to convert this information into gtf (or gff), as this does not seem to be readily available.

Thanks in advance for any advice regarding this matter,

Greetings,
Asta Laiho

--
High-throughput Bioinformatics Group
Turku Centre for Biotechnology
University of Turku, Finland

_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
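
On the question of checking that the coordinates in 'xenoRefGene.txt' match 'scaffoldFa.gz', and of getting gtf out of the table: a minimal Python sketch follows. It is only an illustration under stated assumptions, not a UCSC tool. It assumes the table dump is tab-separated genePred with a leading 'bin' column (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, ...); please confirm the column order against xenoRefGene.sql before relying on it. The function names are made up for this example. The kent source tree mentioned above also has ready-made converters, which are likely the better route for production use.

```python
# Sketch: sanity-check genePred rows against FASTA record lengths and
# emit GTF exon lines. The column layout is an ASSUMPTION (genePred with a
# leading 'bin' column); verify it against the table's .sql file.

def fasta_lengths(lines):
    """Return {record_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word of the header as the name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def parse_genepred(line):
    """Parse one table row; assumes column 0 is 'bin' (hypothetical layout)."""
    f = line.rstrip("\n").split("\t")
    starts = [int(x) for x in f[9].split(",") if x]
    ends = [int(x) for x in f[10].split(",") if x]
    return {
        "name": f[1],
        "chrom": f[2],
        "strand": f[3],
        "txStart": int(f[4]),
        "txEnd": int(f[5]),
        "exons": list(zip(starts, ends)),
        # name2 (gene symbol) is present only in the extended layout.
        "name2": f[12] if len(f) > 12 else f[1],
    }


def check_coordinates(genes, lengths):
    """List (gene_name, problem) pairs for rows that do not fit the FASTA."""
    problems = []
    for g in genes:
        if g["chrom"] not in lengths:
            problems.append((g["name"], "unknown sequence " + g["chrom"]))
        elif g["txEnd"] > lengths[g["chrom"]]:
            problems.append((g["name"], "end beyond sequence"))
    return problems


def to_gtf_exons(g, source="xenoRefGene"):
    """Return GTF exon lines; genePred is 0-based half-open, GTF is 1-based."""
    rows = []
    for start, end in g["exons"]:
        attrs = 'gene_id "%s"; transcript_id "%s";' % (g["name2"], g["name"])
        rows.append("\t".join([
            g["chrom"], source, "exon", str(start + 1), str(end),
            ".", g["strand"], ".", attrs,
        ]))
    return rows
```

To use it, open both files with gzip.open(path, "rt") (or plain open() for an uncompressed table), pass the FASTA lines to fasta_lengths(), parse each table row with parse_genepred(), and look at what check_coordinates() reports. An empty list only shows that the names and ends are consistent with the scaffolds; it is not a proof that the annotation is good. The same accession may appear at more than one location in such a table, so the transcript_id values may need to be made unique before read counting.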
