Introduction to Bioinformatics using LINUX

http://www.prstatistics.com/course/introduction-to-bioinformatics-using-linux-ibul02/

Instructor: Dr. Martin Jones

This course will run from 16th to 20th October 2017 at SCENE (the Scottish 
Centre for Ecology and the Natural Environment), Loch Lomond National Park, 
Glasgow.

Course overview: Most high-throughput bioinformatics work these days takes 
place on the Linux command line. The programs which do the majority of the 
computational heavy lifting ― genome assemblers, read mappers, and 
annotation tools ― are designed to work best when used with a command-line 
interface. Because the command line can be an intimidating environment, 
many biologists learn the bare minimum needed to get their analysis tools 
working. This means that they miss out on the power of Linux to customize 
their environment and automate many parts of the bioinformatics workflow. 
This course will introduce the Linux command line environment from scratch 
and teach students how to make the most of its tools to achieve a high 
level of productivity when working with biological data.

Availability: 15 places total.

Course programme
Monday 16th - Classes from 09:00 to 17:00 (approximately)
● Session 1 - The design of Linux
In the first session we briefly cover the design of Linux: how is it 
different from Windows/OSX and how is it best used? We'll then jump 
straight onto the command line and learn about the layout of the Linux 
filesystem and how to navigate it. We'll describe Linux's file permission 
system (which often trips up beginners), how paths work, and how we 
actually run programs on the command line. We'll learn a few tricks for 
using the command line more efficiently, and how to deal with programs that 
are misbehaving. We'll finish this session by looking at the built-in help 
system and how to read and interpret manual pages.

● Session 2 - System management
We'll first look at a few command line tools for monitoring the status of 
the system and keeping track of what's happening to processor power, 
memory, and disk space. We'll go over the process of installing new 
software from the built-in repositories (which is easy) and from source 
code downloads (which is trickier). We'll also introduce some tools for 
benchmarking software (measuring the time/memory requirements of processing 
large datasets).

Tuesday 17th - Classes from 09:00 to 17:00 (approximately)

● Session 3 - Manipulating tabular data
Many data types we want to work with in bioinformatics are stored as 
tabular plain text files, and here we learn all about manipulating tabular 
data on the command line. We'll start with simple things like extracting 
columns, filtering, sorting, and searching for text, before moving on to more 
complex tasks like searching for duplicated values, summarizing large 
files, and combining simple tools into long commands.
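
As a rough illustration of the operations named above (extracting a column, filtering, sorting, and finding duplicated values), here is a minimal sketch in Python. The course itself works with command-line tools such as cut, sort, and uniq, so this is only a language-neutral picture of the same ideas. The table contents, column layout, and function names are all assumptions made up for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated table (gene, chromosome, length); not course data.
TABLE = (
    "geneA\tchr1\t1200\n"
    "geneB\tchr2\t800\n"
    "geneC\tchr1\t450\n"
    "geneA\tchr3\t1200\n"
)


def read_rows(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def column(rows, index):
    """Extract one column, similar in spirit to `cut -f`."""
    return [row[index] for row in rows]


def duplicated(values):
    """Return the values that occur more than once, in sorted order."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


rows = read_rows(TABLE)
genes = column(rows, 0)

# Filter: keep only rows whose second column is chr1.
chr1_rows = [row for row in rows if row[1] == "chr1"]

# Sort: order rows by the numeric third column, longest first.
by_length = sorted(rows, key=lambda row: int(row[2]), reverse=True)

print("duplicated gene names:", duplicated(genes))
print("genes on chr1:", column(chr1_rows, 0))
print("longest gene:", by_length[0][0])
```

Each helper does one small job, which mirrors the approach described above of combining simple tools into longer commands.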

● Session 4 - Constructing pipelines
In this session we will look at the various tools Linux has for 
constructing pipelines out of individual commands. Aliases, shell 
redirection, pipes, and shell scripting will all be introduced here. We'll 
also look at a couple of specific tools for running programs on 
multiple processors and for monitoring the progress of long-running tasks.

Wednesday 18th - Classes from 09:00 to 17:00 (approximately)

● Session 5 - EMBOSS
EMBOSS is a suite of bioinformatics command-line tools explicitly designed 
to work in the Linux paradigm. We'll get an overview of the different 
sequence data formats that we might expect to work with, and put what we 
learned about shell scripting to biological use by building a pipeline to 
compare codon usage across two collections of DNA sequences.

● Session 6 - Using a Linux server
Often in bioinformatics we'll be working on a Linux server rather than our 
own computer, typically because we need access to more computing power, or 
to specialized tools and datasets. In this session we'll learn how to 
connect to a Linux server and how to manage sessions. We'll also consider 
the various ways of moving data between your own computer and a server, 
and finish with a discussion of what to bear in mind when working on a 
shared computer.

Thursday 19th - Classes from 09:00 to 17:00 (approximately)

● Session 7 - Combining methods
In the next two sessions ― i.e. one full day ― we'll put everything we have 
learned together and implement a workflow for next-gen sequence analysis. 
In this first session we'll carry out quality control on some paired-end 
Illumina data and map these reads to a reference genome. We'll then look at 
various approaches to automating this pipeline, allowing us to quickly do 
the same for a second dataset.

● Session 8 - Combining methods
The second part of the next-gen workflow is to call variants to identify 
SNPs between our two samples and the reference genome. We'll look at the 
VCF file format and figure out how to filter SNPs for read coverage and 
quality. By counting the number of SNPs between each sample and the 
reference we will try to figure out something about the biology of the two 
samples. We'll attempt to automate this analysis in various ways so that we 
could easily repeat the pipeline for additional samples.
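
To make the filtering step above a little more concrete, here is a minimal Python sketch of filtering SNPs in a VCF file by quality and read coverage. It is not the course's method: the records, the thresholds (min_qual, min_depth), and the function names are assumptions invented for this example. It relies only on the standard VCF layout, in which header lines start with "#", the first eight columns are tab-separated, and read depth may be given as DP in the INFO column.

```python
# Hypothetical VCF fragment; the records and thresholds below are made up.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t101\t.\tA\tG\t50.0\t.\tDP=30
chr1\t250\t.\tC\tT\t12.5\t.\tDP=40
chr1\t400\t.\tG\tA\t60.0\t.\tDP=4
chr1\t512\t.\tT\tTA\t99.0\t.\tDP=35
"""

BASES = {"A", "C", "G", "T"}


def parse_info(info):
    """Turn 'DP=30;AF=0.5' into {'DP': '30', 'AF': '0.5'}; bare flags map to ''."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def filter_snps(vcf_text, min_qual=20.0, min_depth=10):
    """Return (chrom, pos) for single-base variants passing both thresholds.

    Assumes QUAL is numeric in every record; a missing QUAL ('.') is not handled.
    """
    kept = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, qual, _filter, info = line.split("\t")[:8]
        if ref not in BASES or alt not in BASES:
            continue  # keep only simple single-base substitutions
        depth = int(parse_info(info).get("DP", 0))
        if float(qual) >= min_qual and depth >= min_depth:
            kept.append((chrom, int(pos)))
    return kept


snps = filter_snps(VCF_TEXT)
print(len(snps), "SNP(s) kept:", snps)
```

Counting the records that survive for each sample, for example with len(filter_snps(...)), gives the per-sample SNP count mentioned above.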

Friday 20th - Classes from 09:00 to 16:00 (approximately)

● Session 9 - Customization
Part of the Linux design is that everything can be customized. This can be 
intimidating at first but, given that bioinformatics work is often fairly 
repetitive, can be used to good effect. Here we'll learn about environment 
variables, custom prompts, soft links, and ssh configuration: a 
collection of tools with modest capabilities, but which together can make 
life on the command line much more pleasant. In this last session there 
will also be time to continue working on the next-gen sequencing pipeline.

The afternoon of Friday 20th is reserved for finishing off the next-gen 
workflow exercise, working on your own datasets, or leaving early for 
travel.

Please send inquiries to [email protected] or visit the website 
www.prstatistics.com

Please feel free to distribute this information anywhere you think suitable.

Upcoming courses - email for details [email protected]

1.      ADVANCED PYTHON FOR BIOLOGISTS (February 2017) #APYB
http://www.prstatistics.com/course/advanced-python-biologists-apyb01/

2.      STABLE ISOTOPE MIXING MODELS USING SIAR, SIBER AND MIXSIAR USING R 
(February 2017) #SIMM
http://www.prstatistics.com/course/stable-isotope-mixing-models-using-r-simm03/

3.      NETWORK ANALYSIS FOR ECOLOGISTS USING R (March 2017) #NTWA
http://www.prstatistics.com/course/network-analysis-ecologists-ntwa01/

4.      ADVANCES IN MULTIVARIATE ANALYSIS OF SPATIAL ECOLOGICAL DATA (April 
2017) #MVSP
http://www.prstatistics.com/course/advances-in-spatial-analysis-of-multivariate-ecological-data-theory-and-practice-mvsp02/

5.      INTRODUCTION TO STATISTICS AND R FOR BIOLOGISTS (April 2017) #IRFB
http://www.prstatistics.com/course/introduction-to-statistics-and-r-for-biologists-irfb02/

6.      ADVANCING IN STATISTICAL MODELLING USING R (April 2017) #ADVR
http://www.prstatistics.com/course/advancing-statistical-modelling-using-r-advr05/

7.      GEOMETRIC MORPHOMETRICS USING R (June 2017) #GMMR
http://www.prstatistics.com/course/geometric-morphometrics-using-r-gmmr01/

8.      MULTIVARIATE ANALYSIS OF SPATIAL ECOLOGICAL DATA (June 2017) #MASE
http://www.prstatistics.com/course/multivariate-analysis-of-spatial-ecological-data-using-r-mase01/

9.      TIME SERIES MODELS FOR ECOLOGISTS USING R (June 2017) #TSME

10.     BIOINFORMATICS FOR GENETICISTS AND BIOLOGISTS (July 2017) #BIGB
http://www.prstatistics.com/course/bioinformatics-for-geneticists-and-biologists-bigb02/

11.     SPATIAL ANALYSIS OF ECOLOGICAL DATA USING R (August 2017) #SPAE
http://www.prstatistics.com/course/spatial-analysis-ecological-data-using-r-spae05/

12.     ECOLOGICAL NICHE MODELLING (October 2017) #ENMR
http://www.prstatistics.com/course/ecological-niche-modelling-using-r-enmr01/

13.     INTRODUCTION TO BIOINFORMATICS USING LINUX (October 2017) #IBUL
http://www.prstatistics.com/course/introduction-to-bioinformatics-using-linux-ibul02/

14.     STRUCTURAL EQUATION MODELLING FOR ECOLOGISTS AND EVOLUTIONARY 
BIOLOGISTS (October 2017) #SEMR

15.     GENETIC DATA ANALYSIS USING R (October 2017 TBC) #GDAR

16.     LANDSCAPE (POPULATION) GENETIC DATA ANALYSIS USING R (November 2017 
TBC) #LNDG
http://www.prstatistics.com/course/landscape-genetic-data-analysis-using-r-lndg02/

17.     APPLIED BAYESIAN MODELLING FOR ECOLOGISTS AND EPIDEMIOLOGISTS 
(November 2017) #ABME
http://www.prstatistics.com/course/applied-bayesian-modelling-ecologists-epidemiologists-abme03/

18.     INTRODUCTION TO METHODS FOR REMOTE SENSING (November 2017) #IRMS

19.     INTRODUCTION TO PYTHON FOR BIOLOGISTS (November 2017) #IPYB
http://www.prstatistics.com/course/introduction-to-python-for-biologists-ipyb04/

20.     DATA VISUALISATION AND MANIPULATION USING PYTHON (December 2017) 
#DVMP
http://www.prstatistics.com/course/data-visualisation-and-manipulation-using-python-dvmp01/

21.     ADVANCING IN STATISTICAL MODELLING USING R (December 2017) #ADVR
http://www.prstatistics.com/course/advancing-statistical-modelling-using-r-advr07/

22.     INTRODUCTION TO BAYESIAN HIERARCHICAL MODELLING (January 2018) #IBHM
http://www.prstatistics.com/course/introduction-to-bayesian-hierarchical-modelling-using-r-ibhm02/

23.     PHYLOGENETIC DATA ANALYSIS USING R (TBC) #PHYL


Oliver Hooker PhD.
PR statistics

3/1
128 Brunswick Street
Glasgow
G1 1TF

+44 (0) 7966500340

www.prstatistics.com
www.prstatistics.com/organiser/oliver-hooker/
