Hi Veranja,
I am going to try to address all of your questions in one go, since they
are all in the same thread. Next time, though, it would be best to send
new questions as a brand new message, not as a reply with just the
subject line changed. This helps us greatly with tracking, and helps
other users when searching prior posts.
In the first email you seemed to have some trouble with the format of
your custom reference genome, but later in the second email this seems
to be resolved, at least as far as format is concerned (SAM->BAM
conversion is possible using this genome, in Galaxy?). I am going to
point you to our help for custom reference genomes, and if you click
through to the main page there is a table with detailed format
troubleshooting help. But, I will tell you first that I do not believe
that this is going to be helpful for your overall goals, if I am
understanding correctly.
But, here is the link:
http://wiki.galaxyproject.org/Support#Custom_reference_genome
Your reference genome sounds as if it is not really a reference genome
but more of a collection of short read sequences? If the number of
sequences is very large, and the sequences are very short, you will
likely run into memory or related indexing problems with many tools.
There really isn't an easy way around this. You could try taking the
analysis to a cloud version of Galaxy and scaling up the memory to see
if that helps. You also might try breaking the job up into smaller
jobs - you mentioned that the data is from multiple genomes - perhaps
split by genome. But you will have to test this - I don't know the
actual profile of your data. I can tell you that using purely a short
read dataset, in particular one that has redundancy, will be
problematic, likely no matter what is attempted. Some assembly or other
strategy is likely required to move forward.
Galaxy CloudMan:
http://usegalaxy.org/cloud
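If you want a quick look at the profile of a multi-sequence FASTA
reference outside of Galaxy, something like the following Python sketch
could help. It is only an illustration, not a Galaxy tool: the function
names and the tiny example data are made up, and it assumes a plain
FASTA file where the identifier is the first word after ">". It reports
how many sequences there are, the shortest and longest lengths, and any
identifiers that occur more than once.

```python
import io
from collections import Counter


def fasta_lengths(handle):
    """Yield (identifier, length) for each record in a FASTA handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the identifier is the first word after '>'
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def profile(handle):
    """Summarize sequence count, length range, and repeated identifiers."""
    records = list(fasta_lengths(handle))
    lengths = [n for _, n in records]
    counts = Counter(name for name, _ in records)
    return {
        "sequences": len(records),
        "shortest": min(lengths) if lengths else 0,
        "longest": max(lengths) if lengths else 0,
        "duplicate_ids": sorted(k for k, v in counts.items() if v > 1),
    }


# Made-up example data; replace with open("your_reference.fasta")
example = io.StringIO(">ampA first\nACGTAC\nGT\n>ampB\nACG\n>ampA\nTTTT\n")
summary = profile(example)
print(summary)
```

A very large sequence count combined with very short lengths, or any
repeated identifiers, would be a hint that the file will be troublesome
as a reference.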
For the last question, different tools are probably expected to vary a
bit in their results since they use different methods. If you want to
compare datasets, using identifiers would be a good way. Convert the
files to tabular, cut out the identifiers, compare these to find
differences, then adjust the tabular files as needed, and convert back
to fastq/fasta. Tools for these sorts of functions are in the tool
groups "Text Manipulation", "FASTA manipulation", "Filter and Sort",
"Join, Subtract and Group", and "NGS: QC and manipulation". I know that
seems like a lot of places to look - but use the tool search at the top
of the tool panel and search by data type or tool name to make finding
these easier, for example "Cut" or "Join" or "Tabular" - these tools
have the names you would probably expect them to have and tool help is
directly on each form. Our 101 tutorial would also be a good
introductory overview: https://main.g2.bx.psu.edu/u/aun1/p/galaxy101
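For anyone who prefers to see the identifier comparison as code, here is
a rough Python sketch of the same idea applied to the read 1 and read 2
files from one barcode group. It is not what the Galaxy tools do
internally. The function names and example reads are invented, and it
assumes four-line FASTQ records whose mates share an identifier once any
description after the first space, or a trailing "/1" or "/2", is
removed.

```python
import io


def pair_id(header):
    """Reduce a FASTQ title line to an identifier shared by both mates."""
    name = header[1:].split()[0]  # drop '@' and any description
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def read_fastq(handle):
    """Yield (identifier, [4 record lines]); assumes 4-line records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        lines = [header] + [handle.readline() for _ in range(3)]
        record = [text.rstrip("\n") for text in lines]
        yield pair_id(record[0]), record


def keep_shared(r1_handle, r2_handle):
    """Keep only reads whose identifier is present in both files."""
    r1 = dict(read_fastq(r1_handle))
    r2 = dict(read_fastq(r2_handle))
    shared = [name for name in r1 if name in r2]  # keeps R1 order
    only_r1 = sorted(set(r1) - set(r2))
    only_r2 = sorted(set(r2) - set(r1))
    return [r1[n] for n in shared], [r2[n] for n in shared], only_r1, only_r2


# Made-up reads standing in for one barcode group's R1 and R2 output
r1_text = "@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n"
r2_text = "@read2/2\nTTAA\n+\nIIII\n@read3/2\nCCGG\n+\nIIII\n"
kept1, kept2, only1, only2 = keep_shared(io.StringIO(r1_text),
                                         io.StringIO(r2_text))
print(len(kept1), only1, only2)
```

The two "only" lists are the identifiers that landed in a group for one
read but not the other, which are the ones to drop or investigate.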
Hopefully this gives you some helpful information to work with,
Jen
Galaxy team
On 4/8/13 7:21 PM, Veranja Liyanapathirana wrote:
Dear all,
I was using the barcode splitter on MiSeq paired-end reads; however, I
am not sure if I did it correctly, as the number of reads allocated to
each barcode does not tally with the results obtained by our service
provider using one of their in-house script-based methods. I use it for
splitting some in-house barcodes. I need to make sure that read 1 and
read 2 are split into the same group, and drop the sequences where this
criterion is not met. I am not sure how to go about doing this. Would
using FASTQ joiner on the two reads and subsequent splitting work?
Thank you,
Kind Regards,
Veranja
*From:* Veranja Liyanapathirana <[email protected]>
*To:* galaxy-user <[email protected]>
*Sent:* Saturday, 6 April 2013, 23:13
*Subject:* Error in creating Depth of Coverage files after Bowtie for
Illumina alignment
Dear Galaxy team/ users,
I am sorry to spam the thread again, but I still could not figure out
what is wrong with my workflow and need some help.
As mentioned earlier, I use MiSeq reads, demultiplex for an in-house
barcode using barcode splitter, re-upload, and map against a reference
that consists of multiple short reference sequences. The workflow
goes well up to this stage, and conversion from SAM to BAM after
filtering the SAM files is also fine, but I cannot use the GATK Depth of
Coverage tool to get the alignment data or create pileups. An error
comes up in all instances.
I would really appreciate any input on this.
Thanks a lot,
Veranja Liyanapathirana
Graduate Student (Microbiology)
*From:* Veranja Liyanapathirana <[email protected]>
*To:* galaxy-user <[email protected]>
*Sent:* Thursday, 4 April 2013, 6:39
*Subject:* Using segments of sequences as a reference genome - Bowtie
for Illumina
Dear all,
My problem seems like something that should have a very simple
solution, and due to my lack of knowledge in bioinformatics I am
probably messing up the workflows. In our experiment, we used MiSeq to
sequence amplicons from a multiplex PCR. We introduced an in-house
barcode to our PCR products via an adaptor.
The MiSeq data was demultiplexed for the Illumina barcodes by our
service provider using the on-instrument MiSeq Reporter software, and I
am trying to run the rest of the process on the Galaxy web portal, with
no command-line programming.
The data for R1 and R2 was imported, and then I used barcode splitter
to demultiplex the amplicons after quality trimming. (I did not use
FASTQ Groomer, as MiSeq data is supposed to be Sanger FASTQ rather than
Illumina.)
Then the sequence trimmer was used to trim the barcode+adaptor
sequences. The results of this were re-uploaded and designated as
FASTQ for alignment.
Now for the reference genome: as our amplicons are from different
sequences, we have segmented FASTA sequences in one file with
different FASTA identifiers. When this file was input as the reference
genome and mapping was performed using Bowtie for Illumina, the
mapping completed with no errors.
I could filter the alignment file using SAM filters too. But I cannot
do any further downstream visualizations, not even SAM to BAM
conversion. I suspect that this may be due to an error in the way the
reference genome was formulated, but I cannot figure it out. I would be
extremely grateful if you could help me with this issue. I think if I
string the sequences together as one it would work, but converting this
back for interpretation then becomes an issue.
Thank you,
Kind Regards,
Veranja
Veranja Liyanapathirana
Graduate Student (Microbiology)
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org. Please keep all replies on the list by
using "reply all" in your mail client. For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists,
please use the interface at:
http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/
--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org