Re: [galaxy-dev] Data Collections

John Chilton Mon, 15 Jun 2015 08:18:17 -0700

On Wed, Jun 10, 2015 at 4:04 PM, Alexander Vowinkel
<[email protected]> wrote:
> Hi Folks,
>
> thank you so far for the previous help. I got much further.
> Now I'm stuck with data collections.
>
> Because this is quite a list, I appreciate also answers to parts of my
> questions ;)
>
> I have two issues:
> A) manual definition of data collections (any type) by user and/or admin
> B) definition of data collections as input/output of a tool and inside a
> workflow
>
>
> A) manual
> Basically I would like to create
> i) a list of fastq files (unpaired)
> ii) a paired set of two fastq files
> iii) a list of each two paired fastq files
>
> How can I do that?
> By using the web app? As user? As admin?
> By working via ssh on the server?


So each of these got much easier/more robust with the most recent release.

For the user perspective - for any of these options you will want to
load the fastq files into a history, open the "manage multiple
datasets" option
(https://wiki.galaxyproject.org/Histories#Managing_Multiple_Datasets_Easily),
select the datasets, and then choose the collection type from the
menu. Each choice pops up a widget that lets you group the datasets
into a list, a pair, or a list of pairs, depending on your selection.

The most complicated option is the list of pairs - this option is
demonstrated in the first video in Anton's recent NGS 101 -
Reference-based RNA-seq series
(https://vimeo.com/channels/884356/128265983). More information at
https://wiki.galaxyproject.org/Learn/GalaxyNGS101.

For all user-centric scenarios - you will need to get the plain
datasets into a history first. FTP upload, for instance, doesn't
support creating collections directly - you import the datasets and
then build the collection from them. Likewise - data libraries do not
currently support dataset collections. I believe there are Trello
cards for both of these issues.

For admins - there is a dataset collection API - I can point you at
examples if you want - but this doesn't seem to be your interest.
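
To give a rough idea of the API route, here is a minimal sketch that
only builds the JSON body one would POST to create a list of pairs
from datasets already in a history. The endpoint (assumed to be
/api/dataset_collections), the field names, and the dataset IDs are
assumptions and should be checked against your Galaxy release; the
helper name list_of_pairs_payload is hypothetical.

```python
import json


def list_of_pairs_payload(history_id, name, pairs):
    """Build a request body for a list:paired collection.

    `pairs` maps a sample name to a (forward_id, reverse_id) tuple of
    encoded history dataset IDs. All field names here are assumed and
    not verified against a particular Galaxy release.
    """
    elements = []
    for sample, (forward_id, reverse_id) in sorted(pairs.items()):
        # Each list element is itself a nested "paired" collection.
        elements.append({
            "name": sample,
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": forward_id},
                {"name": "reverse", "src": "hda", "id": reverse_id},
            ],
        })
    return {
        "instance_type": "history",
        "history_id": history_id,
        "name": name,
        "collection_type": "list:paired",
        "element_identifiers": elements,
    }


if __name__ == "__main__":
    # Placeholder IDs; real ones come from the history contents.
    payload = list_of_pairs_payload(
        "HISTORY_ID",
        "my paired reads",
        {"sample1": ("ID_1F", "ID_1R"), "sample2": ("ID_2F", "ID_2R")},
    )
    print(json.dumps(payload, indent=2))
```

A plain list or a single pair would use the same shape without the
nesting: collection_type "list" or "paired", with the hda entries
directly under element_identifiers.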

>
>
> B) in tool/workflow
> Here I also have different approaches I would like to realize:
> i) use a collection as input for a tool
> ii) create a collection as output of a tool
> ii.1) from known # of output parameters
> ii.2) from unknown # of output parameters
>
> For these things I was trying to find some tools in toolshed to see how they
> do it, but I couldn't quite adopt it.

I would look in the following directory instead of the tool shed -
https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools.
These are the tools used to drive the testing of the collections
implementation and contain some very stripped down examples of what is
possible.

>
> i) use a collection as input for a tool
> this is well documented - realizable by type="data_collection" and the
> collection_type.
> Unfortunately I can't test this because I can't create a collection so far
> ;) - see A

Indeed :). Good examples here are the tools in the RNA-seq
pipeline - Tophat, Bowtie2, etc.

>
> ii) create a collection as output of a tool
> Here it gets blurry for me.

So one can get very far without ever explicitly creating a collection
as the output of a tool. I contend that most of the time - if you have
a list of bam files and you want to create another list of bam files -
you just want to map some operation over them. This is demonstrated in
that RNA-seq series - and talked about in a more theoretical way in my
GCC talk from last year: http://bit.ly/gcc2014workflows.
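
The "map over" idea can be pictured with a toy model. This is not
Galaxy code: map_over and toy_sort_bam are hypothetical names, and a
dict of identifier -> file name stands in for a list collection. The
point is only that an ordinary one-dataset-in, one-dataset-out tool,
applied to each element, yields a new list with the same element
identifiers.

```python
from collections import OrderedDict


def map_over(collection, operation):
    """Apply `operation` to every element, keeping element identifiers."""
    return OrderedDict(
        (identifier, operation(dataset))
        for identifier, dataset in collection.items()
    )


def toy_sort_bam(path):
    # Hypothetical stand-in for a tool that takes one BAM and emits one BAM.
    return path.replace(".bam", ".sorted.bam")


if __name__ == "__main__":
    bams = OrderedDict([("sample1", "sample1.bam"), ("sample2", "sample2.bam")])
    for identifier, dataset in map_over(bams, toy_sort_bam).items():
        print(identifier, dataset)
```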

There are definitely cases when you want to explicitly create
collections though - the current best documentation on this is going
to be the pull request that added them. I mean the description rather
than the implementation: it lays out these same categories and how to
handle them, with explicit, complete examples.
https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-explicitly-produce-dataset

Hopefully this helps - please follow up with additional questions as
you have them. I am keen to see more developers leveraging dataset
collections.

Thanks a bunch.
-John

>
> ii.1) from known # of output parameters
> Here I didn't find a tool. I just thought, it might be a simpler case than
> ii.2 and
> good to understand the concept.
> I would be glad if someone could explain the way(s) to do this.
>
> ii.2) from unknown # of output parameters
> For this I found barcode splitter tools (also from devteam) that have
> different approaches.
> But. Their output (defined in xml) is only some report file.
> The output files seem to be fed into the history.
> And here I don't know how to get hands on these files when I want to use
> them to feed them into the next step during a workflow.
>
> Help highly appreciated!
>
> Thanks!
> Alexander
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
