Re: [galaxy-dev] Tool shed and datatypes

Jim Johnson Mon, 10 Oct 2011 10:01:41 -0700

There are a number of well-defined formats that are exchanged between
applications, e.g. BAM, GTF, etc. I wouldn't advocate proliferating those.


I see the need for Toolshed datatypes more for the intermediate file formats 
used within a suite of commands.   These can be helpful in guiding a user to 
select appropriate inputs for successive steps in an analysis.

For example, when developing the 90 some tool wrappers for the mothur 
metagenomic package,  there are many file formats that get passed among the 
mothur commands.   It greatly simplifies the user's experience if the outputs 
are typed so as to correctly filter the acceptable inputs to another command.   
I fear the amount of time I would spend providing user support if the outputs 
and inputs were generically typed.
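
As a rough illustration of how typed outputs narrow the choice of inputs, here is a minimal Python sketch. It assumes datatypes are modelled as classes and that a tool input accepts a declared type or any subclass of it. The class names and the acceptable_inputs helper are hypothetical stand-ins, not Galaxy's actual code.

```python
# Hypothetical stand-ins for a datatype class hierarchy.
class Data:
    pass

class Tabular(Data):
    pass

class SecondaryStructureMap(Tabular):
    pass

class Fasta(Data):
    pass

def acceptable_inputs(history, accepted_type):
    """Return the names of history items whose datatype matches accepted_type.

    history is a list of (name, datatype class) pairs. A subclass counts
    as a match, so a generic input type accepts everything beneath it.
    """
    return [name for name, dtype in history if issubclass(dtype, accepted_type)]

history = [
    ("structure.ssmap", SecondaryStructureMap),
    ("reads.fasta", Fasta),
    ("counts.tab", Tabular),
]

# A specifically typed input offers the user a single candidate.
print(acceptable_inputs(history, SecondaryStructureMap))
# A generically typed input offers every dataset in the history.
print(acceptable_inputs(history, Data))
```

With the specific type, only the one suitable dataset is offered. With the generic type, the user has to pick the right file out of all three.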

I'm also seeing a similar need as I am creating tool wrappers for the
GMAP/GSNAP mapping commands.   While input to GSNAP and GMAP can be fastq and 
output in SAM format, some of the more interesting use cases involve creating 
additional map stores, where specific datatypes would guide the user in setting 
the tool parameters correctly.

JJ

James E Johnson
Minnesota Supercomputing Institute, University of Minnesota


On 10/10/11 11:09 AM, Duddy, John wrote:

I agree with the risks you cited.

There is a risk in the other direction that I think is even scarier - without the ability to add data types, 
tool authors may be forced to use a "typeless" system, declaring all inputs/outputs as 
"data" or "text". While this works, it has the same drawbacks as typeless programming 
languages - deferring error detection to runtime, impairing the ability to perform static analysis, inability 
to perform transparent type conversions - in other words, the tools have to take over responsibilities from 
the framework.

Like all interesting problems, I don't think there is an "obviously right" 
answer ;-}

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-----Original Message-----
From: galaxy-dev-boun...@lists.bx.psu.edu 
[mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Paniagua, Eric
Sent: Friday, October 07, 2011 5:53 PM
To: j...@umn.edu; galaxy-dev@lists.bx.psu.edu
Cc: Greg Von Kuster
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hi all,

Just my 2 cents.

This is a really great idea to have dynamically (down-)loadable datatypes, and 
a tool config tag to express a datatype dependency is right on the money.  I 
agree with Greg in having hesitations about adding that feature though.  The 
purpose (at least as far as I see it) of the tool shed is to allow the community
to share its productivity.  New tools written by one group can be used by 
another group that may not have adequate skill, resources, or time to create 
the same tool on their own.  One issue this model can suffer from, however, is 
over-proliferation of contributions.  In this case, new tools with the same, 
overlapping, or very similar functions might be developed independently by 
multiple groups who then want to contribute to the tool shed.  I don't know how 
often this situation arises or what official contingencies are in place to 
manage them, but it is important to manage that situation carefully.  If it 
occurs with any appreciable frequency, then eventually there are many clusters
of tools available that do almost the same thing but not quite.  This is bad
for the user, bad for the maintainer, complicates communication between
researchers, etc.  This model can work nicely if the frequency of very similar
tool submissions is small, and even better if there is some management for
cleaning out broken or redundant tools.


When you allow custom datatypes to enter the picture, however, the story can become hairy 
much more quickly.  Having a limited set of officially supplied / supported datatypes 
forces the contributors of new tools to use datatypes drawn from a standard set.  Without 
that constraint, the number of datatype variants could explode.  Now the concern is not 
only that multiple contributors may submit very similar tool variants, or that each of 
them might choose to create their own datatypes to optimize their methods, but also that 
contributors of tools which are functionally dissimilar but manipulate the same general 
types of data will write their tools using new datatypes that are variants of each other. 
 Tools are essentially typed by the datatypes they accept and produce, so you won't be 
able to chain these tools together very easily at all.  Most pairs of tools will have the
"wrong" datatype, on input or output, for what a user wants to do.  The general
trend is then proliferation of clusters of redundant tools, clusters of redundant
datatypes, and growing sparsity in the "tool graph" (think of datatypes as vertices
and tools as directed [hyper]edges).
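
To make the graph picture concrete, here is a minimal Python sketch. It simplifies the hyperedge view by treating each tool as a plain directed edge from one input datatype to one output datatype. The tool and datatype names are hypothetical, and find_chain is an illustrative helper, not part of Galaxy.

```python
from collections import deque

# Hypothetical tools as (tool name, input datatype, output datatype).
TOOLS = [
    ("align", "fastq", "sam"),
    ("sam_to_bam", "sam", "bam"),
    ("custom_align", "fastq", "my_sam_variant"),
]

def find_chain(tools, source, target):
    """Breadth-first search for a shortest chain of tools from source to target.

    Returns the list of tool names to run in order, or None when no chain
    of tools connects the two datatypes.
    """
    graph = {}
    for name, src, dst in tools:
        graph.setdefault(src, []).append((name, dst))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        datatype, path = queue.popleft()
        if datatype == target:
            return path
        for name, nxt in graph.get(datatype, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

print(find_chain(TOOLS, "fastq", "bam"))
print(find_chain(TOOLS, "my_sam_variant", "bam"))
```

The variant datatype is a dead end: nothing consumes it, so the second search finds no chain. Each new variant adds a vertex without adding the edges that would connect it to existing tools, which is the sparsity described above.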


So, a move in the direction of supporting something like a "TypeShed" would
require careful consideration and would need at least either a well-defined policy for
managing *Shed rot and the capability to execute it, or a very slick tool / datatype
versioning system with flexible control for users plus an equally slick method for
maintaining implicit conversions between the datatypes in a datatype cluster (ideally
automatically generated).  I think at least the implicit conversion part can be done,
if not in a fully automated manner, then by a combination of policy and engineering.
For policy, you can define, identify, or construct a canonical datatype in each cluster
and require that a contributor of a variant datatype submit methods for implicit
conversion to/from the canonical datatype in that cluster.  One idea that could help
reduce complexity is to place some additional structure on datatypes and take
the canonical datatype for a cluster to be a form of the union (mathematical, not the
"union" from C) of the variants in the cluster, which would simplify implicit
conversions somewhat.  Or, if there's some reason for this, there can also be a set of
"canonical" datatypes for each cluster, so long as they are all guaranteed to be
mutually implicitly convertible.  For a policy to manage *Shed rot, the most direct
approach is to moderate and require approval for each submission, but I could imagine that
responsibility quickly overwhelming the poor team responsible.  Unless I drastically
overestimate the frequency with which submissions might be made (which is entirely
possible), that poor team's operations could wind up looking not unlike the USPTO.
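
The canonical-datatype policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are plain dicts, and the datatype names, field names, and the convert helper are all hypothetical. Each variant supplies one converter to the canonical form and one from it, and every other conversion is routed through the canonical datatype.

```python
# Hypothetical cluster with one canonical datatype and two variants.
CANONICAL = "canonical_table"

# Each variant contributes a converter to the canonical form...
TO_CANONICAL = {
    "variant_a": lambda rec: {"id": rec["name"], "score": rec["score"]},
    "variant_b": lambda rec: {"id": rec["id"], "score": float(rec["value"])},
}

# ...and a converter back from it.
FROM_CANONICAL = {
    "variant_a": lambda rec: {"name": rec["id"], "score": rec["score"]},
    "variant_b": lambda rec: {"id": rec["id"], "value": str(rec["score"])},
}

def convert(record, src, dst):
    """Convert a record between any two datatypes in the cluster.

    Every conversion goes through the canonical datatype, so no variant
    needs to know about any other variant.
    """
    if src == dst:
        return record
    canonical = record if src == CANONICAL else TO_CANONICAL[src](record)
    return canonical if dst == CANONICAL else FROM_CANONICAL[dst](canonical)

print(convert({"name": "x", "score": 1.5}, "variant_a", "variant_b"))
```

With n variants in a cluster this needs 2n converters, where direct pairwise conversion would need n(n-1).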


Anyway, my general point is that there are many non-trivial factors to consider 
in the question of creating a TypeShed.  But, if done right, the benefits could 
be huge, besides the likely awesomeness of the engineering involved.

Finally, let me echo Greg again, and say to please send additional thought and 
feedback.  What do you think about the points I raised?  What else is there to 
consider that hasn't occurred to me yet?  What would be the benefits and 
potential pitfalls?

Best,
Eric

________________________________________
From: galaxy-dev-boun...@lists.bx.psu.edu [galaxy-dev-boun...@lists.bx.psu.edu] 
on behalf of Jim Johnson [j...@umn.edu]
Sent: Friday, October 07, 2011 2:06 PM
To: galaxy-dev@lists.bx.psu.edu
Cc: Greg Von Kuster
Subject: Re: [galaxy-dev] Tool shed and datatypes

Greg,

It would be great if there were a way to expand upon the core datatypes using 
the ToolShed.

Would it be possible to have a separate datatype repository within the ToolShed?

Datatype
    name=""
    description=""
    datatype_dependencies=[]
    definition=<python code>


The tool config could be expanded to have a requirement for datatypes.
     <requirement type="datatype">ssmap</requirement>




Table datatype
   Column      |          Type           |                       Modifiers
---------------+-------------------------+-------------------------------------------------------
 id            | integer                 | not null default nextval('datatype_id_seq'::regclass)
 name          | character varying(255)  |
 version       | character varying(40)   |
 description   | text                    |
 definition    | text                    |
UNIQUE (name)

Table datatype_datatype_association
   Column      |          Type           |                       Modifiers
---------------+-------------------------+-------------------------------------------------------
 id            | integer                 | not null default nextval('datatype_id_seq'::regclass)
 datatype_id   | integer                 |
 requires_id   | integer                 |
FOREIGN KEY (datatype_id) REFERENCES datatype(id)
FOREIGN KEY (requires_id) REFERENCES datatype(id)
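
One use for the association table is to work out the order in which datatypes must be loaded. The sketch below is an assumption about how that could be done, not something proposed in the thread. It uses Python's standard graphlib module (Python 3.9+), and the rows and the install_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Illustrative rows of the association table, written as
# (datatype name, required datatype name) pairs instead of ids.
ASSOCIATIONS = [
    ("ssmap", "tabular"),
    ("tabular", "data"),
]

def install_order(associations):
    """Order datatype names so that each follows everything it requires.

    Raises graphlib.CycleError if the requirements are circular.
    """
    sorter = TopologicalSorter()
    for datatype, requires in associations:
        # add(node, *predecessors): 'requires' must come before 'datatype'.
        sorter.add(datatype, requires)
    return list(sorter.static_order())

print(install_order(ASSOCIATIONS))
```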


Then for my mothur metagenomics tools I could define:

name="ssmap"   description="Secondary Structure Map"  version="1.0"  
datatype_dependencies=[tabular]
definition=
from galaxy.datatypes.tabular import Tabular

class SecondaryStructureMap(Tabular):
    file_ext = 'ssmap'

    def __init__(self, **kwd):
        """Initialize secondary structure map datatype"""
        Tabular.__init__(self, **kwd)
        self.column_names = ['Map']

    def sniff(self, filename):
        """
        Determines whether the file is in secondary structure map format:
        a single column of integer values, each indicating the row that
        this row maps to.
        Check that the map is symmetric: if structMap[10] = 380, then
        structMap[380] = 10.
        """
...
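
The symmetry check described in the docstring could look roughly like the standalone Python sketch below. It is not the elided body of sniff. It assumes rows are numbered from 1 and that a value of 0 marks a row with no partner, neither of which is stated above. The name is_symmetric_map is hypothetical.

```python
def is_symmetric_map(mapping):
    """Return True if every paired row points back at its partner.

    mapping is a list of ints, where mapping[i - 1] is the row that
    row i maps to. Assumptions: rows are numbered from 1, and 0 means
    the row is unpaired.
    """
    n = len(mapping)
    for row, partner in enumerate(mapping, start=1):
        if partner == 0:
            continue  # unpaired row, nothing to verify
        if not 1 <= partner <= n:
            return False  # points outside the file
        if mapping[partner - 1] != row:
            return False  # partner does not point back
    return True

print(is_symmetric_map([2, 1, 0]))
print(is_symmetric_map([2, 3, 1]))
```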




Then the align.check.xml tool_config could require the 'ssmap' datatype:

<tool id="mothur_align_check" name="Align.check" version="1.19.0">
   <description>Calculate the number of potentially misaligned bases</description>
   <requirements>
     <requirement type="binary">mothur</requirement>
     <requirement type="datatype">ssmap</requirement>
   </requirements>

John,

I've been following this message thread, and it seems it's gone in a direction 
that differs from your initial question about the possibility for Galaxy to 
handle automatic editing of the datatypes_conf.xml file when certain Galaxy 
tool shed tools are automatically installed.  There are some complexities to 
consider in attempting this.  One of the issues to consider is that the work 
for adding support for a new datatype to Galaxy lies outside of the intended 
function of the tool shed.  If new support is added to the Galaxy code base, an 
entry for that new datatype should be manually added to the table at the same 
time.  There may be benefits to enabling automatic changes to datatype entries 
that already exist in the file (e.g., adding a new converter for an existing 
datatype entry), but perhaps adding a completely new datatype to the file may 
not be appropriate.  I'll continue to think about this - send additional
thoughts and feedback, as doing so is always helpful.

Thanks!

Greg


On Oct 5, 2011, at 11:48 PM, Duddy, John wrote:

One of the things we're facing is the sheer size of a whole human genome at 30x coverage. 
An effective way to deal with that is by compressing the FASTQ files. That works for BWA 
and our ELAND, which can directly read a compressed FASTQ, but other tools crash when 
reading compressed FASTQ files. One way to address that would be to introduce a new
type, for example "CompressedFastQ", with a conversion to FASTQ defined. BWA 
could take both types as input. This would allow the best of both worlds - efficient 
storage and use by all existing tools.
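
As a rough sketch of what the conversion side could look like, the Python below reads a FASTQ file whether or not it is gzip-compressed. It assumes gzip compression and four-line FASTQ records, neither of which is specified above. The helpers open_fastq and count_reads are hypothetical and not part of Galaxy.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_fastq(path):
    """Open a FASTQ file as text, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")

def count_reads(path):
    """Count reads, assuming each record spans exactly four lines."""
    with open_fastq(path) as handle:
        return sum(1 for _ in handle) // 4
```

A tool that cannot read compressed input would be handed the decompressed stream, while tools such as BWA would receive the compressed file untouched.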

Another example would be adding the CASAVA tools to Galaxy. Some of the 
statistics generation tools use custom file formats. To be able to make the use 
of those tools optional and configurable, they should be separate from the 
aligner, but that would require that Galaxy be made aware of the custom file 
formats - we'd have to add a datatype.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com

From: Greg Von Kuster [mailto:greg at bx.psu.edu]
Sent: Wednesday, October 05, 2011 6:25 PM
To: Duddy, John
Cc: galaxy-dev at lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tool shed and datatypes

Hello John,

The Galaxy tool shed currently is not enabled to automatically edit the 
datatypes_conf.xml file, although I could add this feature if the need exists.  
Can you elaborate on what you are looking to do regarding this?

Thanks!


On Oct 5, 2011, at 1:52 PM, Duddy, John wrote:


Can we introduce new file types via tools in the tool shed? It seems Galaxy can 
load them if they are in the datatypes configuration file. Does tool 
installation automate the editing of that file?


John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jduddy at illumina.com

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Greg Von Kuster
Galaxy Development Team
greg at bx.psu.edu
