Hey Steve -

  About sketching out some work on defining output collections this way:

  It took a while, but I just got this pull request merged, which
includes some fixes and examples for discovering datasets with
galaxy.json. I also added some planemo documentation about this topic.
I think there still needs to be more documentation and more examples,
but hopefully it is a start if you want to explore using galaxy.json
to implement something that describes output collections.

There are a number of formats that could work - maybe looking for
lines in galaxy.json of the format:

{"output_name": "out1", "collection_type": "list", "elements":
[{"identifier": "a1", "path": "workdir/a1.fastq", "format":
"fastqsanger", "dbkey": "hg19"}, ...]}

Things like format and dbkey could take their defaults from the output
element description in the tool XML, and so be entirely optional in
galaxy.json. "output_name" could reference the output name in the tool
XML to match this entry up to an entry described in the XML.
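
To make the idea concrete, here is a rough sketch of how a tool script
might emit such a line. It assumes the line format proposed above, which
Galaxy does not consume today; the helper names (element,
collection_entry, append_entry) are made up for illustration.

```python
import json


def element(identifier, path, **optional):
    """Describe one collection element; optional keys (format, dbkey) are
    omitted when unset so that defaults from the tool XML could apply."""
    entry = {"identifier": identifier, "path": path}
    entry.update({k: v for k, v in optional.items() if v is not None})
    return entry


def collection_entry(output_name, collection_type, elements):
    """Build one line in the proposed (hypothetical) galaxy.json format."""
    return {
        "output_name": output_name,  # assumed to match the output name in the tool XML
        "collection_type": collection_type,
        "elements": list(elements),
    }


def append_entry(entry, galaxy_json_path="galaxy.json"):
    """Append the entry as a single JSON line to galaxy.json."""
    with open(galaxy_json_path, "a") as handle:
        handle.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    entry = collection_entry(
        "out1",
        "list",
        [element("a1", "workdir/a1.fastq", format="fastqsanger", dbkey="hg19")],
    )
    print(json.dumps(entry))
```

Writing one JSON object per line keeps the file appendable, so a script
can add entries as it produces files.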

If you are mostly concerned with consuming collections this way, it
might be even easier. I'd add, say, an index_file="input.json" to the
param element in the tool XML and dump out the structure of the
collection as JSON (maybe mirroring the API description, but with
path elements). Here is a PR that adds some syntax to write JSON
descriptions out to the tool working directory; it may be a template
to follow.

 - https://github.com/galaxyproject/galaxy/pull/1405
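
On the script side, consuming such a file could look roughly like the
sketch below. Everything about the file layout is an assumption: the
index_file attribute does not exist yet, and the keys used here
("collection_type", "elements", "element_identifier", "path") are only
my guess at what a dump mirroring the API description would contain.

```python
import json


def parse_collection(description):
    """Return (identifier, path) pairs from a flat list collection.

    The dictionary layout is hypothetical; nested collection types
    (for example list:paired) are not handled in this sketch.
    """
    if description.get("collection_type") != "list":
        raise ValueError("this sketch only handles flat 'list' collections")
    return [
        (item["element_identifier"], item["path"])
        for item in description["elements"]
    ]


def load_collection(index_file="input.json"):
    """Read the hypothetical index file from the tool working directory."""
    with open(index_file) as handle:
        return parse_collection(json.load(handle))
```

With something like this, the script gets identifiers and paths already
paired up, instead of receiving two comma-joined command-line arguments
and matching them by position.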


On Wed, Jul 27, 2016 at 2:58 AM, Steve Cassidy <steve.cass...@mq.edu.au> wrote:
> Hi John,
>  thanks for the response. So based on your updated documentation I’ve 
> modified my script to take the identifiers as a second argument and with a 
> bit of juggling I now have the command line:
>         query_textgrids.py --textgrid "${",".join(map(str, $textgrid))}" 
> --identifier "${",".join(map(str, [t.element_identifier for t in 
> $textgrid]))}" --tier $tier --regex '$regex' --output_path $output
> Note the juggling needed with list comprehension to get the list of 
> identifiers from the textgrid argument.  This works ok and I can now get a 
> result from my tool that includes the identifier:
> start   end     duration        label   identifier
> 0.59    0.83    0.24    6:      1_1308_1_2_092-ch6-speaker16.TextGrid
> 1.56    1.77    0.21    I@      1_1308_1_2_094-ch6-speaker16.TextGrid
> 1.64    1.87    0.23    3:      1_1308_1_2_173-ch6-speaker16.TextGrid
> in fact I’ll probably take the .TextGrid off the identifier so that it just 
> names the recording I’m working with.
> What I’d like to do now is to write another tool that takes this as input 
> along with another dataset collection whose elements also have similar (or 
> the same) identifiers but a different type (they will be acoustic features 
> derived from an audio file).  I think I can see how to do this, the input to 
> this tool will be similar to query_textgrids above and I’ll work through the 
> identifiers and the table together.
> I saw your note on the issue re. galaxy.json and took a look in the sources 
> for it, so this is a secret way of communicating dataset metadata back to the 
> system? Sounds like it might be useful.  I may be able to get someone to work 
> on this so if you have time to elaborate your ideas then please go ahead.
> It seems that if I was to write my python script in the .xml file (as 
> cheetah) I’d get access to a bunch of things that are opaque to a separate 
> script. Would it be a useful goal to have a richer galaxy-tool interface that 
> could make all information available to the tool wrapper visible to my Python 
> script? One way to do that would just be to bundle everything up in JSON and 
> send it to the script.
> Again, thanks for the help.
> Steve
>> On 27 Jul 2016, at 12:29 AM, John Chilton <jmchil...@gmail.com> wrote:
>> Thanks for the questions - I have tried to revise the planemo docs to
>> be more explicit about what collection identifiers are and where they
>> come from 
>> (https://github.com/galaxyproject/planemo/commit/a811e652f23d31682f862f858dc792c1ef5a99ce).
>> http://planemo.readthedocs.io/en/latest/writing_advanced.html#collections
>> I think this might be a case where I'm too close to the problem - I
>> implemented collections, the tooling around them, and planemo docs so
>> there is probably a lot that I just assume is implicit when it is
>> completely non-obvious.
>> The collection identifier in your case is going to be something like:
>> 1_1308_1_2_092-ch6-speaker16. The designation in the previous step -
>> the one outputting a collection with discovered datasets - if it is
>> producing a collection should actually be called "identifier". The
>> terms "designation" and "identifier" are interchangeable from a
>> Galaxy perspective - but I prefer using the term "identifier" for
>> collections and the older "designation" for discovered, un-collected
>> individual datasets.
>> There was a little warning explaining the odd whitespace replacement
>> stuff that got shifted around at some point in the planemo docs - I
>> think I have corrected that now. The explanation for fixing up the
>> identifier was this:
>> "Here we are rewriting the element identifiers to assure everything is
>> safe to put on the command-line. In the future, collections will not
>> be able to contain keys that are potentially harmful and this won't be
>> necessary."
>> So yes this is the name you are after.
>> As for the question, "Is a manifest-based approach a silly idea?" -
>> not at all, not in the least. I'd prefer to have both options
>> available - this current option is nice because it doesn't require a
>> "wrapper" script - you can build command lines and such from the
>> cheetah template - but definitely people already working with and
>> thinking about collections from inside some sort of script should have
>> the option to consume and produce manifests of files.
>> I've created an issue for this idea here -
>> https://github.com/galaxyproject/galaxy/issues/2658. I'm not sure if
>> I'll have time to get to it anytime soon - but if you or someone else
>> is eager to tackle the problem I could scope out an implementation
>> plan for this.
>> Thanks for the e-mail and I hope this helps,
>> -John
>> On Tue, Jul 26, 2016 at 1:59 AM, Steve Cassidy <steve.cass...@mq.edu.au> 
>> wrote:
>>> Hi all,
>>>   I’m staring at the discussion of handling dataset collections:
>>> http://planemo.readthedocs.io/en/latest/_writing_collections.html
>>> but failing to see the solution to my problem.
>>> I have a tool that creates a dataset collection, a group of files with names
>>> like 1_1308_1_2_092-ch6-speaker16.TextGrid where the 1_1308_1_2_092 part is
>>> a unique identifier that I’d like to keep track of.  I’ve used a
>>> discover_datasets tag in the tool xml file to match my output filenames and
>>> extract the designation (1_1308_1_2_092-ch6-speaker16.TextGrid) and the ext
>>> (TextGrid).
>>> I have another tool that runs a query over these files and generates a
>>> single tabular result that will ideally include the identifier in some form.
>>> Here’s the command section for that tool:
>>>        query_textgrids.py --textgrid "${",".join(map(str, $textgrid))}"
>>> --tier $tier --regex '$regex' --output_path $output
>>> where ‘$textgrid’ is one of my input parameters that has multiple=“true” set
>>> so that it can be a dataset collection.  That works ok but the input I get
>>> are the filenames (dataset_1.dat, etc.) not the name of the datasets.
>>> The page above mentions something called the ‘element_identifier’ and gives
>>> this funky example:
>>> merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}"
>>> --file "$input" --to $output;
>>> I can’t see what this element_identifier thing is - the suggestion is that
>>> it might be the dataset name, but I’m not sure.  Also I don’t understand why
>>> the command above is doing replacement of whitespace with underscores.
>>> If this is the name I’m after, it would seem that I’d need to pass these
>>> names along with the textgrid files and then pair them up inside my script -
>>> is that what I need to do?
>>> All of this cries out to me for a more explicit representation of a dataset
>>> collection that my tool can create and consume rather than this hacky
>>> treatment of filenames.   If I could generate a manifest file of some kind
>>> describing my dataset collection then none of this parsing of filenames
>>> would be needed.  I could also consume the manifest file as well and it
>>> could be used for collection level metadata.  Is this a silly idea?
>>> Anyway, any help with my immediate problem would be appreciated.
>>> Thanks,
>>> Steve
