On Thu, Oct 25, 2012 at 10:00 AM, Jorrit Boekel
<jorrit.boe...@scilifelab.se> wrote:
> On 10/25/2012 10:54 AM, Peter Cock wrote:
>>
>> On Thu, Oct 25, 2012 at 9:36 AM, Jorrit Boekel
>> <jorrit.boe...@scilifelab.se> wrote:
>>>
>>> Dear list,
>>>
>>> In my galaxy fork, I extensively use the job splitters. Sometimes though,
>>> I have to split to different file types for the same job. That raises an
>>> exception in the lib/galaxy/jobs/splitters/multi.py module.
>>>
>>> I have turned this behaviour off for my own work, but am now wondering
>>> whether this is very bad practice. In other words, does somebody know why
>>> the multi splitter does not support multiple file type splitting?
>>>
>>> cheers,
>>> jorrit
>>
>> Could you clarify what you mean by showing some of your tool's XML
>> file, i.e. how the input and its splitting are defined?
>>
>> Are you asking about splitting two input files at the same time?
>>
>> Peter
>
>
> Hi Peter,
>
> Something like the following:
>
>  <command interpreter="python">bullseye.py $hardklor_results
> $ms2_in.extension $ms2_in $output $use_nonmatch</command>
>  <parallelism method="multi" split_inputs="hardklor_results,ms2_in"
> shared_inputs="config_file" split_mode="from_composite"
> merge_outputs="output"/>
>  <inputs>
>
> The tool takes two datasets of different formats, which are to be split into
> the same number of files; the resulting parts belong together as pairs.

So the inputs are $hardklor_results and $ms2_in (which should be split
in a paired manner) and there is one output, $output, to merge?

What is shared_inputs="config_file" for? It does not appear anywhere
in the <command> tag.

> Note that I have implemented an odd way of splitting, which is from a number
> of files in the dataset.extra_files_path to symlinks in the task working
> dirs. The number of files is thus equal to the number of parts resulting
> from a split, and I have ensured that each part is paired correctly. I
> assume this hasn't been necessary in the genomics field, but for proteomics,
> at least in our lab, multiple-file datasets are the standard.
>
> My fork is at http://bitbucket.org/glormph/adapt if you want to check more
> closely.

I don't quite follow your example, but I can see some (simpler?) cases
for sequencing data - paired splitting of a FASTA + QUAL file, or
paired splitting of two FASTQ files (forward and reverse reads). Here
the sequence files can be broken up into chunks of any size (e.g. split
in four, or divided into batches of 10000 reads, but not split based on
size on disk), as long as the pairing is preserved.

For example, given FASTA and QUAL files for read1, read2, ..., read100000,
if the first FASTA chunk holds read1, read2, ..., read1000, then the first
QUAL chunk must contain the same one thousand reads.
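
To make that concrete, here is a rough Python sketch of count-based
paired splitting. It is not how Galaxy's multi splitter works; the
function names (read_records, paired_chunks) are made up for this
example, and it assumes both files list the same reads in the same
order, so the pairing is purely positional.

```python
from itertools import islice


def read_records(lines):
    """Yield (name, body_lines) for each '>'-headed record (FASTA or QUAL)."""
    name, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, body
            # The read name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            name, body = (fields[0] if fields else ""), []
        elif line and name is not None:
            body.append(line)
    if name is not None:
        yield name, body


def paired_chunks(fasta_lines, qual_lines, chunk_size):
    """Yield (fasta_chunk, qual_chunk) pairs with the same number of records.

    Hypothetical helper: splits by record count, never by size on disk.
    """
    fasta = read_records(fasta_lines)
    qual = read_records(qual_lines)
    while True:
        f_chunk = list(islice(fasta, chunk_size))
        q_chunk = list(islice(qual, chunk_size))
        if not f_chunk and not q_chunk:
            return
        if len(f_chunk) != len(q_chunk):
            raise ValueError("FASTA and QUAL files hold different numbers of reads")
        yield f_chunk, q_chunk
```

Both arguments can be open file handles, so neither file needs to be
held in memory; only one chunk per file is loaded at a time.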

(In these examples the pairing should be verifiable via the read
names, so errors should be easy to catch - I don't know if you have
that luxury in your situation).
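
As a sketch of what such a check could look like (again with made-up
helper names, mate_key and first_mismatch, and assuming forward and
reverse reads are distinguished only by an optional trailing /1 or /2
on the read name):

```python
def mate_key(read_name):
    """Strip a trailing /1 or /2 so forward and reverse names compare equal."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name


def first_mismatch(names_a, names_b):
    """Return the index of the first unpaired position, or None if all pair up."""
    names_a, names_b = list(names_a), list(names_b)
    for index, (a, b) in enumerate(zip(names_a, names_b)):
        if mate_key(a) != mate_key(b):
            return index
    # One list ran out first: the first extra read has no partner.
    if len(names_a) != len(names_b):
        return min(len(names_a), len(names_b))
    return None
```

Running this on the read names of each pair of chunks after a split
would flag the first position where the two files have drifted apart.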

Peter