Hello,

I am writing some code in Galaxy for splitting bams. Up to know, I am
following the ideas that Marco Albuquerque proposed in this thread
http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html
.
He proposed three ways of splitting:

1) by_rname -> splits the bam into files based on the chromosome
2) by_interval -> splits the bam into files based on  a defined bp length,
and does so across the entire genome present in the BAM file
3) by_read -> splits the bam into files based on the number of reads
encountered (if multiple files, all other files match the interval as the
first)

As I think the easiest is the first one, I started with this option.

First of all , I had to change line 82 of
lib/galaxy/jobs/splitters/multi.py as that "if" didn't let the code to
continue (I talked this in another thread).

Next, I had to do some changes in lib/galaxy/datatypes/binary.py. I added a
method "split" that creates the json for the script
extract_dataset_parts.sh. Here, in the next code you can see that I call
samtools -H in order to get the chromosome names,
now I realized that I can get that information directly from metadatas in
the input_datasets variable, so in the future I will change this.

def split(cls, input_datasets, subdir_generator_function, split_params):

        # 1) by_rname -> splits the bam into files based on the chromosome
        # 2) by_interval -> splits the bam into files based on  a defined
bp length, and does so across the entire genome present in the BAM file
        # 3) by_read -> splits the bam into files based on the number of
reads encountered (if multiple files, all other files match the interval as
the first)

        if split_params is None:
            return
        if len(input_datasets) > 1:
            raise Exception("BAM file splitting does not support multiple
files")
        input_file = input_datasets[0].file_name


        if 'split_mode' not in split_params:
            raise Exception('Tool does not define a split mode')
        elif split_params['split_mode'] == 'by_rname':
            log.debug("Attemping to split BAM file %s by chromosome",
input_file)

            #First get bam header
            params = ["samtools", "view", "-H", input_file]
            output = subprocess.Popen( params, stderr=subprocess.PIPE,
stdout=subprocess.PIPE ).communicate()[0]
            output = output.split("\n")
            chrList = []
    #Get chromosome list from the header.
            for line in output:
                fields = line.strip().split("\t")
                if fields[0].startswith("@SQ") and
fields[1].startswith("SN:"):
                    chrList.append(fields[1].split("SN:")[1])

    # Write json for extract_dataset_parts
            for chrName in chrList:
                try:
                    part_dir = subdir_generator_function()
                    base_name = os.path.basename(input_file)
                    part_path = os.path.join(part_dir, base_name)
                    split_data = dict(class_name='%s.%s' % (cls.__module__,
cls.__name__),
                                      output_name=part_path,
                                      input_name=input_file,
                                      args=dict(chromosome=chrName))
                    f = open(os.path.join(part_dir, 'split_info_%s.json' %
base_name), 'w')
                    json.dump(split_data, f)
                    f.close()

                except Exception, e:
                    log.error("Error: " + str(e))
                    raise
        else:
            raise Exception('Unsupported split mode %s' %
split_params['split_mode'])
    split = classmethod(split)

Well, this works correctly and writes the json as expected. Now I have to
write the code that is called by scripts/extract_dataset_part.py (inside of
extract_dataset_parts.sh) "cls.*process_split_file*(data)".
So I created the next two function in the Bam class:

def *process_split_file*(data):
        """
        This is called in the context of an external process launched by a
Task (possibly not on the Galaxy machine)
        to create the input files for the Task. The parameters:
        data - a dict containing the contents of the split file
        """
        args = data['args']
        input_name = data['input_name']
        output_name = data['output_name']
        chromosome = args['chromosome']

        commands = Bam.get_split_commands_chromosome(input_name,
output_name, chromosome)
        for cmd in commands:
            if 0 != os.system(cmd):
                raise Exception("Executing '%s' failed" % cmd)
        return True
process_split_file = staticmethod(process_split_file)

def get_split_commands_chromosome(input_name, output_name, chromosome):
        params = ["samtools view -h " + input_name + " " + output_name + "
" + chromosome]
        return params
get_split_commands_chromosome = staticmethod(get_split_commands_chromosome)

Which is my problem? That I need the .bai related with that dataset
"input_name", I think is in a metadata table, but I don't know how to get
it, Could you please help me with this?
In any case, if you find that I am doing something wrong, or you have a
better idea of implementing this, please don't hesitate to contact me.

Best regards



-- 
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralo...@cipf.es
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Reply via email to