Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-07 Thread Kelly Vincent
That opening-the-entire-file inefficiency in our sam_to_bam.py was  
fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has  
used getsize since then. It's really just intended to catch weird  
errors that don't throw an actual error (also, I think that if the sam  
file has no header, no info would be output into the bam file). It  
seems somewhat more informative to tell the user the file is empty  
rather than just successfully output an empty file.


Kelly


On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:

Just another example why python's misleadingly simple idioms are  
quite dangerous in production code (couldn't help myself from  
teasing about python... sorry about that).


Seems like line 150 in sam_to_bam.py tries to read the entire BAM  
file into memory just to find out if it's empty or not...


As a stop gap solution with minimal changes, change line 150 from:
   if len( open( tmp_aligns_file_name ).read() ) == 0:
to
   if len( open( tmp_aligns_file_name ).read(10) ) == 0:

Which will read up to the first 10 bytes (instead of the entire file).

A slightly better (but still wrong) solution is to simply check the  
file size, with:

   if os.path.getsize(tmp_aligns_file_name) == 0:

But it's still wrong because even an invalid sam file will create a  
non-empty BAM file (when using samtools view -bt) - the BAM file  
will still contain the chromosome names and sizes.


Example:

$ cat mm9.fa.fai
chr1197195432   6   50  51
chr10   129993255   201139354   50  51
chr11   121843856   333732482   50  51
chr12   121257530   458013223   50  51
chr13   120284312   581695911   50  51
chr13_random400311  704385924   50  51
chr14   125194864   704794249   50  51
chr15   103494974   832493018   50  51
...
...

$ cat 1.sam
Hello World
This is not a SAM file

$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam
[sam_header_read2] 35 sequences loaded.
[sam_read1] reference 'This is not a SAM file' is recognized as '*'.
[main_samview] truncated file.

$ ls -l 1.*
-rw-r--r-- 1 gordon hannon 348 Apr  7 00:57 1.bam
-rw-r--r-- 1 gordon hannon  35 Apr  7 00:57 1.sam



So in short, this whole sam-to-bam wrapper tool is not suitable for  
large SAM files (if they don't fit entirely in memory), and not for  
error checking of invalid SAM files.



-gordon


On 04/07/2011 12:30 AM, Ryan Golhar wrote:

Here's what I get:

(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh
Samtools Version: 0.1.14 (r933:170)
Traceback (most recent call last):
File /home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py, line  
150,

in __main__
if len( open( tmp_aligns_file_name ).read() ) == 0:
MemoryError
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
(galaxy_env)[galaxy@vail pbs]$



On 4/6/11 7:29 PM, Assaf Gordon wrote:

Ryan,

Since we're shooting in the dark here, best to try and understand
what's the exception.

Add the following line to the beginning of sam_to_bam.py:
import traceback

and add the following line to sam_to_bam.py line 156 (before the
call to stop_err):
traceback.print_exc()

Hopefully this will print out which exception you're getting, and
where is it thrown from.

-gordon


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-07 Thread Assaf Gordon
Kelly Vincent wrote, On 04/07/2011 11:35 AM:
 That opening-the-entire-file inefficiency in our sam_to_bam.py was 
 fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has 
 used getsize since then.

Fixed is one thing, pushed to dist is another :)

$ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2
requesting all changes
adding changesets
adding manifests
adding file changes
added 5071 changesets with 21893 changes to 4940 files
updating to branch default
3256 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ cd galaxy_test2/
$ hg tip
changeset:   5070:ca0c4ad2bb39
tag: tip
user:Greg Von Kuster g...@bx.psu.edu
date:Wed Feb 16 10:14:24 2011 -0500
summary: When rendering the contents of a data library, make sure a 
LibraryDataset has an associated LibraryDatasetDatasetAssociation.  There 
should always be an ldda, but some users running their own instances have 
reported that some of their LibraryDatasets are missing these associations.
$ sed -n '148,152p' tools/samtools/sam_to_bam.py 
if returncode != 0:
raise Exception, stderr
if len( open( tmp_aligns_file_name ).read() ) == 0:
raise Exception, 'Initial BAM file empty'
except Exception, e:

It only exists in galaxy-central, not in galaxy-dist which is what you 
recommend people to use.

Both me and Ryan have this line of code, and I'm sure we've both pulled since 
April 2010.

 It's really just intended to catch weird errors that don't throw an
 actual error (also, I think that if the sam file has no header, no
 info would be output into the bam file).

Nope - look at my example below - an invalid sam file still generates a 
non-empty BAM file.
This happens because of the -t parameter: samtools takes the header 
information from another file and generates the BAM file before even reading 
the SAM file.

 It seems somewhat more
 informative to tell the user the file is empty rather than just
 successfully output an empty file.
 

Depends on your definition of informative - in this case the error message 
was not so informative at all, to a point a traceback was needed...


-gordon

 
 
 On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
 
 Just another example why python's misleadingly simple idioms are 
 quite dangerous in production code (couldn't help myself from 
 teasing about python... sorry about that).
 
 Seems like line 150 in sam_to_bam.py tries to read the entire
 BAM file into memory just to find out if it's empty or not...
 
 As a stop gap solution with minimal changes, change line 150 from:
  if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( 
 open( tmp_aligns_file_name ).read(10) ) == 0:
 
 Which will read up to the first 10 bytes (instead of the entire 
 file).
 
 A slightly better (but still wrong) solution is to simply check
 the file size, with: if os.path.getsize(tmp_aligns_file_name) ==
 0:
 
 But it's still wrong because even an invalid sam file will create
 a non-empty BAM file (when using samtools view -bt) - the BAM
 file will still contain the chromosome names and sizes.
 
 Example:  $ cat mm9.fa.fai chr1197195432   6 50
 51 chr10   129993255   201139354   50  51 chr11 
 121843856   333732482   50  51 chr12   121257530 
 458013223   50  51 chr13   120284312   581695911 50
 51 chr13_random400311  704385924   50  51 chr14 
 125194864   704794249   50  51 chr15   103494974 
 832493018   50  51 ... ...
 
 $ cat 1.sam Hello World This is not a SAM file
 
 $ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2]
 35 sequences loaded. [sam_read1] reference 'This is not a SAM file'
 is recognized as '*'. [main_samview] truncated file.
 
 $ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr  7 00:57 1.bam 
 -rw-r--r-- 1 gordon hannon  35 Apr  7 00:57 1.sam 
 
 
 So in short, this whole sam-to-bam wrapper tool is not suitable
 for large SAM files (if they don't fit entirely in memory), and not
 for error checking of invalid SAM files.
 
 
 -gordon
 
 
 On 04/07/2011 12:30 AM, Ryan Golhar wrote:
 Here's what I get:
 
 (galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 
 0.1.14 (r933:170) Traceback (most recent call last): File 
 /home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py, line 
 150, in __main__ if len( open( tmp_aligns_file_name ).read() )
 == 0: MemoryError Error extracting alignments from 
 (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), 
 (galaxy_env)[galaxy@vail pbs]$
 
 
 
 On 4/6/11 7:29 PM, Assaf Gordon wrote:
 Ryan,
 
 Since we're shooting in the dark here, best to try and 
 understand what's the exception.
 
 Add the following line to the beginning of sam_to_bam.py: 
 import traceback
 
 and add the following line to sam_to_bam.py line 156 (before 
 the call to stop_err): traceback.print_exc()
 
 Hopefully this will print out which exception you're getting, 
 and where is 

Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-07 Thread Ryan Golhar
I really need to update my local copy.  I must admit, I'm a bit afraid 
to do so.


On 4/7/11 11:35 AM, Kelly Vincent wrote:

That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed
quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used
getsize since then. It's really just intended to catch weird errors that
don't throw an actual error (also, I think that if the sam file has no
header, no info would be output into the bam file). It seems somewhat
more informative to tell the user the file is empty rather than just
successfully output an empty file.

Kelly


On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:


Just another example why python's misleadingly simple idioms are quite
dangerous in production code (couldn't help myself from teasing about
python... sorry about that).

Seems like line 150 in sam_to_bam.py tries to read the entire BAM
file into memory just to find out if it's empty or not...

As a stop gap solution with minimal changes, change line 150 from:
if len( open( tmp_aligns_file_name ).read() ) == 0:
to
if len( open( tmp_aligns_file_name ).read(10) ) == 0:

Which will read up to the first 10 bytes (instead of the entire file).

A slightly better (but still wrong) solution is to simply check the
file size, with:
if os.path.getsize(tmp_aligns_file_name) == 0:

But it's still wrong because even an invalid sam file will create a
non-empty BAM file (when using samtools view -bt) - the BAM file
will still contain the chromosome names and sizes.

Example:

$ cat mm9.fa.fai
chr1 197195432 6 50 51
chr10 129993255 201139354 50 51
chr11 121843856 333732482 50 51
chr12 121257530 458013223 50 51
chr13 120284312 581695911 50 51
chr13_random 400311 704385924 50 51
chr14 125194864 704794249 50 51
chr15 103494974 832493018 50 51
...
...

$ cat 1.sam
Hello World
This is not a SAM file

$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam
[sam_header_read2] 35 sequences loaded.
[sam_read1] reference 'This is not a SAM file' is recognized as '*'.
[main_samview] truncated file.

$ ls -l 1.*
-rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam
-rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam



So in short, this whole sam-to-bam wrapper tool is not suitable for
large SAM files (if they don't fit entirely in memory), and not for
error checking of invalid SAM files.


-gordon


On 04/07/2011 12:30 AM, Ryan Golhar wrote:

Here's what I get:

(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh
Samtools Version: 0.1.14 (r933:170)
Traceback (most recent call last):
File /home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py, line 150,
in __main__
if len( open( tmp_aligns_file_name ).read() ) == 0:
MemoryError
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
(galaxy_env)[galaxy@vail pbs]$



On 4/6/11 7:29 PM, Assaf Gordon wrote:

Ryan,

Since we're shooting in the dark here, best to try and understand
what's the exception.

Add the following line to the beginning of sam_to_bam.py:
import traceback

and add the following line to sam_to_bam.py line 156 (before the
call to stop_err):
traceback.print_exc()

Hopefully this will print out which exception you're getting, and
where is it thrown from.

-gordon


___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/





--
CONFIDENTIALITY NOTICE: This email communication may contain private, 
confidential, or legally privileged information intended for the sole 
use of the designated and/or duly authorized recipient(s). If you are 
not the intended recipient or have received this email in error, please 
notify the sender immediately by email and permanently delete all copies 
of this email including all attachments without reading them. If you are 
the intended recipient, secure the contents in a manner that conforms to 
all applicable state and/or federal requirements related to privacy and 
confidentiality of such information.
attachment: golharam.vcf___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-07 Thread Ryan Golhar
Speaking of which, the Wiki site hasn't been update to reflect that step 
1 uses hg clone https://bitbucket.org/galaxy/galaxy-dist/;


and step 5 moves to galaxy-central.  Is this supposed to be the case?


On 4/7/11 12:13 PM, Assaf Gordon wrote:

Kelly Vincent wrote, On 04/07/2011 11:35 AM:

That opening-the-entire-file inefficiency in our sam_to_bam.py was
fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has
used getsize since then.


Fixed is one thing, pushed to dist is another :)

$ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2
requesting all changes
adding changesets
adding manifests
adding file changes
added 5071 changesets with 21893 changes to 4940 files
updating to branch default
3256 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ cd galaxy_test2/
$ hg tip
changeset:   5070:ca0c4ad2bb39
tag: tip
user:Greg Von Kusterg...@bx.psu.edu
date:Wed Feb 16 10:14:24 2011 -0500
summary: When rendering the contents of a data library, make sure a 
LibraryDataset has an associated LibraryDatasetDatasetAssociation.  There 
should always be an ldda, but some users running their own instances have 
reported that some of their LibraryDatasets are missing these associations.
$ sed -n '148,152p' tools/samtools/sam_to_bam.py
 if returncode != 0:
 raise Exception, stderr
 if len( open( tmp_aligns_file_name ).read() ) == 0:
 raise Exception, 'Initial BAM file empty'
 except Exception, e:

It only exists in galaxy-central, not in galaxy-dist which is what you 
recommend people to use.

Both me and Ryan have this line of code, and I'm sure we've both pulled since 
April 2010.


It's really just intended to catch weird errors that don't throw an
actual error (also, I think that if the sam file has no header, no
info would be output into the bam file).


Nope - look at my example below - an invalid sam file still generates a 
non-empty BAM file.
This happens because of the -t parameter: samtools takes the header 
information from another file and generates the BAM file before even reading the SAM file.


It seems somewhat more
informative to tell the user the file is empty rather than just
successfully output an empty file.



Depends on your definition of informative - in this case the error message 
was not so informative at all, to a point a traceback was needed...


-gordon




On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:


Just another example why python's misleadingly simple idioms are
quite dangerous in production code (couldn't help myself from
teasing about python... sorry about that).

Seems like line 150 in sam_to_bam.py tries to read the entire
BAM file into memory just to find out if it's empty or not...

As a stop gap solution with minimal changes, change line 150 from:
  if len( open( tmp_aligns_file_name ).read() ) == 0: to if len(
open( tmp_aligns_file_name ).read(10) ) == 0:

Which will read up to the first 10 bytes (instead of the entire
file).

A slightly better (but still wrong) solution is to simply check
the file size, with: if os.path.getsize(tmp_aligns_file_name) ==
0:

But it's still wrong because even an invalid sam file will create
a non-empty BAM file (when using samtools view -bt) - the BAM
file will still contain the chromosome names and sizes.

Example:  $ cat mm9.fa.fai chr1197195432   6 50
51 chr10   129993255   201139354   50  51 chr11
121843856   333732482   50  51 chr12   121257530
458013223   50  51 chr13   120284312   581695911 50
51 chr13_random400311  704385924   50  51 chr14
125194864   704794249   50  51 chr15   103494974
832493018   50  51 ... ...

$ cat 1.sam Hello World This is not a SAM file

$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2]
35 sequences loaded. [sam_read1] reference 'This is not a SAM file'
is recognized as '*'. [main_samview] truncated file.

$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr  7 00:57 1.bam
-rw-r--r-- 1 gordon hannon  35 Apr  7 00:57 1.sam 


So in short, this whole sam-to-bam wrapper tool is not suitable
for large SAM files (if they don't fit entirely in memory), and not
for error checking of invalid SAM files.


-gordon


On 04/07/2011 12:30 AM, Ryan Golhar wrote:

Here's what I get:

(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version:
0.1.14 (r933:170) Traceback (most recent call last): File
/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py, line
150, in __main__ if len( open( tmp_aligns_file_name ).read() )
== 0: MemoryError Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
(galaxy_env)[galaxy@vail pbs]$



On 4/6/11 7:29 PM, Assaf Gordon wrote:

Ryan,

Since we're shooting in the dark here, best to try and
understand what's the exception.

Add the following line to the beginning of sam_to_bam.py:
import traceback

and add the following line to 

Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-06 Thread Ryan Golhar
Any ideas why I would get this?  If I run the sam_to_bam python script 
from the shell, I get the same error:


(galaxy_env)[galaxy@vail pbs]$ sh 471.sh
Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8
6_64 x86_64 x86_64 GNU/Linux
Samtools Version: 0.1.14 (r933:170)
Error extracting alignments from 
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),


However running the samtools command works fine

On 4/5/11 5:58 PM, Ryan Golhar wrote:

I've performed an alignment using BWA on a file of paired-end illumina
reads. The SAM file looks fine, and contains header information. I'm
converting it to BAM using the sam to bam converter, however it
consistently errors out after running for a while. The error is:

Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), 

but no error is provided. Looking at the sam_to_bam.py on line 156 is
where the error is thrown. Nothing is in e (I think).

BTW - If I run the samtools command from the shell by hand, the BAM file
is created properly. I do see information on stderr:

$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o
/tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat
[samopen] SAM header is present: 25 sequences.

I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.

What do I do?

Ryan
___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-06 Thread Ryan Golhar
So it looks like I can get small sam files converted to bam files, but 
not large sam files (~50GB-80GB).  I'm still trying to debug this, but 
not sure what's going on.


Has anyone else run into anything like this?


On 4/6/11 10:08 AM, Ryan Golhar wrote:

Any ideas why I would get this? If I run the sam_to_bam python script
from the shell, I get the same error:

(galaxy_env)[galaxy@vail pbs]$ sh 471.sh
Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8
6_64 x86_64 x86_64 GNU/Linux
Samtools Version: 0.1.14 (r933:170)
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),

However running the samtools command works fine

On 4/5/11 5:58 PM, Ryan Golhar wrote:

I've performed an alignment using BWA on a file of paired-end illumina
reads. The SAM file looks fine, and contains header information. I'm
converting it to BAM using the sam to bam converter, however it
consistently errors out after running for a while. The error is:

Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), 

but no error is provided. Looking at the sam_to_bam.py on line 156 is
where the error is thrown. Nothing is in e (I think).

BTW - If I run the samtools command from the shell by hand, the BAM file
is created properly. I do see information on stderr:

$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o
/tmp/killme.bam
/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat
[samopen] SAM header is present: 25 sequences.

I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.

What do I do?

Ryan
___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-06 Thread Ryan Golhar

Alright, I'm at a loss

I can run the sam to bam converter on a small sam file but not a big sam 
file.  The small SAM file is only 65K, the big SAM file is 44G.  I have 
more than 8TB of free space.


Running the job script from the shell results in the small conversion 
succeeding and the big one failing.  The return code from samtools in 
both instances in 0 so I can't for any reason think of why there the 
script is getting caught in an exception.


I even added a write statement to stdout to double-check the return code 
and stderr message and they are the same in both cases.


Why is this failing in one case and not the other?  I'm stuck.  Help

Ryan

On 4/6/11 4:58 PM, Ryan Golhar wrote:

So it looks like I can get small sam files converted to bam files, but
not large sam files (~50GB-80GB). I'm still trying to debug this, but
not sure what's going on.

Has anyone else run into anything like this?


On 4/6/11 10:08 AM, Ryan Golhar wrote:

Any ideas why I would get this? If I run the sam_to_bam python script
from the shell, I get the same error:

(galaxy_env)[galaxy@vail pbs]$ sh 471.sh
Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8
6_64 x86_64 x86_64 GNU/Linux
Samtools Version: 0.1.14 (r933:170)
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),

However running the samtools command works fine

On 4/5/11 5:58 PM, Ryan Golhar wrote:

I've performed an alignment using BWA on a file of paired-end illumina
reads. The SAM file looks fine, and contains header information. I'm
converting it to BAM using the sam to bam converter, however it
consistently errors out after running for a while. The error is:

Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), 

but no error is provided. Looking at the sam_to_bam.py on line 156 is
where the error is thrown. Nothing is in e (I think).

BTW - If I run the samtools command from the shell by hand, the BAM file
is created properly. I do see information on stderr:

$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o
/tmp/killme.bam
/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat
[samopen] SAM header is present: 25 sequences.

I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.

What do I do?

Ryan
___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-06 Thread Assaf Gordon

Just another example why python's misleadingly simple idioms are quite 
dangerous in production code (couldn't help myself from teasing about python... 
sorry about that).

Seems like line 150 in sam_to_bam.py tries to read the entire BAM file into 
memory just to find out if it's empty or not...

As a stop gap solution with minimal changes, change line 150 from:
if len( open( tmp_aligns_file_name ).read() ) == 0:
to
if len( open( tmp_aligns_file_name ).read(10) ) == 0:

Which will read up to the first 10 bytes (instead of the entire file).

A slightly better (but still wrong) solution is to simply check the file size, 
with:
if os.path.getsize(tmp_aligns_file_name) == 0:

But it's still wrong because even an invalid sam file will create a non-empty BAM file 
(when using samtools view -bt) - the BAM file will still contain the 
chromosome names and sizes.

Example:

$ cat mm9.fa.fai
chr1197195432   6   50  51
chr10   129993255   201139354   50  51
chr11   121843856   333732482   50  51
chr12   121257530   458013223   50  51
chr13   120284312   581695911   50  51
chr13_random400311  704385924   50  51
chr14   125194864   704794249   50  51
chr15   103494974   832493018   50  51
...
...

$ cat 1.sam
Hello World
This is not a SAM file

$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam
[sam_header_read2] 35 sequences loaded.
[sam_read1] reference 'This is not a SAM file' is recognized as '*'.
[main_samview] truncated file.

$ ls -l 1.*
-rw-r--r-- 1 gordon hannon 348 Apr  7 00:57 1.bam
-rw-r--r-- 1 gordon hannon  35 Apr  7 00:57 1.sam



So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM 
files (if they don't fit entirely in memory), and not for error checking of 
invalid SAM files.


-gordon


On 04/07/2011 12:30 AM, Ryan Golhar wrote:

Here's what I get:

(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh
Samtools Version: 0.1.14 (r933:170)
Traceback (most recent call last):
File /home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py, line 150,
in __main__
if len( open( tmp_aligns_file_name ).read() ) == 0:
MemoryError
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
(galaxy_env)[galaxy@vail pbs]$



On 4/6/11 7:29 PM, Assaf Gordon wrote:

Ryan,

Since we're shooting in the dark here, best to try and understand
what's the exception.

Add the following line to the beginning of sam_to_bam.py:
import traceback

and add the following line to sam_to_bam.py line 156 (before the
call to stop_err):
traceback.print_exc()

Hopefully this will print out which exception you're getting, and
where is it thrown from.

-gordon


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-dev] samtools sam-to-bam problem

2011-04-06 Thread Assaf Gordon

shameless plug
If your sam file already contains header lines, you can use our version of the 
sam-to-bam wrapper.
It works without python and without writing a temporary (non-sorted) bam file 
to disk.
Not so fast, and with minimal error checking - but it mostly works.

http://cancan.cshl.edu/labmembers/gordon/files/cshl_sam_to_bam.tar.bz2
/shameless plug

On 04/07/2011 01:05 AM, Assaf Gordon wrote:

Just another example why python's misleadingly simple idioms are quite
dangerous in production code (couldn't help myself from teasing about
python... sorry about that).

Seems like line 150 in sam_to_bam.py tries to read the entire BAM file
into memory just to find out if it's empty or not...

As a stop gap solution with minimal changes, change line 150 from:
if len( open( tmp_aligns_file_name ).read() ) == 0:
to
if len( open( tmp_aligns_file_name ).read(10) ) == 0:

Which will read up to the first 10 bytes (instead of the entire file).

A slightly better (but still wrong) solution is to simply check the file
size, with:
if os.path.getsize(tmp_aligns_file_name) == 0:

But it's still wrong because even an invalid sam file will create a
non-empty BAM file (when using samtools view -bt) - the BAM file will
still contain the chromosome names and sizes.

Example:

$ cat mm9.fa.fai
chr1 197195432 6 50 51
chr10 129993255 201139354 50 51
chr11 121843856 333732482 50 51
chr12 121257530 458013223 50 51
chr13 120284312 581695911 50 51
chr13_random 400311 704385924 50 51
chr14 125194864 704794249 50 51
chr15 103494974 832493018 50 51
...
...

$ cat 1.sam
Hello World
This is not a SAM file

$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam
[sam_header_read2] 35 sequences loaded.
[sam_read1] reference 'This is not a SAM file' is recognized as '*'.
[main_samview] truncated file.

$ ls -l 1.*
-rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam
-rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam



So in short, this whole sam-to-bam wrapper tool is not suitable for
large SAM files (if they don't fit entirely in memory), and not for
error checking of invalid SAM files.


-gordon


On 04/07/2011 12:30 AM, Ryan Golhar wrote:

Here's what I get:

(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh
Samtools Version: 0.1.14 (r933:170)
Traceback (most recent call last):
File /home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py, line 150,
in __main__
if len( open( tmp_aligns_file_name ).read() ) == 0:
MemoryError
Error extracting alignments from
(/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
(galaxy_env)[galaxy@vail pbs]$



On 4/6/11 7:29 PM, Assaf Gordon wrote:

Ryan,

Since we're shooting in the dark here, best to try and understand
what's the exception.

Add the following line to the beginning of sam_to_bam.py:
import traceback

and add the following line to sam_to_bam.py line 156 (before the
call to stop_err):
traceback.print_exc()

Hopefully this will print out which exception you're getting, and
where is it thrown from.

-gordon


___
Please keep all replies on the list by using reply all
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/