Re: [Genome] broken ENCODE fastq file ?

Ross Lazarus Mon, 15 Nov 2010 14:13:49 -0800

Following up on the broken fastq ENCODE files - both Rep1 and Rep2 fastq
from K562 FAIRE are broken in similar ways - starting around row 86M and 89M
respectively.
They're both about 260M rows and near 7GB unpacked so I wonder if this
problem is associated with 32 bit software breakage somewhere upstream?


Mon,Nov 15 at 2:13pm head
wgEncodeUncFAIREseqRawDataRep1K562V2.fastq_checkfastq.out
### row 86665921 has 35 seq but row 86665923 has 36
qual=?&=@@?BCCCB;AAA;b...@?bbba:?@>@><@@@!
### row 86665925 has 35 seq but row 86665927 has 36
qu...@8=@=...@?==>@8&3...@=>=...@==>?=?>?===?=!
### row 86665929 has 35 seq but row 86665931 has 36 qual=...@89bcc
<4=:=:C?C>6>c...@=9@:@97A962!
### row 86665933 has 35 seq but row 86665935 has 36 qual=...@aa<087@>?>>9>0@
?=B8:99...@8=><%!
### row 86665937 has 35 seq but row 86665939 has 36 qual=5==>>AAC??:9...@bac
<>+8;@B;;B::*7<?BA>!
### row 86665941 has 35 seq but row 86665943 has 36 qual=aa...@=ac?bba@
?BBBB:?...@7:AB>;7?>@!
### row 86665945 has 35 seq but row 86665947 has 36 qual=AAA-,A,@@7BAB,B@
@8...@?<@@@b?...@6b+@!
### row 86665949 has 35 seq but row 86665951 has 36
qual==;>@:6<=<=?8====<==93339===3/38/339!
### row 86665953 has 35 seq but row 86665955 has 36 qual=A@
@5ABBBBCBAACBABBBCABBA<BBA9>9%%%%!
### row 86665957 has 35 seq but row 86665959 has 36 qu...@cb==a;<b...@49@>b...@4
<a...@?8?<A;..6C8?B:!

Mon,Nov 15 at 2:22pm head
wgEncodeUncFAIREseqRawDataRep2K562V2.fastq_checkfastq.out
### row 89288945 has 35 seq but row 89288947 has 36
qual==>>?=>=>>=>>==>==>@=>=>=>=>=>>>>===!
### row 89288949 has 35 seq but row 89288951 has 36
qual=3=====?==>=>==533==>==...@===>//366!
### row 89288953 has 35 seq but row 89288955 has 36
qual=?==?=>=======>>>9=>==>=...@=====>><>!
### row 89288957 has 35 seq but row 89288959 has 36
qual=9>==>=========9====//685533335/8...@!
### row 89288961 has 35 seq but row 89288963 has 36
qual==356>=...@=814==<=<>/7...@93/3=?3/7===!
### row 89288965 has 35 seq but row 89288967 has 36
qual====>====>===>=>=====>@=>>?336313===!
### row 89288969 has 35 seq but row 89288971 has 36
qual=?=>=>>=?=======>?====>=>==>==?9=>==!
### row 89288973 has 35 seq but row 89288975 has 36 qua...@=>9>9:@
=:5/=533)=A99>:?==336/:=?>!
### row 89288977 has 35 seq but row 89288979 has 36
qual==?=?=8>=<>=...@=:=...@=?9@9??<=1/19=!
### row 89288981 has 35 seq but row 89288983 has 36
qual====>>=======>==>=>=633>==?===>==>==!

Here's a slightly modified python script in case anyone wants to use it -
this version can scan a gzip compressed fastq file - hope this helps - it's
much slower but far more disk efficient to scan hundreds of .gz fasta files
in place.

"""
look for bad q/s lengths
Mon,Nov 15 at 7:59am head *.fastq
@HWI-EAS68_6_FC206E3_1_1_118_667


GAAATTATTTTTTCCGAATTGAAGATGAAAATA
+


IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA
@HWI-EAS68_6_FC206E3_1_1_190_636


GAGAACACATTTTCTCACTGTTGAGCTAATAAT
+


IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1CI
@HWI-EAS68_6_FC206E3_1_1_180_558


GTTTCCTAAATTGTAAGTTGAAAGATTTAGAAG

Written to test some ENCODE fastq files downloaded from the UCSC repo

Copyright Ross Lazarus at gmail dot com 2010
Source code licensed under the terms of the LGPL
enjoy...

"""
import os,sys,gzip


slen = None
qlen = None
assert len(sys.argv[1]) > 1, 'Please supply a valid fastq file (.gz or
.fastq) as the first parameter'
assert os.path.isfile(sys.argv[1]), 'Please supply a valid fastq file (.gz
or .fastq) as the first parameter'
fname = sys.argv[1]
ext = os.path.splitext(fname)[-1]
if ext == '.gz':
    f = gzip.open(fname,'r')
else:
    f = open(fname,'r')
outf = file('%s_checkfastq.out' % fname,'w')
rt = ('metad','seq','strand','qual')
i = 0
duds = 0
done = False
while not done:
    row = f.readline()
    if len(row) == 0:
        done = True
        break
    row = row.strip()
    r = rt[i % 4]
    rowl = len(row)
    i += 1
    if r == 'seq':
        slen = rowl
        qlen = None
    elif r == 'qual':
        qlen=rowl
        if slen <> rowl:
            s= '### row %d has %d seq but row %d has %d qual=%s\n' %
(i-2,slen,i,qlen,row) # python 0=row 1
            outf.write(s)
            duds += 1
    if i % 10000000 == 0:
        print 'at row', i, duds, 'dud sequences found so far'
outf.write('\n')
outf.close()
if duds > 0:
    print '%s had %d fastq sequences in error' % (fname,duds)




On Mon, Nov 15, 2010 at 12:09 PM, Ross Lazarus <
[email protected]> wrote:
> I have a question about
>
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeChromatinMap/wgEncodeUncFAIREseqRawDataRep1K562V2.fastq.gz
>
> The file might be corrupt although it gunzipped without error.
>
> There appear to be 41850460 sequences starting at row 89288945 with an
> extra character (always "!") as a bogus 36th quality score for a 35
> character sequence.
> The first broken row quality scores are shown below.
>
>
> ### row 89288945 has 35 seq but row 89288947 has 36
> qual==>>?=>=>>=>>==>==>@=>=>=>=>=>>>>===!
> ### row 89288949 has 35 seq but row 89288951 has 36
> qual=3=====?==>=>==533==>==...@===>//366!
> ### row 89288953 has 35 seq but row 89288955 has 36
> qual=?==?=>=======>>>9=>==>=...@=====>><>!
> ### row 89288957 has 35 seq but row 89288959 has 36
> qual=9>==>=========9====//685533335/8...@!
> ### row 89288961 has 35 seq but row 89288963 has 36
> qual==356>=...@=814==<=<>/7...@93/3=?3/7===!
> ### row 89288965 has 35 seq but row 89288967 has 36
> qual====>====>===>=>=====>@=>>?336313===!
> ### row 89288969 has 35 seq but row 89288971 has 36
> qual=?=>=>>=?=======>?====>=>==>==?9=>==!
> ### row 89288973 has 35 seq but row 89288975 has 36
> qua...@=>9>9:@=:5/=533)=A99>:?==336/:=?>!
> ### row 89288977 has 35 seq but row 89288979 has 36
> qual==?=?=8>=<>=...@=:=...@=?9@9??<=1/19=!
> ### row 89288981 has 35 seq but row 89288983 has 36
> qual====>>=======>==>=>=633>==?===>==>==!
>
> Any suggestions? I'd rather work with a valid fastq rather than just
> truncate those 41M quality scores....
>
> (in case anyone wants it, here's the check script)
> """
> look for bad q/s lengths
> Mon,Nov 15 at 7:59am head *.fastq
> @HWI-EAS68_6_FC206E3_1_1_118_667
> GAAATTATTTTTTCCGAATTGAAGATGAAAATA
> +
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA
> @HWI-EAS68_6_FC206E3_1_1_190_636
> GAGAACACATTTTCTCACTGTTGAGCTAATAAT
> +
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1CI
> @HWI-EAS68_6_FC206E3_1_1_180_558
> GTTTCCTAAATTGTAAGTTGAAAGATTTAGAAG
> """
> import os,sys
>
>
> slen = None
> qlen = None
> assert os.path.isfile(sys.argv[1]), 'Please supply a fastq file path
> as the first parameter'
> f = file(sys.argv[1],'r')
> outf = file('checkfastq.out','w')
> rt = ('metad','seq','strand','qual')
> for i,row in enumerate(f):
>    row = row.strip()
>    r = rt[i % 4]
>    rowl = len(row)
>    if r == 'seq':
>        slen = rowl
>        qlen = None
>    elif r == 'qual':
>        qlen=rowl
>        if slen <> rowl:
>            s= '### row %d has %d seq but row %d had %d qual=%s\n' %
> (i-2,slen,i,qlen,row)
>            outf.write(s)
>    if (i+1) % 10000000 == 0:
>          print 'at row', i
> outf.write('\n')
> outf.close()
>
>
>
> --
> Ross Lazarus MBBS MPH
> Associate Professor, Harvard Medical School
> Director of Bioinformatics, Channing Laboratory
> 181 Longwood Ave., Boston MA 02115, USA.
> Tel: +1 617 505 4850   Fax: +617 525 0958
>



--
Ross Lazarus MBBS MPH
Associate Professor, Harvard Medical School
Director of Bioinformatics, Channing Laboratory
181 Longwood Ave., Boston MA 02115, USA.
Tel: +1 617 505 4850   Fax: +617 525 0958
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] broken ENCODE fastq file ?

Reply via email to