Re: [Genome] broken ENCODE fastq file ?

Katrina Learned Wed, 17 Nov 2010 10:01:48 -0800

Hi Ross,

Thank you very much for bringing this fastq file issue to our attention 
with so much detailed information. We have contacted the submitting lab 
and requested new fastq files. The lab is working on the new files, but 
it looks like we probably won't receive them until close to the end of 
the year. Please feel free to contact us in the new year if you still 
don't see the new files posted.


Thanks!

Katrina Learned
UCSC Genome Bioinformatics Group

Ross Lazarus wrote, On 11/15/10 14:12:
> Following up on the broken fastq ENCODE files - both Rep1 and Rep2 fastq
> from K562 FAIRE are broken in similar ways - starting around row 86M and 89M
> respectively.
> They're both about 260M rows and near 7GB unpacked so I wonder if this
> problem is associated with 32 bit software breakage somewhere upstream?
>
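One way to probe the 32-bit hypothesis in the paragraph above is to find the byte offset at which the first bad record starts and compare it with 2**31 and 2**32. The following is only a minimal sketch: it assumes newline-terminated rows, and the names `offset_of_row` and `open_fastq_binary` are made up for illustration and are not part of the script quoted below.

```python
import gzip
import io


def offset_of_row(stream, target_row):
    """Return the byte offset at which 1-based row `target_row` starts.

    `stream` is a binary file-like object. Returns None when the stream
    has fewer than `target_row` rows.
    """
    offset = 0
    for row_number, line in enumerate(stream, start=1):
        if row_number == target_row:
            return offset
        offset += len(line)  # bytes consumed so far, newline included
    return None


def open_fastq_binary(path):
    """Open a plain or gzip-compressed fastq in binary mode."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


if __name__ == "__main__":
    # Tiny in-memory demonstration; a real run would pass
    # open_fastq_binary(path) and the first row number from the report.
    demo = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII!\n")
    off = offset_of_row(demo, 8)
    print(off, off >= 2**31, off >= 2**32)
```

Whether the reported row numbers are zero-based or one-based shifts the result by at most one row, which does not matter for a comparison against a 2 GB or 4 GB boundary.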
> Mon,Nov 15 at 2:13pm head
> wgEncodeUncFAIREseqRawDataRep1K562V2.fastq_checkfastq.out
> ### row 86665921 has 35 seq but row 86665923 has 36
> qual=?&=@@?BCCCB;AAA;b...@?bbba:?@>@><@@@!
> ### row 86665925 has 35 seq but row 86665927 has 36
> qu...@8=@=...@?==>@8&3...@=>=...@==>?=?>?===?=!
> ### row 86665929 has 35 seq but row 86665931 has 36 qual=...@89bcc
> <4=:=:C?C>6>c...@=9@:@97A962!
> ### row 86665933 has 35 seq but row 86665935 has 36 qual=...@aa<087@>?>>9>0@
> ?=B8:99...@8=><%!
> ### row 86665937 has 35 seq but row 86665939 has 36 qual=5==>>AAC??:9...@bac
> <>+8;@B;;B::*7<?BA>!
> ### row 86665941 has 35 seq but row 86665943 has 36 qual=aa...@=ac?bba@
> ?BBBB:?...@7:AB>;7?>@!
> ### row 86665945 has 35 seq but row 86665947 has 36 qual=AAA-,A,@@7BAB,B@
> @8...@?<@@@b?...@6b+@!
> ### row 86665949 has 35 seq but row 86665951 has 36
> qual==;>@:6<=<=?8====<==93339===3/38/339!
> ### row 86665953 has 35 seq but row 86665955 has 36 qual=A@
> @5ABBBBCBAACBABBBCABBA<BBA9>9%%%%!
> ### row 86665957 has 35 seq but row 86665959 has 36 
> qu...@cb==a;<b...@49@>b...@4
> <a...@?8?<A;..6C8?B:!
>
> Mon,Nov 15 at 2:22pm head
> wgEncodeUncFAIREseqRawDataRep2K562V2.fastq_checkfastq.out
> ### row 89288945 has 35 seq but row 89288947 has 36
> qual==>>?=>=>>=>>==>==>@=>=>=>=>=>>>>===!
> ### row 89288949 has 35 seq but row 89288951 has 36
> qual=3=====?==>=>==533==>==...@===>//366!
> ### row 89288953 has 35 seq but row 89288955 has 36
> qual=?==?=>=======>>>9=>==>=...@=====>><>!
> ### row 89288957 has 35 seq but row 89288959 has 36
> qual=9>==>=========9====//685533335/8...@!
> ### row 89288961 has 35 seq but row 89288963 has 36
> qual==356>=...@=814==<=<>/7...@93/3=?3/7===!
> ### row 89288965 has 35 seq but row 89288967 has 36
> qual====>====>===>=>=====>@=>>?336313===!
> ### row 89288969 has 35 seq but row 89288971 has 36
> qual=?=>=>>=?=======>?====>=>==>==?9=>==!
> ### row 89288973 has 35 seq but row 89288975 has 36 qua...@=>9>9:@
> =:5/=533)=A99>:?==336/:=?>!
> ### row 89288977 has 35 seq but row 89288979 has 36
> qual==?=?=8>=<>=...@=:=...@=?9@9??<=1/19=!
> ### row 89288981 has 35 seq but row 89288983 has 36
> qual====>>=======>==>=>=633>==?===>==>==!
>
> Here's a slightly modified Python script in case anyone wants to use it.
> This version can also scan a gzip-compressed fastq file. Hope this helps:
> scanning is much slower, but it is far more disk efficient to scan hundreds
> of .gz fastq files in place.
>
> """
> look for bad q/s lengths
> Mon,Nov 15 at 7:59am head *.fastq
> @HWI-EAS68_6_FC206E3_1_1_118_667
> GAAATTATTTTTTCCGAATTGAAGATGAAAATA
> +
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA
> @HWI-EAS68_6_FC206E3_1_1_190_636
> GAGAACACATTTTCTCACTGTTGAGCTAATAAT
> +
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1CI
> @HWI-EAS68_6_FC206E3_1_1_180_558
> GTTTCCTAAATTGTAAGTTGAAAGATTTAGAAG
>
> Written to test some ENCODE fastq files downloaded from the UCSC repo
>
> Copyright Ross Lazarus at gmail dot com 2010
> Source code licensed under the terms of the LGPL
> enjoy...
>
> """
> import gzip
> import os
> import sys
>
> # Require a path argument that points at an existing file.
> assert len(sys.argv) > 1, 'Please supply a valid fastq file (.gz or .fastq) as the first parameter'
> assert os.path.isfile(sys.argv[1]), 'Please supply a valid fastq file (.gz or .fastq) as the first parameter'
> fname = sys.argv[1]
> ext = os.path.splitext(fname)[-1]
> if ext == '.gz':
>     f = gzip.open(fname, 'rt')  # text mode, so rows are str rather than bytes
> else:
>     f = open(fname, 'r')
> outf = open('%s_checkfastq.out' % fname, 'w')
> rt = ('metad', 'seq', 'strand', 'qual')  # the four rows of each fastq record
> slen = None
> i = 0
> duds = 0
> for row in f:
>     row = row.strip()
>     r = rt[i % 4]
>     rowl = len(row)
>     i += 1  # python 0=row 1
>     if r == 'seq':
>         slen = rowl
>     elif r == 'qual':
>         if slen != rowl:
>             s = '### row %d has %d seq but row %d has %d qual=%s\n' % (i - 2, slen, i, rowl, row)
>             outf.write(s)
>             duds += 1
>     if i % 10000000 == 0:
>         print('at row', i, duds, 'dud sequences found so far')
> f.close()
> outf.write('\n')
> outf.close()
> if duds > 0:
>     print('%s had %d fastq sequences in error' % (fname, duds))
>
>
>
>
> On Mon, Nov 15, 2010 at 12:09 PM, Ross Lazarus <
> [email protected]> wrote:
>   
>> I have a question about
>>
>>     
> http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeChromatinMap/wgEncodeUncFAIREseqRawDataRep1K562V2.fastq.gz
>   
>> The file might be corrupt although it gunzipped without error.
>>
>> There appear to be 41850460 sequences starting at row 89288945 with an
>> extra character (always "!") as a bogus 36th quality score for a 35
>> character sequence.
>> The first broken row quality scores are shown below.
>>
>>
>> ### row 89288945 has 35 seq but row 89288947 has 36
>> qual==>>?=>=>>=>>==>==>@=>=>=>=>=>>>>===!
>> ### row 89288949 has 35 seq but row 89288951 has 36
>> qual=3=====?==>=>==533==>==...@===>//366!
>> ### row 89288953 has 35 seq but row 89288955 has 36
>> qual=?==?=>=======>>>9=>==>=...@=====>><>!
>> ### row 89288957 has 35 seq but row 89288959 has 36
>> qual=9>==>=========9====//685533335/8...@!
>> ### row 89288961 has 35 seq but row 89288963 has 36
>> qual==356>=...@=814==<=<>/7...@93/3=?3/7===!
>> ### row 89288965 has 35 seq but row 89288967 has 36
>> qual====>====>===>=>=====>@=>>?336313===!
>> ### row 89288969 has 35 seq but row 89288971 has 36
>> qual=?=>=>>=?=======>?====>=>==>==?9=>==!
>> ### row 89288973 has 35 seq but row 89288975 has 36
>> qua...@=>9>9:@=:5/=533)=A99>:?==336/:=?>!
>> ### row 89288977 has 35 seq but row 89288979 has 36
>> qual==?=?=8>=<>=...@=:=...@=?9@9??<=1/19=!
>> ### row 89288981 has 35 seq but row 89288983 has 36
>> qual====>>=======>==>=>=633>==?===>==>==!
>>
>> Any suggestions? I'd prefer to work with a valid fastq rather than just
>> truncate those 41M quality scores....
>>
>> (in case anyone wants it, here's the check script)
>> """
>> look for bad q/s lengths
>> Mon,Nov 15 at 7:59am head *.fastq
>> @HWI-EAS68_6_FC206E3_1_1_118_667
>> GAAATTATTTTTTCCGAATTGAAGATGAAAATA
>> +
>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA
>> @HWI-EAS68_6_FC206E3_1_1_190_636
>> GAGAACACATTTTCTCACTGTTGAGCTAATAAT
>> +
>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1CI
>> @HWI-EAS68_6_FC206E3_1_1_180_558
>> GTTTCCTAAATTGTAAGTTGAAAGATTTAGAAG
>> """
>> import os
>> import sys
>>
>> # Require a path argument that points at an existing file.
>> assert len(sys.argv) > 1 and os.path.isfile(sys.argv[1]), 'Please supply a fastq file path as the first parameter'
>> f = open(sys.argv[1], 'r')
>> outf = open('checkfastq.out', 'w')
>> rt = ('metad', 'seq', 'strand', 'qual')  # the four rows of each fastq record
>> slen = None
>> for i, row in enumerate(f):  # i is the 0-based row index
>>     row = row.strip()
>>     r = rt[i % 4]
>>     rowl = len(row)
>>     if r == 'seq':
>>         slen = rowl
>>     elif r == 'qual':
>>         if slen != rowl:
>>             s = '### row %d has %d seq but row %d has %d qual=%s\n' % (i - 2, slen, i, rowl, row)
>>             outf.write(s)
>>     if (i + 1) % 10000000 == 0:
>>         print('at row', i)
>> f.close()
>> outf.write('\n')
>> outf.close()
>>
>>
>>
>> --
>> Ross Lazarus MBBS MPH
>> Associate Professor, Harvard Medical School
>> Director of Bioinformatics, Channing Laboratory
>> 181 Longwood Ave., Boston MA 02115, USA.
>> Tel: +1 617 505 4850   Fax: +617 525 0958
>>
>>     
>
>
>
> --
> Ross Lazarus MBBS MPH
> Associate Professor, Harvard Medical School
> Director of Bioinformatics, Channing Laboratory
> 181 Longwood Ave., Boston MA 02115, USA.
> Tel: +1 617 505 4850   Fax: +617 525 0958
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>   
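
The workaround mentioned in the original message, truncating the bogus extra quality character, could be sketched roughly as follows. This is only a sketch under assumptions: it assumes strict four-line records, and it trims a quality row only when that row is exactly one character longer than its sequence and ends in "!"; every other row is copied through unchanged. The function name `trim_bogus_quality` is hypothetical. Replacement files from the submitting lab remain the better fix.

```python
import io


def trim_bogus_quality(src, dst):
    """Copy four-line fastq records from text stream src to dst.

    A quality row is shortened by one character only when it is exactly
    one longer than the sequence row and ends in '!'. Returns the number
    of records trimmed.
    """
    trimmed = 0
    seq_len = None
    for index, line in enumerate(src):
        row = line.rstrip("\r\n")
        kind = index % 4  # 0 header, 1 sequence, 2 '+', 3 quality
        if kind == 1:
            seq_len = len(row)
        elif (kind == 3 and seq_len is not None
              and len(row) == seq_len + 1 and row.endswith("!")):
            row = row[:-1]  # drop the extra trailing '!'
            trimmed += 1
        dst.write(row + "\n")
    return trimmed


if __name__ == "__main__":
    # In-memory demonstration with one good and one bad record.
    src = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII!\n")
    dst = io.StringIO()
    print(trim_bogus_quality(src, dst))
    print(dst.getvalue(), end="")
```

For the compressed downloads, the same function should work with `gzip.open(path, "rt")` as the source and `gzip.open(out_path, "wt")` as the destination, so nothing needs to be unpacked on disk.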
