I have a question about
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeChromatinMap/wgEncodeUncFAIREseqRawDataRep1K562V2.fastq.gz
The file might be corrupt although it gunzipped without error.
There appear to be 41850460 sequences starting at row 89288945 with an
extra character (always "!") as a bogus 36th quality score for a 35
character sequence.
The first broken row quality scores are shown below.
### row 89288945 has 35 seq but row 89288947 has 36
qual==>>?=>=>>=>>==>==>@=>=>=>=>=>>>>===!
### row 89288949 has 35 seq but row 89288951 has 36
qual=3=====?==>=>==533==>==...@===>//366!
### row 89288953 has 35 seq but row 89288955 has 36
qual=?==?=>=======>>>9=>==>=...@=====>><>!
### row 89288957 has 35 seq but row 89288959 has 36
qual=9>==>=========9====//685533335/8...@!
### row 89288961 has 35 seq but row 89288963 has 36
qual==356>=...@=814==<=<>/7...@93/3=?3/7===!
### row 89288965 has 35 seq but row 89288967 has 36
qual====>====>===>=>=====>@=>>?336313===!
### row 89288969 has 35 seq but row 89288971 has 36
qual=?=>=>>=?=======>?====>=>==>==?9=>==!
### row 89288973 has 35 seq but row 89288975 has 36
qua...@=>9>9:@=:5/=533)=A99>:?==336/:=?>!
### row 89288977 has 35 seq but row 89288979 has 36
qual==?=?=8>=<>=...@=:=...@=?9@9??<=1/19=!
### row 89288981 has 35 seq but row 89288983 has 36
qual====>>=======>==>=>=633>==?===>==>==!
Any suggestions? I'd rather work with a valid fastq rather than just
truncate those 41M quality scores....
(in case anyone wants it, here's the check script)
"""
look for bad q/s lengths
Mon,Nov 15 at 7:59am head *.fastq
@HWI-EAS68_6_FC206E3_1_1_118_667
GAAATTATTTTTTCCGAATTGAAGATGAAAATA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA
@HWI-EAS68_6_FC206E3_1_1_190_636
GAGAACACATTTTCTCACTGTTGAGCTAATAAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1CI
@HWI-EAS68_6_FC206E3_1_1_180_558
GTTTCCTAAATTGTAAGTTGAAAGATTTAGAAG
"""
import os,sys
slen = None
qlen = None
assert os.path.isfile(sys.argv[1]), 'Please supply a fastq file path
as the first parameter'
f = file(sys.argv[1],'r')
outf = file('checkfastq.out','w')
rt = ('metad','seq','strand','qual')
for i,row in enumerate(f):
row = row.strip()
r = rt[i % 4]
rowl = len(row)
if r == 'seq':
slen = rowl
qlen = None
elif r == 'qual':
qlen=rowl
if slen <> rowl:
s= '### row %d has %d seq but row %d had %d qual=%s\n' %
(i-2,slen,i,qlen,row)
outf.write(s)
if (i+1) % 10000000 == 0:
print 'at row', i
outf.write('\n')
outf.close()
--
Ross Lazarus MBBS MPH
Associate Professor, Harvard Medical School
Director of Bioinformatics, Channing Laboratory
181 Longwood Ave., Boston MA 02115, USA.
Tel: +1 617 505 4850 Fax: +617 525 0958
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome