Hi all,
I've been playing around with an SFF file, parsing it based on the file
format definition at NCBI.
I attached the output, which includes the common header,
the read header and the read data section for the first read
of that file.
I'm happy to answer questions on how the file format works
(including the undocumented index block which I had to reverse
engineer).
Yes, I would like to know how that works.
index_magic_number:778921588 .mft
version:1.00
Couldn't find anything about ".mft" version 1.
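For what it's worth, both magic numbers in the output look like four ASCII
bytes packed into one big-endian 32-bit integer. A minimal Java sketch of that
decoding follows; the class and method names are made up for illustration and
are not part of BioJava.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing BioJava class.
class MagicNumber {
    /** Interprets a 32-bit big-endian magic number as four ASCII characters. */
    static String toAscii(int magic) {
        // ByteBuffer writes in big-endian order by default.
        byte[] bytes = ByteBuffer.allocate(4).putInt(magic).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(toAscii(779314790)); // magic_number from the common header
        System.out.println(toAscii(778921588)); // index_magic_number
    }
}
```

This only explains the ".sff" and ".mft" labels printed next to the numbers; it says nothing about the layout of the index block itself.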
At the moment I have two classes: sffParser and sffFile.
My idea was that sffParser can hold one or multiple sff files. Each
instance of sffFile has a hashtable with the read identifiers as keys and
the file pointers stored as the values.
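The per-file lookup described above could be sketched roughly like this. It is
only an illustration: SffIndex and its methods are hypothetical names, and the
offset 440 assumes the first read header starts right after the 440-byte
common header.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-file index: read name -> byte offset of its read header.
class SffIndex {
    private final Map<String, Long> offsets = new HashMap<>();

    /** Remembers where the read header of the given read starts in the file. */
    void put(String readName, long fileOffset) {
        offsets.put(readName, fileOffset);
    }

    /** Returns the stored offset, or -1 if the read name is unknown. */
    long offsetOf(String readName) {
        Long offset = offsets.get(readName);
        return offset == null ? -1L : offset;
    }

    int size() {
        return offsets.size();
    }

    public static void main(String[] args) {
        SffIndex index = new SffIndex();
        // Assumption: the first read follows the common header directly.
        index.put("EV5RTWS02JXUUH", 440L);
        System.out.println(index.offsetOf("EV5RTWS02JXUUH"));
    }
}
```

Offsets are kept as long because index_offset is a 64-bit field in the common header. With a java.io.RandomAccessFile, seek(index.offsetOf(name)) would then jump to the read.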
Now I would like to find a good representation of a single "read"
object, which should be accessible with an identifier like EV5RTWS02JXUUH.
At the moment I'm making use of the BigInteger class to store many variables,
but that's probably a waste of memory.
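One way to avoid BigInteger, assuming the fields are the unsigned big-endian
8, 16 and 32 bit integers I read out of the NCBI definition, is to mask them
into the next larger Java primitive. The helper below is a hypothetical sketch,
not existing BioJava code.

```java
import java.nio.ByteBuffer;

// Hypothetical helpers for unsigned big-endian fields; names are illustrative.
class Unsigned {
    /** Reads one byte and returns it as a value in 0..255. */
    static int uint8(ByteBuffer buf) {
        return buf.get() & 0xFF;
    }

    /** Reads two bytes and returns them as a value in 0..65535. */
    static int uint16(ByteBuffer buf) {
        return buf.getShort() & 0xFFFF;
    }

    /** Reads four bytes and returns them as a value in 0..4294967295. */
    static long uint32(ByteBuffer buf) {
        return buf.getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // 0x01B8 is 440, the header_length value from the output.
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {0x01, (byte) 0xB8});
        System.out.println(uint16(buf));
    }
}
```

With this, int covers the 8 and 16 bit fields and long covers the 32 bit ones. A 64-bit index_offset fits in a long for any file Java can seek in.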
The variables for the read object I'm thinking of:
Read Header Section:
read_header_length -> int
name_length -> int
number_of_bases -> int
clip_qual_left -> int
clip_qual_right -> int
clip_adapter_left -> int
clip_adapter_right -> int
name -> string
Read Data Section:
flowgram_values -> float[]
flow_index_per_base -> int[]
bases -> string
quality_scores -> int[]
But I'm not very familiar with the existing data structures of BioJava.
Is there maybe already something similar existing? Please comment.
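To make the field list above concrete, here is one possible shape for the read
object. It is a sketch under assumptions: SffRead is a hypothetical class,
flowgram values are assumed to be stored as 16-bit integers in hundredths
(flowgram_format_code 1), and the length fields are left out because they can
be derived from the name and the bases.

```java
// Hypothetical read object; not an existing BioJava class.
class SffRead {
    private final String name;
    private final int clipQualLeft, clipQualRight;
    private final int clipAdapterLeft, clipAdapterRight;
    private final short[] flowgramValues;  // raw 16-bit values, assumed hundredths
    private final byte[] flowIndexPerBase; // raw 8-bit increments
    private final String bases;
    private final byte[] qualityScores;    // raw 8-bit scores

    // Arrays are stored as given (not copied) to keep the sketch short.
    SffRead(String name, int clipQualLeft, int clipQualRight,
            int clipAdapterLeft, int clipAdapterRight,
            short[] flowgramValues, byte[] flowIndexPerBase,
            String bases, byte[] qualityScores) {
        this.name = name;
        this.clipQualLeft = clipQualLeft;
        this.clipQualRight = clipQualRight;
        this.clipAdapterLeft = clipAdapterLeft;
        this.clipAdapterRight = clipAdapterRight;
        this.flowgramValues = flowgramValues;
        this.flowIndexPerBase = flowIndexPerBase;
        this.bases = bases;
        this.qualityScores = qualityScores;
    }

    String getName() { return name; }
    String getBases() { return bases; }
    int getNumberOfBases() { return bases.length(); }
    int getClipQualLeft() { return clipQualLeft; }
    int getClipQualRight() { return clipQualRight; }
    int getClipAdapterLeft() { return clipAdapterLeft; }
    int getClipAdapterRight() { return clipAdapterRight; }

    /** Converts the raw unsigned 16-bit value to a float, e.g. 105 to 1.05. */
    float getFlowgramValue(int flow) {
        return (flowgramValues[flow] & 0xFFFF) / 100.0f;
    }

    /** Unsigned flow index increment for the given base. */
    int getFlowIndex(int base) { return flowIndexPerBase[base] & 0xFF; }

    /** Unsigned quality score for the given base. */
    int getQuality(int base) { return qualityScores[base] & 0xFF; }
}
```

Keeping the raw short[] and byte[] arrays and converting on access needs far less memory per read than float[] and int[], which matters with 141371 reads of 400 flows each.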
Cheers,
Charles
Parsing /home/charlie/Desktop/EV5RTWS02.sff
Length: 235403092
=> Common Header Section
magic_number:779314790 .sff
version:0001
index_offset:232575224
index_length:2827868
number_of_reads:141371
header_length:440
key_length:4
number_of_flows_per_read:400
flowgram_format_code:1
flow_chars:TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
key_sequence:TCAG
index_magic_number:778921588 .mft
version:1.00
=> Scanning file...
reads hashed: 141371
=> Read Header Section
read_header_length: 32
name_length: 14
number_of_bases: 197
clip_qual_left: 5
clip_qual_right: 188
clip_adapter_left: 0
clip_adapter_right: 0
name: EV5RTWS02JXUUH
=> Read Data Section
flowgram_values:
1.05 0.0 1.03 0.0 0.0 0.97 0.0 0.99 1.13 0.25 0.76 0.15 0.1 0.89 0.14 0.94 0.96
0.17 0.04 1.84 1.18 0.91 0.08 0.21 1.06 0.18 0.98 0.06 0.17 1.73 1.14 1.0 0.13
0.3 1.12 0.13 0.23 0.85 0.31 0.91 0.12 1.1 0.09 1.08 1.01 1.86 0.12 0.98 0.23
0.16 1.01 0.22 0.04 1.06 0.14 1.0 1.0 0.1 0.09 1.95 0.89 1.04 0.11 0.26 1.01
0.3 1.02 0.18 0.8 1.83 1.16 1.2 0.19 0.13 0.92 0.23 0.2 1.02 0.14 0.99 0.23
0.98 0.1 1.02 1.15 1.89 0.19 0.92 0.07 0.19 0.93 0.19 0.06 0.97 0.15 1.03 1.1
0.09 0.23 1.87 0.85 1.16 0.05 0.12 0.95 0.13 0.96 0.13 0.06 1.94 1.04 1.04 0.22
0.03 1.01 0.1 0.12 1.03 0.12 1.11 0.04 1.02 0.07 1.0 1.03 1.82 0.07 1.03 0.18
0.06 1.1 0.08 0.11 1.0 0.09 0.86 0.92 0.22 0.15 1.75 0.91 1.37 0.14 0.0 1.21
0.34 1.09 0.12 0.25 2.37 1.04 1.16 0.13 0.1 0.86 0.13 0.25 1.01 0.04 1.15 0.1
0.94 0.09 1.07 1.03 1.74 0.1 0.85 0.08 0.0 0.99 0.0 0.12 0.9 0.04 0.89 1.01
0.08 0.1 2.16 0.94 1.25 0.03 0.0 1.37 0.14 1.0 0.15 0.18 2.04 0.97 1.2 0.21
0.09 0.81 0.06 0.19 1.01 0.04 1.24 0.06 1.02 0.23 0.7 1.12 1.87 0.11 0.84 0.0
0.0 1.01 0.0 0.16 1.15 0.08 1.06 1.11 0.06 0.09 1.77 1.01 1.13 0.08 0.0 0.81
0.18 0.96 0.14 0.15 1.86 0.96 1.05 0.16 0.04 0.88 0.0 0.09 1.32 0.09 1.12 0.27
0.92 0.11 1.08 1.12 1.94 0.1 0.89 0.08 0.0 1.12 0.05 0.21 1.1 0.13 0.88 0.96
0.03 0.07 2.04 0.91 1.06 0.14 0.0 0.76 0.17 1.05 0.13 0.14 2.01 1.05 1.11 0.2
0.0 1.02 0.0 0.16 1.09 0.05 1.27 0.07 0.92 0.08 0.89 1.08 1.87 0.25 0.74 0.0
0.11 1.05 0.06 0.23 2.01 1.03 0.03 1.2 0.0 0.02 1.04 0.0 1.01 0.05 0.06 1.06
0.1 0.2 1.95 0.03 0.11 1.14 0.78 0.16 0.24 1.05 1.01 0.13 1.22 0.08 2.86 0.23
0.81 0.02 1.05 0.13 0.04 1.11 0.0 0.13 0.41 0.01 0.1 0.17 0.37 0.18 0.22 0.15
0.26 0.14 0.26 0.18 0.21 0.15 0.24 0.16 0.16 0.11 0.26 0.14 0.15 0.1 0.22 0.12
0.15 0.11 0.15 0.1 0.14 0.12 0.16 0.09 0.14 0.11 0.15 0.1 0.23 0.09 0.18 0.09
0.25 0.09 0.17 0.12 0.17 0.1 0.14 0.09 0.13 0.08 0.15 0.06 0.11 0.1 0.15 0.07
0.09 0.1 0.16 0.03 0.09 0.09 0.17 0.04 0.1 0.07 0.16 0.04 0.1 0.05 0.19
flow_index_per_base:
1 2 3 2 1 2 3 2 1 3 0 1 1 3 2 3 0 1 1 3 3 2 2 2 1 1 0 2 3 3 2 1 3 0 1 1 3 2 2 1
0 1 1 3 3 2 2 2 1 1 0 2 3 3 2 1 3 0 1 1 3 2 3 0 1 1 3 3 2 2 2 1 1 0 2 3 3 2 1 3
0 1 1 3 2 3 0 1 1 3 3 2 2 2 1 1 0 2 3 3 2 1 3 0 1 1 3 2 3 0 1 1 3 3 2 2 2 1 1 0
2 3 3 2 1 3 0 1 1 3 2 3 0 1 1 3 3 2 2 2 1 1 0 2 3 3 2 1 3 0 1 1 3 2 3 0 1 1 3 3
2 2 2 1 1 0 2 3 3 0 1 2 3 2 3 3 0 3 1 3 1 2 2 0 0 2 2 3 2 5 4 4 4 4 4 12 4
bases:
TCAGTCAGTGGTATCAACGCAGAGTAAGCAGTGGTATCTAACGCAGAGTAAGCAGTGGTATCAACGCAGAGTAAGCAGTGGTATCAACGCAGAGTAAGCAGTGGTATCAACGCAGAGTAAGCAGTGGTATCAACGCAGAGTAAGCAGTGGTATCAACGCAGAGTAAGCAACTGATGGCGCGAGGGAGCNNNNNNNNN
quality_scores:
27 27 26 27 24 16 24 25 26 27 21 22 24 27 26 24 17 23 27 24 22
24 26 26 27 28 22 26 27 26 27 27 30 25 24 27 27 27 19 27 21 23 21 25 27 27 26
27 23 29 24 25 25 26 27 26 28 22 22 22 26 26 30 25 27 27 27 27 25 27 27 27 27
21 27 25 27 22 25 25 18 24 15 20 26 27 24 27 23 22 27 23 25 26 27 24 18 22 27
24 24 27 30 27 25 18 15 27 31 27 26 21 19 27 19 27 14 24 28 22 21 27 23 27 25
25 19 27 24 20 26 28 22 26 27 23 16 24 25 26 24 30 25 24 25 25 23 26 31 27 24
27 17 27 31 26 27 25 27 26 18 25 24 26 28 22 16 27 31 26 27 21 27 27 27 30 25
24 19 27 27 19 30 28 11 20 27 25 0 0 0 0 0 0 0 0 0
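As a rough guide to reading the output above: the flow_index_per_base values
appear to be increments, so a running sum gives the 1-based flow each base was
called from. The sketch below uses hypothetical names and assumes the clip
values are 1-based and inclusive, with 0 meaning "not set".

```java
// Hypothetical helpers for interpreting one read; names are illustrative.
class FlowPositions {
    /** Running sum of the per-base increments: 1-based flow position per base. */
    static int[] toFlowPositions(int[] flowIndexPerBase) {
        int[] positions = new int[flowIndexPerBase.length];
        int current = 0;
        for (int i = 0; i < flowIndexPerBase.length; i++) {
            current += flowIndexPerBase[i];
            positions[i] = current;
        }
        return positions;
    }

    /** Looks up the nucleotide flowed at each position in flow_chars. */
    static String basesFromFlows(int[] positions, String flowChars) {
        StringBuilder sb = new StringBuilder();
        for (int p : positions) {
            sb.append(flowChars.charAt(p - 1)); // positions are 1-based
        }
        return sb.toString();
    }

    /** Applies quality clipping; assumes 1-based inclusive bounds, 0 = unset. */
    static String clipQuality(String bases, int clipQualLeft, int clipQualRight) {
        int from = clipQualLeft == 0 ? 1 : clipQualLeft;
        int to = clipQualRight == 0 ? bases.length() : clipQualRight;
        return bases.substring(from - 1, to);
    }

    public static void main(String[] args) {
        // First four increments of the read above, with two cycles of flow_chars.
        int[] positions = toFlowPositions(new int[] {1, 2, 3, 2});
        System.out.println(basesFromFlows(positions, "TACGTACG"));
    }
}
```

For the first four bases the increments 1 2 3 2 give flows 1, 3, 6 and 8, which spell out the key sequence TCAG on the TACG flow cycle.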