On Wednesday, 23 August 2017 at 13:06:36 UTC, Steven
Schveighoffer wrote:
On 8/23/17 5:53 AM, biocyberman wrote:
[...]
I'll respond to all your questions with what I would do,
instead of answering each one.
I would suggest an approach similar to how I approached parsing
JSON data. In your case, the protocol is even simpler, as there
is no nesting.
1. The base layer iopipe should be something that tokenizes the
input into reference-based structs. If you look at the
jsoniopipe library (https://github.com/schveiguy/jsoniopipe),
you can see that the lowest level finds the start of the next
JSON token. In your case, it should be looking for
>[...]
This code is pretty straightforward, and roughly corresponds to
this:
    while(cannot find start sequence in stream)
        stream.extend;
Make sure you aren't redoing work that has already been done
(i.e. save the last place you looked).
Once you have this, you can deduce each packet by the data
between the starts.
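
A rough sketch of that loop in D, under some loud assumptions:
ChunkedStream is a made-up stand-in that only mimics the
window/extend idea (it is not the iopipe API), findStart is a
hypothetical helper, and a record start is taken to be '>' at the
beginning of a line. The pos argument is the "last place you
looked", so bytes that were already scanned are never scanned again.

```d
import std.algorithm.comparison : min;

// Hypothetical stand-in for a buffered stream: data shows up a few bytes at a time.
struct ChunkedStream
{
    string source;     // everything that will eventually arrive
    size_t chunk = 4;  // bytes made visible per extend() call
    size_t visible;    // bytes currently buffered

    // The data buffered so far.
    string window() { return source[0 .. visible]; }

    // Ask for more data; returns the number of new bytes (0 at end of stream).
    size_t extend()
    {
        immutable n = min(chunk, source.length - visible);
        visible += n;
        return n;
    }
}

// Advance pos to the next record start. Returns false if the stream ends first.
// pos persists between calls, so only unseen data is examined.
bool findStart(Stream)(ref Stream s, ref size_t pos)
{
    for (;;)
    {
        auto w = s.window;
        for (; pos < w.length; ++pos)
        {
            // '>' counts only at the very start or right after a newline.
            if (w[pos] == '>' && (pos == 0 || w[pos - 1] == '\n'))
                return true;
        }
        // Everything before pos has been checked; fetch more and continue from pos.
        if (s.extend() == 0)
            return false;
    }
}

void main()
{
    auto s = ChunkedStream(">id1\nACGT\n>id2\nGG\n");
    size_t pos = 0;
    assert(findStart(s, pos) && pos == 0);   // first record
    ++pos;                                   // step past the marker just found
    assert(findStart(s, pos) && pos == 10);  // second record
    ++pos;
    assert(!findStart(s, pos));              // no further records
}
```

The data between two consecutive positions returned by findStart is
one packet.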
2. The next layer should validate and parse the data into
structs that contain referencing data from the buffer. I
recommend not using actual ranges from the buffer, but
information on how to build the ranges. The reason for this is
that the buffer can move while being streamed by iopipe, so
your data could become invalid if you take actual references to
the buffer. If you look in the jsoniopipe library, the struct
for storing a json item has a start and length, but not a
reference to the buffer.
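
To illustrate the start/length idea, here is a minimal sketch.
FastaItem and parseItem are hypothetical names, not taken from
jsoniopipe, and the offsets are assumed to be relative to the start
of the current window (so they would need adjusting if data were
released from the front). The item stores only positions, and the
slices are rebuilt from whatever the buffer currently is.

```d
// Hypothetical record descriptor: positions only, no slices of the buffer.
struct FastaItem
{
    size_t headerStart; // index just past '>'
    size_t headerLen;
    size_t seqStart;    // first byte after the header line
    size_t seqLen;      // raw bytes up to the record end, newlines included

    // Build the slices on demand from the buffer as it is right now.
    const(char)[] header(const(char)[] window) const
    {
        return window[headerStart .. headerStart + headerLen];
    }

    const(char)[] sequence(const(char)[] window) const
    {
        return window[seqStart .. seqStart + seqLen];
    }
}

// Describe the record occupying window[start .. end]; window[start] must be '>'.
FastaItem parseItem(const(char)[] window, size_t start, size_t end)
{
    assert(window[start] == '>');
    size_t eol = start + 1;
    while (eol < end && window[eol] != '\n')
        ++eol;

    FastaItem item;
    item.headerStart = start + 1;
    item.headerLen = eol - (start + 1);
    item.seqStart = eol < end ? eol + 1 : end;
    item.seqLen = end - item.seqStart;
    return item;
}

void main()
{
    char[] buf = ">id1\nACGT\n".dup;
    auto item = parseItem(buf, 0, buf.length);

    // Simulate the buffer moving: the data now lives in a new array
    // and the old memory no longer holds it.
    char[] moved = buf.dup;
    buf[] = 'x';

    // The item is still usable because it never referenced the old array.
    assert(item.header(moved) == "id1");
    assert(item.sequence(moved) == "ACGT\n");
}
```

Had the item stored slices of buf, both would now read back as
"xxx..." instead of the record.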
Potentially, you could take this mechanism and build an iopipe
on top of the buffered data. This iopipe's elements would be
the items themselves, with the underlying buffer hidden in the
implementation details. Extending would parse out another set
of items, releasing would allow those items to get reclaimed
(and the underlying stream data).
This is something I actually wanted to explore with jsoniopipe
but didn't have time before the conference. I probably will
still build it.
3. Build your real code on top of that layer. What do you want
to do with the data? The easiest thing to do for a proof of
concept is to build a range out of the functions. That allows
you to test performance with your lower layers. One of the
awesome things about iopipe is that testing correctness is really
easy -- every string is also an iopipe :)
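
As a proof-of-concept sketch of such a range, the following works
directly on a string, which is also what makes it easy to test with
literals. FastaRecords and Record are hypothetical names; the sketch
assumes FASTA records start with '>' at the beginning of a line,
joins multi-line sequences, and does no validation. It uses plain
slicing rather than the iopipe API.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;
import std.array : replace;

struct Record
{
    string header;   // header line without the leading '>'
    string sequence; // sequence lines joined, newlines removed
}

// Hypothetical input range over the records of an in-memory FASTA string.
struct FastaRecords
{
    private string data; // remaining input; starts at a '>' unless empty

    this(string input)
    {
        // Skip anything before the first record start.
        size_t i = 0;
        while (i < input.length
            && !(input[i] == '>' && (i == 0 || input[i - 1] == '\n')))
            ++i;
        data = input[i .. $];
    }

    @property bool empty() { return data.length == 0; }

    @property Record front()
    {
        immutable end = recordEnd();
        size_t eol = 1;
        while (eol < end && data[eol] != '\n')
            ++eol;
        auto seq = eol < end ? data[eol + 1 .. end] : "";
        return Record(data[1 .. eol], seq.replace("\n", ""));
    }

    void popFront() { data = data[recordEnd() .. $]; }

    // Index just past the current record: the next line-initial '>' or end of data.
    private size_t recordEnd()
    {
        size_t i = 1;
        while (i < data.length && !(data[i] == '>' && data[i - 1] == '\n'))
            ++i;
        return i;
    }
}

void main()
{
    auto recs = FastaRecords(">id1 first\nACGT\nTTAA\n>id2\nGG\n");
    assert(recs.map!(r => r.header).equal(["id1 first", "id2"]));
    assert(recs.map!(r => r.sequence).equal(["ACGTTTAA", "GG"]));
}
```

Because it is an ordinary input range, it composes with std.algorithm
(map, filter and so on), which is handy for quick correctness and
performance checks.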
I actually worked with a person at DConf on a similar (maybe
identical?) format and explained how it could be done in a very
similar way. He was looking to remove data that had a low
percentage of correctness (or something like that; I'm not in
bioinformatics, so I don't understand the real semantics).
With this mechanism in hand, the decompression is pretty easy
to chain together with whatever actual stream you have, just
use iopipe.zip.
Good luck, and email me if you need more help
(schvei...@yahoo.com).
-Steve
Hi Nic and Steve
Thank you both very much for your input. I am trying to make use
of it. I will try to adapt jsoniopipe for FASTA. This is the
ongoing, currently broken code: https://github.com/biocyberman/fastaq .
PRs are welcome.
@Nic:
I am also very interested in bringing D to bioinformatics. I will
be happy to share the information I have. Feel free to email me at
vql(.at.)rn.dk and we can talk further about it.
@Steve: Yes, we talked at DConf 2017. I had to do other things, so
my D learning slowed down. I am trying the FASTA format first before
jumping to FASTQ again. jsoniopipe is a full-featured and
relatively small project, which makes it a good case study.
However, there are some aspects I still haven't fully understood.
Would I be lucky enough to have you make the currently broken
fastaq code work? :) That would definitely save me time and the
headache of dealing with newbie problems.