Hello Simon,

zimoun <zimon.touto...@gmail.com> writes:

> Hi,
>
> I am asking if it should be possible to optionally stream the
> inputs/outputs when the workflow is processed without writing the
> intermediate files on disk.
>
> Well, a workflow is basically:
>  - some process units (or task or rule) that take inputs (file) and
> produce outputs (other file)
>  - a graph that describes the relationship of theses units.
>
> The simplest workflow is:
>     x --A--> y --B--> z
>  - process A: input file x, output file y
>  - process B: input file y, output file z
>
> Currently, the file y is written on disk by A then read by B. Which
> leads to IO inefficiency. Especially when the file is large. And/or
> when there is several same kind of unit done in parallel.
>
>
> Should be a good idea to have something like the shell pipe `|` to
> compose the process unit ?
> If yes how ? I have no clue where to look...

That's an interesting idea.  Of course, you could literally use the
shell pipe within a single process.  And I think this makes sense, because
if a shell pipe is beneficial in your situation, then it is likely to be
beneficial to run the two programs connected by the pipe on a single
computer / in a single job.

Here's an example:
(define-public A
  (process
    (name "A")
    (package-inputs (list samtools gzip))
    (data-inputs "/tmp/sample.sam")
    (outputs "/tmp/sample.sam.gz")
    (procedure
     #~(system (string-append "samtools view " #$data-inputs
                              " | gzip -c > " #$outputs)))))

> I agree that the storage of intermediate files avoid to compute again
> and again unmodified part of the workflow. In this saves time when
> developing the workflow.
> However, the storage of temporary files appears unnecessary once the
> workflow is done and when it does not need to run on cluster.

I think it's either an efficient data transfer (using a pipe), or
writing to disk in between for better restore points.  We cannot have
both.  The former can already be achieved with the shell pipe, and the
latter can be achieved by writing two processes.

Maybe we can come up with a convenient way to combine two processes
using a shell pipe.  But this needs more thought!

If you have an idea to improve on this, please do share. :-)

> Thank you for all the work about the Guix ecosystem.
>
> All the best,
> simon

Thanks!

Kind regards,
Roel Janssen

Reply via email to