Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before
reporting this. Thanks and sorry for the noise.

H.

On 09/09/2014 03:09 PM, Valerie Obenchain wrote:
Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was
related to writeVcf(), not the IRanges/S4Vectors stuff. My fault, not yours.

Val

On 09/09/2014 02:47 PM, Hervé Pagès wrote:
Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.

I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of a unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)
   source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : FALSE

Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.

Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.
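
To make the idea concrete, here is a minimal sketch (not code from VariantAnnotation) of what chunk-wise writing could look like: format and append one block of rows at a time so the intermediate strings can be collected between blocks. format_vcf_rows() is a hypothetical stand-in for the package-internal per-record formatting, splitIndices() is assumed to come from base R's 'parallel' package, and header handling is omitted.

    ## Sketch only; format_vcf_rows() is a placeholder, not a real API.
    library(parallel)   # for splitIndices()

    writeVcfChunked <- function(vcf, path, chunk_size = 1e6) {
        con <- file(path, open = "wt")
        on.exit(close(con))
        n <- nrow(vcf)
        for (i in splitIndices(n, max(1L, ceiling(n / chunk_size)))) {
            lines <- format_vcf_rows(vcf[i, ])  # one character vector per chunk
            writeLines(lines, con)              # append this chunk, then move on
        }
        invisible(path)
    }

The point of the design is that only one chunk's worth of intermediate character vectors is alive at any time, so they can be garbage-collected between iterations.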

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:
fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

    The new writeVcf code is in 1.11.28.

    Using the Illumina file you suggested, geno fields only, writing now takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C; however, it looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files with >= .5 million rows and the improvement over writeVcf() in release is ~90%.

    Valerie




    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val


        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff.


            ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
            took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M).

            Looking forward to a new vastly improved writeVcf :).

            ~G


            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

                 Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker.

                 Michael


                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                          The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical use case involves only ~4-6 fields, and is faster but still on the order of dozens of minutes.


                      I think Val is arriving at a (much) more efficient implementation, but...

                      I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF.

                      One efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used. Here's my toy example:

                      time.R
                      ======
                      splitIndices <- function (nx, ncl)
                      {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
                              structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
                          }
                      }

                      x = as.character(seq_len(1e7)); y = sample(x)
                      if (!is.na(Sys.getenv("SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i], sep=":"))
                      } else {
                          system.time(paste(x, y, sep=":"))
                      }


                      running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                      I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs.
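
A minimal sketch of combining that chunking with parallel evaluation via base R's 'parallel' package (whose splitIndices() is the packaged version of the helper in time.R above); chunk and core counts are arbitrary:

    ## Sketch: chunk-wise paste, optionally in parallel, using
    ## parallel::splitIndices() and parallel::mclapply().
    library(parallel)

    x <- as.character(seq_len(1e7)); y <- sample(x)
    idx <- splitIndices(length(x), 20)
    res <- mclapply(idx, function(i) paste(x[i], y[i], sep = ":"), mc.cores = 4)
    length(unlist(res, use.names = FALSE))  # same strings, built in modest pieces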

                     Martin


                         Sorry for the confusion.
                         ~G


                          On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com> wrote:

                              Martin and Val,

                              I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading.

                              It took over an hour to write (3700+ seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data.

                              Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison.

                              ~G


                              On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com> wrote:

                                  Gabe is still testing/profiling, but we'll send something randomized along eventually.


                                  On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

                                      I didn't see in the original thread a reproducible (simulated, I guess) example, to be explicit about what the problem is??

                                      Martin


                                      On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                                          My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.


                                          On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

                                              Hi Gabe,

                                              Martin responded, and so did Michael,



                                              https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

                                              It sounded like Michael was ok with working with/around heap initialization.

                                              Michael, is that right or should we still consider this on the table?


                                              Val


                                              On 08/26/2014 09:34 AM, Gabe Becker wrote:

                                                  Val,

                                                  Has there been any movement on this? This remains a substantial bottleneck for us when writing very large VCF files (e.g. variants+genotypes for whole genome NGS samples).

                                                  I was able to see a ~25% speedup with 4 cores and an "optimal" speedup of ~2x with 10-12 cores for a VCF with 500k rows, using a very naive parallelization strategy and no other changes. I suspect this could be improved on quite a bit, or possibly made irrelevant with judicious use of serial C code.
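
The naive strategy itself isn't spelled out in the thread; purely as an illustration of the shape such an approach could take, here is a hypothetical writeVcfParallel() that writes row chunks to temporary files in parallel and then concatenates them. Names, chunk sizing, and header handling are all assumptions for the sketch.

    ## Hypothetical sketch only -- not the strategy described above. Each
    ## worker writes its own chunk with writeVcf(); the chunk files are then
    ## appended together (note each chunk file would repeat the VCF header).
    library(parallel)
    library(VariantAnnotation)

    writeVcfParallel <- function(vcf, path, cores = 4) {
        idx <- splitIndices(nrow(vcf), cores)
        parts <- mclapply(idx, function(i) {
            tmp <- tempfile(fileext = ".vcf")
            writeVcf(vcf[i, ], tmp)      # write this chunk in a worker process
            tmp
        }, mc.cores = cores)
        file.create(path)
        file.append(path, unlist(parts)) # stitch the chunk files together
        invisible(path)
    }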

                                                  Did you and Martin make any plans regarding optimizing writeVcf?

                                                  Best
                                                  ~G


                                                  On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

                                                        Hi Michael,

                                                        I'm interested in working on this. I'll discuss with Martin next week when we're both back in the office.

                                                        Val





                                                        On 08/05/14 07:46, Michael Lawrence wrote:

                                                            Hi guys (Val, Martin, Herve):

                                                            Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50 million row gVCF, it was taking 2.25 hours prior to yesterday's improvements (pasteCollapseRows) that brought it down to about 1 hour, which is still too long by my standards (> 0). Only takes 3 minutes to call the genotypes (and associated likelihoods etc) from the variant calls (using 80 cores and 450 GB RAM on one node), so the output is an issue. Profiling suggests that the running time scales non-linearly in the number of rows.

                                                            Digging a little deeper, it seems to be something with R's string/memory allocation. Below, pasting 1 million strings takes 6 seconds, but 10 million strings takes over 2 minutes. It gets way worse with 50 million. I suspect it has something to do with R's string hash table.


                                                            set.seed(1000)
                                                            end <- sample(1e8, 1e6)
                                                            system.time(paste0("END", "=", end))
                                                               user  system elapsed
                                                              6.396   0.028   6.420

                                                            end <- sample(1e8, 1e7)
                                                            system.time(paste0("END", "=", end))
                                                               user  system elapsed
                                                            134.714   0.352 134.978

                                                            Indeed, even this takes a long time (in a fresh session):


                                                            set.seed(1000)
                                                            end <- sample(1e8, 1e6)
                                                            end <- sample(1e8, 1e7)
                                                            system.time(as.character(end))
                                                               user  system elapsed
                                                             57.224   0.156  57.366

                                                            But running it a second time is faster (about what one would expect?):


                                                            system.time(levels <- as.character(end))
                                                               user  system elapsed
                                                             23.582   0.021  23.589

                                                            I did some simple profiling of R to find that the resizing of the string hash table is not a significant component of the time. So maybe something to do with the R heap/gc? No time right now to go deeper. But I know Martin likes this sort of thing ;)

                                                            Michael


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
