[jira] [Work logged] (BEAM-10670) Make non-portable Splittable DoFn the only option when executing Java "Read" transforms

ASF GitHub Bot (Jira) Mon, 10 May 2021 10:41:32 -0700


     [ https://issues.apache.org/jira/browse/BEAM-10670?focusedWorklogId=594134&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-594134 ]


ASF GitHub Bot logged work on BEAM-10670:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/May/21 17:40
            Start Date: 10/May/21 17:40
    Worklog Time Spent: 10m 
      Work Description: iemejia commented on pull request #14755:
URL: https://github.com/apache/beam/pull/14755#issuecomment-837026342


   > Is there any typical source where you notice the regression? 
   
   We have received reports for multiple sources, in particular bounded 
file-based sources. I was surprised to see this even with ParquetIO (whose 
read is itself SDF-based): simply adding `--experiments=use_deprecated_read` 
gave us a performance improvement of at least 20% in our benchmarks (TPC-DS 
query 3 on the Spark runner).
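   
   For anyone who wants to run the same pipeline with and without the flag, 
here is a minimal sketch of how the arguments for the two runs could be built. 
`ExperimentArgs` and `withExperiment` are hypothetical names, not part of Beam; 
the sketch only assumes that the resulting array is handed to 
`PipelineOptionsFactory.fromArgs(...)` as usual and that experiments can be 
given as a comma-separated list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Beam) for A/B runs with and without an experiment. */
final class ExperimentArgs {
  private static final String PREFIX = "--experiments=";

  private ExperimentArgs() {}

  /** Returns a copy of args with the experiment enabled, merging into an existing flag. */
  static String[] withExperiment(String[] args, String experiment) {
    List<String> out = new ArrayList<>(Arrays.asList(args));
    for (int i = 0; i < out.size(); i++) {
      String arg = out.get(i);
      if (!arg.startsWith(PREFIX)) {
        continue;
      }
      String value = arg.substring(PREFIX.length());
      if (value.isEmpty()) {
        // Flag present but empty: just set the experiment.
        out.set(i, PREFIX + experiment);
      } else if (!Arrays.asList(value.split(",")).contains(experiment)) {
        // Flag present with other experiments: append to the comma-separated list.
        out.set(i, arg + "," + experiment);
      }
      return out.toArray(new String[0]);
    }
    // No --experiments flag yet: add one.
    out.add(PREFIX + experiment);
    return out.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Same arguments twice; only the second run uses the deprecated Read translation.
    String[] sdfRun = args;
    String[] deprecatedReadRun = withExperiment(args, "use_deprecated_read");
    System.out.println("SDF read:        " + Arrays.toString(sdfRun));
    System.out.println("Deprecated read: " + Arrays.toString(deprecatedReadRun));
  }
}
```

   Merging into an existing `--experiments=` value keeps any other experiments 
already set for the run, so the two runs differ only in `use_deprecated_read`.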
   
   You can easily reproduce this by running pipelines on large amounts of 
data. With tiny inputs the difference is too small to notice, and SDF is 
sometimes even faster (e.g. in the Nexmark CI tests).
   
   I tried to reproduce the bounded performance regression on Dataflow, but 
there I do not see any significant, consistent performance difference with or 
without `use_deprecated_read`.
   
   > And SDF read is default only for Spark Streaming, I'm curious about what 
kind of performance we are talking about here. Is it the throughput per second 
or watermark lag?
   
   Spark Streaming still uses the Read.Unbounded path; it has not yet been 
migrated to SDF. Luke was working on this, but it was not finished when he 
left. For reference, see #13101.
   
   The regression for unbounded reads (the perceived delay in receiving 
messages) on the Direct Runner, reported by @steveniemitz, is probably 
sufficient reason to revert SDF as the default for the Direct Runner. If we 
add the performance issues 
[reported on Flink](https://the-asf.slack.com/archives/C9H0YNP3P/p1607057900393900) 
as well, I think we must return everything to the traditional Read-based 
translation until we have consistent results.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 594134)
    Time Spent: 39h 40m  (was: 39.5h)

> Make non-portable Splittable DoFn the only option when executing Java "Read" 
> transforms
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-10670
>                 URL: https://issues.apache.org/jira/browse/BEAM-10670
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>            Reporter: Luke Cwik
>            Assignee: Ismaël Mejía
>            Priority: P1
>              Labels: Clarified
>             Fix For: 2.30.0
>
>          Time Spent: 39h 40m
>  Remaining Estimate: 0h
>
> All runners seem to be capable of migrating to splittable DoFn for 
> non-portable execution except for Dataflow runner v1 which will internalize 
> the current primitive read implementation that is shared across runner 
> implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
