[ https://issues.apache.org/jira/browse/BEAM-6670?focusedWorklogId=215477&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-215477 ]
ASF GitHub Bot logged work on BEAM-6670:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Mar/19 13:59
Start Date: 19/Mar/19 13:59
Worklog Time Spent: 10m
Work Description: Noctune commented on issue #7956: [BEAM-6670] Add
option to disable reshuffling of JdbcIO
URL: https://github.com/apache/beam/pull/7956#issuecomment-474382326
I did sort of expect this to be "controversial". :)
> "some pipeline behaves catastrophically bad in a way that can only be
> prevented by adding the tuning parameter, and it can not be chosen
> automatically".
Admittedly, it is not fair to say that the presence of reshuffle is
*catastrophic*, but with it the pipeline currently does approx. twice as much
IO as necessary. I have not had the chance to test it without the reshuffle
yet, but I will see if I can find some time next week.
The reshuffles also clutter the graph a lot, which is a problem in Flink, as
its GUI does *not* perform well for large graphs, so I generally try to
minimize the size of the graph.
That said, I suppose the situation I describe could be detected automatically
by inspecting the graph and determining whether the `JdbcIO` read leads
directly to a group by. A more general optimization could be to remove
sequential reshuffles automatically and have some method of marking certain
internal transforms as "cheap" (like constructing `RawUnionValue`s for a
`CoGroupByKey`), i.e. merge reshuffles if they are separated only by "cheap"
transforms.
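To make the merging idea concrete, here is a minimal sketch in plain Java over
a linear chain of steps. It is not Beam code: `Step`, `Kind` and
`mergeReshuffles` are hypothetical names made up for illustration, a group by
is modelled as just another shuffle boundary, and a real optimization would
have to work on the full pipeline graph rather than on a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these types exist in Beam.
class ReshuffleMerge {
    // SHUFFLE covers both an explicit reshuffle and a group by (assumption).
    enum Kind { SHUFFLE, CHEAP, OTHER }

    static final class Step {
        final String name;
        final Kind kind;
        Step(String name, Kind kind) { this.name = name; this.kind = kind; }
    }

    // Drops a shuffle when the next non-cheap step after it is also a shuffle,
    // so only the last shuffle of such a run is kept.
    static List<Step> mergeReshuffles(List<Step> steps) {
        List<Step> out = new ArrayList<>();
        for (int i = 0; i < steps.size(); i++) {
            Step s = steps.get(i);
            if (s.kind == Kind.SHUFFLE && nextNonCheapIsShuffle(steps, i + 1)) {
                continue; // redundant: a later shuffle redistributes the data anyway
            }
            out.add(s);
        }
        return out;
    }

    // Skips over CHEAP steps and reports whether the first other step is a shuffle.
    static boolean nextNonCheapIsShuffle(List<Step> steps, int from) {
        for (int j = from; j < steps.size(); j++) {
            Kind k = steps.get(j).kind;
            if (k != Kind.CHEAP) {
                return k == Kind.SHUFFLE;
            }
        }
        return false;
    }

    static List<String> names(List<Step> steps) {
        List<String> result = new ArrayList<>();
        for (Step s : steps) {
            result.add(s.name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Step> chain = Arrays.asList(
            new Step("JdbcRead", Kind.OTHER),
            new Step("Reshuffle", Kind.SHUFFLE),
            new Step("ToRawUnionValue", Kind.CHEAP),
            new Step("GroupByKey", Kind.SHUFFLE),
            new Step("Format", Kind.OTHER));
        // prints [JdbcRead, ToRawUnionValue, GroupByKey, Format]
        System.out.println(names(mergeReshuffles(chain)));
    }
}
```

Which transforms count as "cheap" is the hard part and is left entirely open
here; the sketch only shows that once they are marked, the merge itself is a
simple forward scan.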
> Even when this standard is met, it is better to let the user specify hints
> in declarative rather than operational terms
Yeah, I thought about that. I could not really figure out a more suitable
name. `withHintDirectlyToGroupBy`, maybe?
Issue Time Tracking
-------------------
Worklog Id: (was: 215477)
Time Spent: 40m (was: 0.5h)
> Option to disable reparallelization in JdbcIO.Read
> --------------------------------------------------
>
> Key: BEAM-6670
> URL: https://issues.apache.org/jira/browse/BEAM-6670
> Project: Beam
> Issue Type: Wish
> Components: io-java-jdbc
> Reporter: Mike Pedersen
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> I'm doing approx. 20 JDBC queries against a database and then joining them
> together in a group by. Every single one of these queries does a reshuffle,
> which is sort of useless due to them being fed to a CoGroupByKey immediately
> afterwards.
> Reshuffle by default seems sensible by the principle of least surprise, but
> it would be nice to have a way to disable it when it's not necessary. For
> example a "withReshuffle(boolean)" method.
> This should be an easy addition and I am willing to add this if it sounds
> reasonable enough.