Jérémie Vexiau created BEAM-2803:
------------------------------------
Summary: JdbcIO read is very slow when query returns a lot of rows
Key: BEAM-2803
URL: https://issues.apache.org/jira/browse/BEAM-2803
Project: Beam
Issue Type: Improvement
Components: sdk-java-extensions
Affects Versions: Not applicable
Reporter: Jérémie Vexiau
Assignee: Reuven Lax
Fix For: Not applicable
Hi,
I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
My SELECT query returns more than 5 million rows,
using cursors with Statement.setFetchSize().
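For context, a minimal sketch of what cursor-based fetching looks like in plain JDBC. This is not the JdbcIO implementation; the class and method names (CursorFetchSketch, prepareStreaming) are hypothetical. It assumes the PostgreSQL driver's documented conditions for using a cursor: autocommit off, a forward-only result set, and a positive fetch size.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class CursorFetchSketch {

    private CursorFetchSketch() {}

    /**
     * Hypothetical helper: prepares a statement so that rows are fetched in
     * batches of fetchSize instead of being loaded into memory all at once.
     */
    public static PreparedStatement prepareStreaming(
            Connection connection, String query, int fetchSize) throws SQLException {
        // The PostgreSQL driver only uses a server-side cursor outside autocommit mode.
        connection.setAutoCommit(false);
        // Forward-only, read-only result set: the other documented precondition.
        PreparedStatement statement = connection.prepareStatement(
                query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        // Number of rows the driver requests per round trip.
        statement.setFetchSize(fetchSize);
        return statement;
    }
}
```

A caller would obtain a Connection from its own DataSource, call prepareStreaming(connection, "SELECT ...", 1000), and iterate over executeQuery() as usual. The fetch size of 1000 is only an example value.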
These ParDos are OK:
{code:java}
.apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
// Pair each row with a random key so that the GroupByKey redistributes the rows.
.apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
  private Random random;

  @Setup
  public void setup() {
    random = new Random();
  }

  @ProcessElement
  public void processElement(ProcessContext context) {
    context.output(KV.of(random.nextInt(), context.element()));
  }
}))
{code}
But the reshuffle step is very, very slow.
I suspect the cause is the GroupByKey, which has to handle more than 5 million distinct keys:
{code:java}
.apply(GroupByKey.<Integer, T>create())
{code}
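To illustrate the suspicion about the number of keys, here is a small plain-Java sketch (no Beam dependency) comparing how many distinct keys come out of random.nextInt(), as in the DoFn above, with a bounded random.nextInt(numShards). The class name, method names and the numShards parameter are hypothetical, and whether fewer keys actually speeds up the shuffle depends on the runner; that part is an assumption, not a measurement.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public final class ShardKeySketch {

    private ShardKeySketch() {}

    /** Distinct keys when every element gets an unbounded random int key. */
    static int distinctUnboundedKeys(int elements, long seed) {
        Random random = new Random(seed);
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < elements; i++) {
            keys.add(random.nextInt());
        }
        return keys.size();
    }

    /** Distinct keys when keys are drawn from the range [0, numShards). */
    static int distinctBoundedKeys(int elements, int numShards, long seed) {
        Random random = new Random(seed);
        Set<Integer> keys = new HashSet<>();
        for (int i = 0; i < elements; i++) {
            keys.add(random.nextInt(numShards));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        int elements = 1_000_000; // smaller than the 5 million rows in the report
        int numShards = 256;      // hypothetical shard count
        System.out.println("unbounded keys: " + distinctUnboundedKeys(elements, 42L));
        System.out.println("bounded keys:   " + distinctBoundedKeys(elements, numShards, 42L));
    }
}
```

With unbounded keys, almost every element ends up in its own group. With a bounded range, the GroupByKey sees at most numShards groups, at the cost of larger groups per key.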
Is there a way to optimize the reshuffle, or another method to prevent
fusion?
Thanks in advance,
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)