[jira] [Created] (FLINK-19901) Unable to exclude metrics variables for the last metrics reporter.

2020-10-30 Thread Truong Duc Kien (Jira)
Truong Duc Kien created FLINK-19901:
---

 Summary: Unable to exclude metrics variables for the last metrics 
reporter.
 Key: FLINK-19901
 URL: https://issues.apache.org/jira/browse/FLINK-19901
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.11.2
Reporter: Truong Duc Kien


We discovered a bug that causes the setting {{scope.variables.excludes}} to be ignored for the very last metrics reporter.

Because {{reporterIndex}} is incremented before the bounds check, the index for the last metrics reporter overflows back to 0, the slot reserved for the general (unfiltered) variables.

Interestingly, this bug does not trigger when there is only one metrics reporter, because in that case slot 0 is overwritten with that reporter's variables instead of being used to store the full set of variables.
{code:java}
public abstract class AbstractMetricGroup<A extends AbstractMetricGroup<?>> implements MetricGroup {
    ...
    public Map<String, String> getAllVariables(int reporterIndex, Set<String> excludedVariables) {
        // offset cache location to account for general cache at position 0
        reporterIndex += 1;
        if (reporterIndex < 0 || reporterIndex >= logicalScopeStrings.length) {
            reporterIndex = 0;
        }
        // if no variables are excluded (which is the default!) we re-use the general variables map to save space
        return internalGetAllVariables(excludedVariables.isEmpty() ? 0 : reporterIndex, excludedVariables);
    }

    ...
{code}
 [Github link to the above code|https://github.com/apache/flink/blob/3bf5786655c3bb914ce02ebb0e4a1863b205b829/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L122]
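The wraparound can be reproduced in isolation. Below is a minimal sketch of the index arithmetic, assuming (as the report implies) that the cache array is sized to the number of reporters, so the shifted index of the last reporter lands exactly on the array length:

```java
// Stand-in for the index logic in getAllVariables (illustrative only).
// Slot 0 is the general (unfiltered) cache; each reporter should get its own slot.
public class ReporterIndexDemo {
    static int resolveSlot(int reporterIndex, int arrayLength) {
        reporterIndex += 1;  // offset for the general cache at slot 0
        if (reporterIndex < 0 || reporterIndex >= arrayLength) {
            reporterIndex = 0;  // out-of-range indices fall back to the general slot
        }
        return reporterIndex;
    }

    public static void main(String[] args) {
        int arrayLength = 3;  // assumption: sized to the reporter count, not count + 1
        System.out.println(resolveSlot(0, arrayLength)); // 1: own slot, as expected
        System.out.println(resolveSlot(1, arrayLength)); // 2: own slot, as expected
        System.out.println(resolveSlot(2, arrayLength)); // 0: last reporter wraps to the general slot
    }
}
```

With a single reporter, every index resolves to slot 0, which is why the bug is masked in that configuration.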



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19519) Add support for port range in taskmanager.data.port

2020-10-07 Thread Truong Duc Kien (Jira)
Truong Duc Kien created FLINK-19519:
---

 Summary: Add support for port range in taskmanager.data.port
 Key: FLINK-19519
 URL: https://issues.apache.org/jira/browse/FLINK-19519
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.11.2
Reporter: Truong Duc Kien


Flink should add support for port ranges in {{taskmanager.data.port}}.

This would be very helpful when running in restrictive network environments, and port ranges are already supported for many other port settings.
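For comparison, several existing options already accept a port range; a flink-conf.yaml sketch (the values are examples, not recommendations):

```yaml
# Options that already accept a range or list of ports:
taskmanager.rpc.port: 50100-50200
metrics.internal.query-service.port: 50300-50400
# taskmanager.data.port currently accepts only a single port:
taskmanager.data.port: 42042
```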

 





[jira] [Created] (FLINK-18934) Idle stream does not advance watermark in connected stream

2020-08-13 Thread Truong Duc Kien (Jira)
Truong Duc Kien created FLINK-18934:
---

 Summary: Idle stream does not advance watermark in connected stream
 Key: FLINK-18934
 URL: https://issues.apache.org/jira/browse/FLINK-18934
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.11.1
Reporter: Truong Duc Kien


According to the Flink documentation, when a stream is marked idle, it should allow the watermarks of downstream operators to advance. However, when we connect an active data stream with an idle data stream, the output watermark of the CoProcessOperator does not increase.

Here's a small test that reproduces the problem.

https://github.com/kien-truong/flink-idleness-testing
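A two-input operator emits the minimum of its input watermarks; the documented idleness behaviour is that an idle input is excluded from that minimum. A plain-Java sketch of the expected semantics (illustrative, not Flink's operator code):

```java
// Sketch of the expected two-input watermark combination with idleness
// (illustrative; not the actual Flink operator implementation).
public class WatermarkCombineDemo {
    static long combine(long wm1, boolean idle1, long wm2, boolean idle2) {
        if (idle1 && idle2) return Long.MIN_VALUE;  // nothing to advance on
        if (idle1) return wm2;                      // ignore the idle input
        if (idle2) return wm1;
        return Math.min(wm1, wm2);                  // normal case: min of both inputs
    }

    public static void main(String[] args) {
        // Active stream at watermark 100, idle stream stuck at 10:
        // the idle input should be ignored, so the output advances to 100.
        System.out.println(combine(100L, false, 10L, true));  // 100
        // If the idle flag is not honoured, the output stays at
        // min(100, 10) = 10 and never advances -- the behaviour observed here.
        System.out.println(combine(100L, false, 10L, false)); // 10
    }
}
```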





[jira] [Created] (FLINK-18740) Support rate limiter in the universal kafka connector

2020-07-28 Thread Truong Duc Kien (Jira)
Truong Duc Kien created FLINK-18740:
---

 Summary: Support rate limiter in the universal kafka connector
 Key: FLINK-18740
 URL: https://issues.apache.org/jira/browse/FLINK-18740
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Kafka
Affects Versions: 1.11.1
Reporter: Truong Duc Kien


Currently, a rate limiter is only available for the Kafka 0.10 connector, but not for the universal connector.
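The 0.10 connector's limiter works by acquiring permits per fetched batch (a Guava RateLimiter underneath). A minimal token-bucket sketch of the same idea, with an injected clock for determinism (names and structure are illustrative, not the connector's API):

```java
// Minimal token-bucket sketch of a per-subtask rate limiter
// (illustrative; names are invented, not the connector's API).
public class TokenBucketDemo {
    private final double permitsPerSecond;
    private double available;
    private long lastRefillNanos;

    public TokenBucketDemo(double permitsPerSecond, long nowNanos) {
        this.permitsPerSecond = permitsPerSecond;
        this.available = permitsPerSecond;  // start with one second's budget
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if {@code permits} (e.g. bytes in a fetched batch) may pass now. */
    public boolean tryAcquire(double permits, long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1e9;
        available = Math.min(permitsPerSecond, available + elapsedSec * permitsPerSecond);
        lastRefillNanos = nowNanos;
        if (available >= permits) {
            available -= permits;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucketDemo bucket = new TokenBucketDemo(100.0, 0L);       // 100 permits/sec
        System.out.println(bucket.tryAcquire(100.0, 0L));              // true: budget available
        System.out.println(bucket.tryAcquire(1.0, 0L));                // false: bucket drained
        System.out.println(bucket.tryAcquire(100.0, 1_000_000_000L));  // true: refilled after 1s
    }
}
```

A consumer would call {{tryAcquire}} with the batch size before emitting a fetched batch and back off when it returns false.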





[jira] [Created] (FLINK-18554) Memory exceeds taskmanager.memory.process.size when enabling mmap_read for RocksDB

2020-07-10 Thread Truong Duc Kien (Jira)
Truong Duc Kien created FLINK-18554:
---

 Summary: Memory exceeds taskmanager.memory.process.size when 
enabling mmap_read for RocksDB
 Key: FLINK-18554
 URL: https://issues.apache.org/jira/browse/FLINK-18554
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Configuration
Affects Versions: 1.11.0
Reporter: Truong Duc Kien


We are testing Flink's automatic memory management feature on Flink 1.11. However, YARN kept killing our containers because the processes' physical memory exceeded the limit, even though we had tuned the following configurations:
{code:java}
taskmanager.memory.process.size
taskmanager.memory.managed.fraction
{code}
We suspect that this is because we have enabled mmap_read for RocksDB, since turning this option off seems to fix the issue. Maybe Flink's automatic memory management is unable to account for the additional memory required when using mmap_read?
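If YARN charges the mmap'd pages to the container while Flink's budget does not cover them, the untracked-memory headroom may be the knob to grow. A hedged flink-conf.yaml sketch (values are placeholders, not a verified fix):

```yaml
taskmanager.memory.process.size: 4096m
taskmanager.memory.managed.fraction: 0.4
# Headroom for memory Flink does not track itself (JVM and native overhead):
taskmanager.memory.jvm-overhead.fraction: 0.2
taskmanager.memory.jvm-overhead.max: 1g
```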

 





[jira] [Created] (FLINK-18527) Task Managers fail to start on HDP 2.6.5 due to commons-cli conflict

2020-07-08 Thread Truong Duc Kien (Jira)
Truong Duc Kien created FLINK-18527:
---

 Summary: Task Managers fail to start on HDP 2.6.5 due to 
commons-cli conflict 
 Key: FLINK-18527
 URL: https://issues.apache.org/jira/browse/FLINK-18527
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.11.0
Reporter: Truong Duc Kien


When launching a new job on HDP 2.6.5, we are encountering these exceptions:

 
{code:java}
2020-07-08 16:10:36 E [default-dispatcher-4] [ o.a.f.y.YarnResourceManager] Could not start TaskManager in container container_xxx
java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;
    at org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.<clinit>(CommandLineOptions.java:28) ~[flink-dist_2.12-1.11.0.jar:1.11.0]

2020-07-08 16:12:46 E [default-dispatcher-4] [ o.a.f.y.YarnResourceManager] Could not start TaskManager in container container_xxx {} []
java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.entrypoint.parser.CommandLineOptions
{code}
We believe this is because HDP 2.6.5 puts commons-cli version 1.2 on the classpath, while Flink expects version 1.3 or later ({{Option.builder}} was added in 1.3). Perhaps commons-cli should also be shaded to avoid such issues.

 





[jira] [Created] (FLINK-11283) Accessing the key when processing connected keyed stream

2019-01-08 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-11283:
---

 Summary: Accessing the key when processing connected keyed stream
 Key: FLINK-11283
 URL: https://issues.apache.org/jira/browse/FLINK-11283
 Project: Flink
  Issue Type: Improvement
  Components: DataStream API
Reporter: Truong Duc Kien


Currently, we can access the key when using {{KeyedProcessFunction}}.

Similar functionality would be very useful when processing connected keyed streams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11148) Rescaling operator regression in Flink 1.7

2018-12-13 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-11148:
---

 Summary: Rescaling operator regression in Flink 1.7
 Key: FLINK-11148
 URL: https://issues.apache.org/jira/browse/FLINK-11148
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Truong Duc Kien


We have a job using 20 TaskManagers with 3 slots each.

Using Flink 1.4, when we rescale a data stream from parallelism 60 to 20, each TaskManager has only one downstream slot, which receives the data from the 3 upstream slots in the same TaskManager.

Using Flink 1.7, this behaviour no longer holds: multiple downstream slots are assigned to the same TaskManager. This change is causing an imbalance in our TaskManager load.





[jira] [Created] (FLINK-10857) Conflict between JMX and Prometheus Metrics reporter

2018-11-12 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-10857:
---

 Summary: Conflict between JMX and Prometheus Metrics reporter
 Key: FLINK-10857
 URL: https://issues.apache.org/jira/browse/FLINK-10857
 Project: Flink
  Issue Type: Bug
  Components: Metrics
Affects Versions: 1.6.2
Reporter: Truong Duc Kien


When registering both JMX and Prometheus metrics reporter, the Prometheus 
reporter will fail with an exception.

 
{code:java}
o.a.f.r.m.MetricRegistryImpl Error while registering metric.
java.lang.IllegalArgumentException: Invalid metric name: flink_jobmanager.Status.JVM.Memory.Mapped_Count
    at org.apache.flink.shaded.io.prometheus.client.Collector.checkMetricName(Collector.java:182)
    at org.apache.flink.shaded.io.prometheus.client.SimpleCollector.<init>(SimpleCollector.java:164)
    at org.apache.flink.shaded.io.prometheus.client.Gauge.<init>(Gauge.java:68)
    at org.apache.flink.shaded.io.prometheus.client.Gauge$Builder.create(Gauge.java:74)
    at org.apache.flink.metrics.prometheus.AbstractPrometheusReporter.createCollector(AbstractPrometheusReporter.java:130)
    at org.apache.flink.metrics.prometheus.AbstractPrometheusReporter.notifyOfAddedMetric(AbstractPrometheusReporter.java:106)
    at org.apache.flink.runtime.metrics.MetricRegistryImpl.register(MetricRegistryImpl.java:329)
    at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.addMetric(AbstractMetricGroup.java:379)
    at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.gauge(AbstractMetricGroup.java:323)
    at org.apache.flink.runtime.metrics.util.MetricUtils.instantiateMemoryMetrics(MetricUtils.java:231)
    at org.apache.flink.runtime.metrics.util.MetricUtils.instantiateStatusMetrics(MetricUtils.java:100)
    at org.apache.flink.runtime.metrics.util.MetricUtils.instantiateJobManagerMetricGroup(MetricUtils.java:68)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startClusterComponents(ClusterEntrypoint.java:342)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:233)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:191)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
    at org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:176)
{code}
 

This is a small program to reproduce the problem:

https://github.com/dikei/flink-metrics-conflict-test
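Prometheus metric names must match the pattern {{[a-zA-Z_:][a-zA-Z0-9_:]*}}, so the dots in the name from the stack trace make it invalid. A quick check of that rule (plain Java, illustrative):

```java
import java.util.regex.Pattern;

// Validates metric names against the pattern from the Prometheus data model docs.
public class PromNameCheck {
    static final Pattern VALID = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");

    static boolean isValid(String name) {
        return VALID.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The name from the stack trace: the dots make it invalid.
        System.out.println(isValid("flink_jobmanager.Status.JVM.Memory.Mapped_Count")); // false
        // The same name with dots replaced by underscores is accepted.
        System.out.println(isValid("flink_jobmanager_Status_JVM_Memory_Mapped_Count")); // true
    }
}
```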

 






[jira] [Created] (FLINK-9628) Options to tolerate truncate failure in BucketingSink

2018-06-20 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9628:
--

 Summary: Options to tolerate truncate failure in BucketingSink
 Key: FLINK-9628
 URL: https://issues.apache.org/jira/browse/FLINK-9628
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Truong Duc Kien


When the target filesystem is corrupted, the truncate operation might fail permanently, causing the job to restart repeatedly without making progress.

There should be an option to ignore this kind of error and allow the program to continue.





[jira] [Created] (FLINK-9583) Wrong number of TaskManagers' slots after recovery.

2018-06-13 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9583:
--

 Summary: Wrong number of TaskManagers' slots after recovery.
 Key: FLINK-9583
 URL: https://issues.apache.org/jira/browse/FLINK-9583
 Project: Flink
  Issue Type: Bug
  Components: ResourceManager
Affects Versions: 1.5.0
 Environment: Flink 1.5.0 on YARN with the default execution mode.
Reporter: Truong Duc Kien
 Attachments: jm.log

We started a job with 120 slots, using a FixedDelayRestart strategy with a delay of 1 minute.

During recovery, some but not all slots were released.

When the job restarted again, Flink requested a new batch of slots.

The total number of slots was then 193, larger than the configured amount, but the excess slots were never released.

This bug does not happen in legacy mode. I've attached the job manager log.

 





[jira] [Created] (FLINK-9552) NPE in SpanningRecordSerializer during checkpoint

2018-06-07 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9552:
--

 Summary: NPE in SpanningRecordSerializer during checkpoint
 Key: FLINK-9552
 URL: https://issues.apache.org/jira/browse/FLINK-9552
 Project: Flink
  Issue Type: Bug
  Components: Type Serialization System
Affects Versions: 1.5.0
Reporter: Truong Duc Kien


We're intermittently encountering an NPE inside SpanningRecordSerializer during checkpoints.

 
{code:java}
2018-06-08 08:31:35,741 [ka.actor.default-dispatcher-83] INFO  o.a.f.r.e.ExecutionGraph IterationSource-22 (44/120) (c1b94ef849db0e5fb9fb7b85c17073ce) switched from RUNNING to FAILED.
java.lang.RuntimeException
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
    at org.apache.flink.streaming.runtime.tasks.StreamIterationHead.run(StreamIterationHead.java:91)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:98)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:129)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:105)
    at org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit(StreamRecordWriter.java:81)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
    ... 5 more
2018-06-08 08:31:35,746 [ka.actor.default-dispatcher-83] INFO  o.a.f.r.e.ExecutionGraph Job xxx (8a4eaf02c46dc21c7d6f3f70657dbb17) switched from state RUNNING to FAILING.
java.lang.RuntimeException
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
    at org.apache.flink.streaming.runtime.tasks.StreamIterationHead.run(StreamIterationHead.java:91)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:98)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:129)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:105)
    at org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit(StreamRecordWriter.java:81)
    at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
    ... 5 more
{code}
This issue is probably concurrency-related, because the relevant Flink code appears to have a proper null check:

https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1d4a03f555e/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/serialization/SpanningRecordSerializer.java#L98
{code:java}
// Copy from intermediate buffers to current target memory segment
if (targetBuffer != null) {
    targetBuffer.append(lengthBuffer);
    targetBuffer.append(dataBuffer);
    targetBuffer.commit();
}
{code}
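The null check can pass and the field still be null at the use site if another thread clears it in between. A deterministic single-threaded illustration of that check-then-act hazard (names invented for the sketch; this is not Flink's serializer code):

```java
// Deterministic illustration of a check-then-act race on a shared nullable field.
public class CheckThenActDemo {
    static StringBuilder targetBuffer = new StringBuilder();

    /** Returns true if an NPE occurred despite a passing null check. */
    static boolean demo() {
        if (targetBuffer != null) {        // the null check passes...
            targetBuffer = null;           // ...another thread clears the field here...
            try {
                targetBuffer.append("x");  // ...and the use still hits null
                return false;
            } catch (NullPointerException e) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // true: the check alone is not enough
    }
}
```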





[jira] [Created] (FLINK-9467) No Watermark display on Web UI

2018-05-29 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9467:
--

 Summary: No Watermark display on Web UI
 Key: FLINK-9467
 URL: https://issues.apache.org/jira/browse/FLINK-9467
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Truong Duc Kien


Watermarks are currently not shown on the web interface, because it still queries for watermarks using the old metric name `currentLowWatermark` instead of the new names `currentInputWatermark` and `currentOutputWatermark`.

 





[jira] [Created] (FLINK-9465) Separate timeout for savepoint and checkpoint

2018-05-29 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9465:
--

 Summary: Separate timeout for savepoint and checkpoint
 Key: FLINK-9465
 URL: https://issues.apache.org/jira/browse/FLINK-9465
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.5.0
Reporter: Truong Duc Kien


Savepoints can take much longer to complete than checkpoints, especially with incremental checkpointing enabled. This leads to a couple of problems:
 * For our job, we currently have to set the checkpoint timeout much larger than necessary; otherwise we would be unable to take savepoints.
 * During rush hour, our cluster encounters a high rate of checkpoint timeouts due to backpressure, yet we're unable to migrate to a larger configuration because the savepoint also times out.

In my opinion, the savepoint timeout should be configurable separately, both in the config file and as a parameter to the savepoint command.





[jira] [Created] (FLINK-9459) Maven enforcer plugin prevents compilation with HDP's Hadoop

2018-05-28 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9459:
--

 Summary: Maven enforcer plugin prevents compilation with HDP's 
Hadoop
 Key: FLINK-9459
 URL: https://issues.apache.org/jira/browse/FLINK-9459
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.5.0
Reporter: Truong Duc Kien


Compiling Flink against Hortonworks HDP's version of Hadoop currently fails because the Maven enforcer plugin detects a dependency convergence problem with their Hadoop.

 

The command used is

 
{noformat}
mvn clean install -DskipTests -Dcheckstyle.skip=true -Dmaven.javadoc.skip=true -Pvendor-repos -Dhadoop.version=2.7.3.2.6.5.0-292
{noformat}
 

The problems:
{noformat}
Dependency convergence error for com.fasterxml.jackson.core:jackson-core:2.6.0
paths to dependency are:
+-org.apache.flink:flink-bucketing-sink-test:1.5-SNAPSHOT
  +-org.apache.flink:flink-shaded-hadoop2:1.5-SNAPSHOT
    +-com.microsoft.azure:azure-storage:5.4.0
      +-com.fasterxml.jackson.core:jackson-core:2.6.0

and

+-org.apache.flink:flink-bucketing-sink-test:1.5-SNAPSHOT
  +-org.apache.flink:flink-shaded-hadoop2:1.5-SNAPSHOT
    +-com.fasterxml.jackson.core:jackson-core:2.6.0

and

+-org.apache.flink:flink-bucketing-sink-test:1.5-SNAPSHOT
  +-org.apache.flink:flink-shaded-hadoop2:1.5-SNAPSHOT
    +-com.fasterxml.jackson.core:jackson-databind:2.2.3
      +-com.fasterxml.jackson.core:jackson-core:2.2.3

[WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability. See above detailed error message.

[INFO] FAILURE build of project org.apache.flink:flink-bucketing-sink-test
{noformat}





[jira] [Created] (FLINK-9458) Unable to recover from job failure on YARN with NPE

2018-05-28 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9458:
--

 Summary: Unable to recover from job failure on YARN with NPE
 Key: FLINK-9458
 URL: https://issues.apache.org/jira/browse/FLINK-9458
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.5.0
 Environment: Ambari HDP 2.6.3
Hadoop 2.7.3
Job configuration: 120 TaskManagers x 1 slot
Reporter: Truong Duc Kien


After upgrading our jobs to Flink 1.5, they are unable to recover from failures; the following exception appears repeatedly:


{noformat}
2018-05-29 04:56:06,086 [jobmanager-future-thread-36] INFO o.a.f.r.e.ExecutionGraph Try to restart or fail the job xxx (23d9e87bf43ce163ff7db8afb062fb1d) if no longer possible.
2018-05-29 04:56:06,086 [jobmanager-future-thread-36] INFO o.a.f.r.e.ExecutionGraph Job xxx (23d9e87bf43ce163ff7db8afb062fb1d) switched from state RESTARTING to RESTARTING.
2018-05-29 04:56:06,086 [jobmanager-future-thread-36] INFO o.a.f.r.e.ExecutionGraph Restarting the job xxx (23d9e87bf43ce163ff7db8afb062fb1d).
2018-05-29 04:57:06,086 [jobmanager-future-thread-36] WARN o.a.f.r.e.ExecutionGraph Failed to restart the job.
java.lang.NullPointerException
    at org.apache.flink.runtime.jobmanager.scheduler.CoLocationConstraint.isAssignedAndAlive(CoLocationConstraint.java:104)
    at org.apache.flink.runtime.jobmanager.scheduler.CoLocationGroup.resetConstraints(CoLocationGroup.java:119)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1247)
    at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
    at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}






[jira] [Created] (FLINK-9070) Improve performance of RocksDBMapState.clear()

2018-03-25 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-9070:
--

 Summary: Improve performance of RocksDBMapState.clear()
 Key: FLINK-9070
 URL: https://issues.apache.org/jira/browse/FLINK-9070
 Project: Flink
  Issue Type: Improvement
  Components: State Backends, Checkpointing
Affects Versions: 1.6.0
Reporter: Truong Duc Kien


Currently, RocksDBMapState.clear() is implemented by iterating over all the keys and dropping them one by one. This iteration can be quite slow with:
 * Large maps
 * High-churn maps with a lot of tombstones

There are a few methods to speed up deletion for a range of keys, each with its own caveats:
 * DeleteRange: still experimental, likely buggy
 * DeleteFilesInRange + CompactRange: only good for large ranges

Flink could also keep a list of inserted keys in memory, then delete them directly without having to iterate over the RocksDB database again.

 

Reference:
 * [RocksDB article about range deletion|https://github.com/facebook/rocksdb/wiki/Delete-A-Range-Of-Keys]
 * [Bug in DeleteRange|https://pingcap.com/blog/2017-09-08-rocksdbbug]
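The real APIs need the native RocksDB library, so as a stand-in, the difference between per-key deletion and range deletion can be sketched with a sorted map; {{DeleteRange}} corresponds to clearing the whole key range in one call:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sorted-map stand-in for RocksDB key deletion strategies (illustrative only;
// the real APIs are RocksDB#delete and the experimental RocksDB#deleteRange).
public class RangeDeleteDemo {
    // Per-key deletion: snapshot the keys in the range, then drop them one by
    // one -- effectively what RocksDBMapState.clear() does today.
    static int clearPerKey(NavigableMap<String, String> db, String from, String to) {
        int deleted = 0;
        for (String key : db.subMap(from, true, to, false).keySet().toArray(new String[0])) {
            db.remove(key);
            deleted++;
        }
        return deleted;
    }

    // Range deletion: drop the whole range in a single operation.
    static void clearRange(NavigableMap<String, String> db, String from, String to) {
        db.subMap(from, true, to, false).clear();
    }

    public static void main(String[] args) {
        NavigableMap<String, String> db = new TreeMap<>();
        db.put("key:1", "a"); db.put("key:2", "b"); db.put("other:1", "c");
        System.out.println(clearPerKey(db, "key:", "key;"));  // 2 entries dropped one by one
        System.out.println(db.size());                        // 1 entry left
    }
}
```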

 





[jira] [Created] (FLINK-8949) Rest API failure with long URL

2018-03-15 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-8949:
--

 Summary: Rest API failure with long URL
 Key: FLINK-8949
 URL: https://issues.apache.org/jira/browse/FLINK-8949
 Project: Flink
  Issue Type: Bug
  Components: REST, Webfrontend
Affects Versions: 1.4.2
Reporter: Truong Duc Kien


When jobs have high parallelism, the URL for a REST request can get very long. When the URL is longer than 4096 bytes, the REST API returns an error:

{{Failure: 404 Not Found}}

This can easily be seen in the Web UI when Flink queries for watermarks using the REST API:

{{GET /jobs/:jobId/vertices/:vertexId/metrics?get=0.currentLowWatermark,1.currentLowWatermark,2.currentLo...}}

The request fails with more than 170 subtasks, and the watermark is not displayed in the Web UI.
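The threshold can be estimated by computing the URL length for n subtasks, assuming 32-character hex job and vertex IDs:

```java
// Estimate at which subtask count the watermark query URL exceeds 4096 bytes
// (assumes 32-char hex job/vertex IDs; an order-of-magnitude check, not exact).
public class UrlLengthDemo {
    static int urlLength(int subtasks) {
        // "/jobs/" + 32 + "/vertices/" + 32 + "/metrics?get="
        int len = 6 + 32 + 10 + 32 + 13;
        for (int i = 0; i < subtasks; i++) {
            if (i > 0) len += 1;                       // comma separator
            len += Integer.toString(i).length() + 20;  // "<i>.currentLowWatermark"
        }
        return len;
    }

    static int firstOverLimit(int limit) {
        int n = 1;
        while (urlLength(n) <= limit) n++;
        return n;
    }

    public static void main(String[] args) {
        // Lands in the low 170s, matching the observed failure threshold.
        System.out.println(firstOverLimit(4096)); // 172
    }
}
```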





[jira] [Created] (FLINK-8946) TaskManager stop sending metrics after JobManager failover

2018-03-14 Thread Truong Duc Kien (JIRA)
Truong Duc Kien created FLINK-8946:
--

 Summary: TaskManager stop sending metrics after JobManager failover
 Key: FLINK-8946
 URL: https://issues.apache.org/jira/browse/FLINK-8946
 Project: Flink
  Issue Type: Bug
  Components: Metrics, TaskManager
Affects Versions: 1.4.2
Reporter: Truong Duc Kien


Running in YARN standalone mode, when the JobManager performs a failover, all TaskManagers inherited from the previous JobManager stop sending metrics to the new JobManager and to the other registered metric reporters.

A cursory glance suggests that these lines of code might be the cause:

[https://github.com/apache/flink/blob/a3478fdfa0f792104123fefbd9bdf01f5029de51/flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala#L1082-L1086]

Perhaps the TaskManager closes its metrics group when disassociating from the JobManager, but does not create a new one when re-associating after failover?

 


