tibrewalpratik17 opened a new pull request, #13036:
URL: https://github.com/apache/pinot/pull/13036
We have intermittently seen issues in our clusters while creating the
StreamMessageDecoder. Stack trace:
```
java.lang.RuntimeException: Caught exception while creating StreamMessageDecoder from stream config:
StreamConfig{_type='kafka', _topicName='<redacted>', _consumerTypes=[LOWLEVEL],
_consumerFactoryClassName='<redacted>', _offsetCriteria='OffsetCriteria{_offsetType=LARGEST, _offsetString='largest'}',
_connectionTimeoutMillis=30000, _fetchTimeoutMillis=5000, _idleTimeoutMillis=180000,
_flushThresholdRows=80000000, _flushThresholdTimeMillis=86400000, _flushSegmentDesiredSizeBytes=209715200,
_flushAutotuneInitialRows=100000, _decoderClass='redacted', _decoderProperties={}, _groupId='null',
_topicConsumptionRateLimit=-1.0, _tableNameWithType='redacted'}
	at org.apache.pinot.spi.stream.StreamDecoderProvider.create(StreamDecoderProvider.java:48)
	at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1424)
	at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:446)
	at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:228)
	at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:168)
	at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:83)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:350)
	at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:278)
	at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)
	at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```
This stops consumption on one of the replicas, and once the other replica
starts committing, the stopped replica always ends up in ERROR state. The only
way to recover is to reset the segment on the affected replica.
Having one replica not consuming is also dangerous: if the host serving the
other replica restarts or goes down for any reason, it can lead to data loss.
Adding a retry policy around StreamDecoderProvider.create() may help reduce
the chances of such scenarios.
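For illustration, here is a minimal sketch of what such a retry could look like, assuming we reuse the existing RetryPolicies utility from pinot-spi; the helper name, attempt count, backoff values, and the exact create() signature shown are placeholders for discussion, not the final implementation:
```java
import java.util.Set;

import org.apache.pinot.spi.stream.StreamConfig;
import org.apache.pinot.spi.stream.StreamDecoderProvider;
import org.apache.pinot.spi.stream.StreamMessageDecoder;
import org.apache.pinot.spi.utils.retry.RetryPolicies;

public class DecoderCreationSketch {

  // Hypothetical helper: retry decoder creation a few times with exponential
  // backoff instead of failing the segment on the first transient error.
  public static StreamMessageDecoder createDecoderWithRetries(StreamConfig streamConfig, Set<String> fieldsToRead)
      throws Exception {
    StreamMessageDecoder[] decoderHolder = new StreamMessageDecoder[1];
    // Placeholder values: up to 5 attempts, 1s initial delay, 2x backoff factor.
    RetryPolicies.exponentialBackoffRetryPolicy(5, 1000L, 2.0).attempt(() -> {
      try {
        decoderHolder[0] = StreamDecoderProvider.create(streamConfig, fieldsToRead);
        return true;  // success, stop retrying
      } catch (Exception e) {
        return false; // treat as transient, retry after backoff
      }
    });
    return decoderHolder[0];
  }
}
```
The idea is that a short-lived failure (e.g. a transient classloading or dependency hiccup) no longer pushes the segment straight into ERROR state; the exception only propagates once all attempts are exhausted.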
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]