[
https://issues.apache.org/jira/browse/STORM-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394106#comment-14394106
]
Jungtaek Lim commented on STORM-738:
------------------------------------
I thought about overflow control once more, and there seem to be two different
approaches to handle it.
A. ShellBolt side control
We can modify ShellBolt to keep a list of sent tuple ids, and stop sending tuples
when the list exceeds a configured max value. To achieve this, the subprocess
should notify ShellBolt that a tuple id is complete.
- It introduces a new command for multi-lang, "proceed" (or a better name)
- ShellBolt stores a list of tuples that are still being processed.
- Its overhead could be big, since the subprocess must notify ShellBolt every
time a tuple is processed.
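A minimal sketch of the bookkeeping approach A implies, written in Python for
brevity even though ShellBolt itself is Java. The class name InFlightWindow, its
method names, and the max_in_flight parameter are hypothetical; only the
"proceed" command name comes from the proposal above.

```python
class InFlightWindow:
    """Tracks tuple ids sent to the subprocess but not yet reported complete."""

    def __init__(self, max_in_flight):
        # Configured max value: how many tuples may be outstanding at once.
        self.max_in_flight = max_in_flight
        self.in_flight = set()

    def can_send(self):
        # The writer side stops sending once the window is full.
        return len(self.in_flight) < self.max_in_flight

    def on_send(self, tuple_id):
        # Record a tuple id when it is written to the subprocess.
        self.in_flight.add(tuple_id)

    def on_proceed(self, tuple_id):
        # Called when the subprocess sends the (hypothetical) "proceed" command.
        self.in_flight.discard(tuple_id)


window = InFlightWindow(max_in_flight=2)
window.on_send("t1")
window.on_send("t2")
blocked = not window.can_send()   # window is full, sending must pause
window.on_proceed("t1")
resumed = window.can_send()       # one slot is free again
```

The cost mentioned in the last bullet shows up here as one on_proceed call per
tuple, that is, one extra message from the subprocess for every tuple it handles.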
B. subprocess side control
We can modify the subprocess to check its pending queue after reading a tuple.
If the queue exceeds a configured max value, the subprocess can send a "delay"
request to ShellBolt to slow it down.
When ShellBolt receives "delay", BoltWriterRunnable should stop polling the
pending queue and resume polling later.
How long should ShellBolt wait before it resumes sending? The unit could be
"delay time" or "tuple count". I don't know which is better yet.
- It introduces a new command for multi-lang, "delay" (or a better name)
- I don't think it would be introduced soon, but the subprocess can request a
delay based on its own statistics. (ex. pending tuple count * average tuple
processing time for the time unit, average pending tuple count for the count unit)
-- We can leave when and how much "delay" to request up to the user. Users can
write their own algorithm to control flooding.
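A rough sketch of how a subprocess might build such a request with the time-unit
estimate from the example above (pending tuple count * average processing time).
The function names, the "ms" field, and the thresholds are assumptions for
illustration; "delay" is only the proposed command name and is not part of the
current multi-lang protocol. The JSON-plus-"end"-line framing follows the
existing multi-lang message format.

```python
import json


def build_delay_request(pending_count, max_pending, avg_process_ms):
    """Return a hypothetical "delay" message, or None if no delay is needed."""
    if pending_count <= max_pending:
        return None
    # Time-unit estimate: pending tuple count * average processing time.
    delay_ms = pending_count * avg_process_ms
    return {"command": "delay", "ms": delay_ms}


def encode(message):
    # Multi-lang messages are JSON followed by a line containing "end".
    return json.dumps(message) + "\nend\n"


request = build_delay_request(pending_count=1500, max_pending=1000,
                              avg_process_ms=2)
if request is not None:
    wire = encode(request)  # this string would be written to STDOUT
```

Because the policy lives in one small function, a user could swap in a different
rule (for example a count-based one) without touching the rest of the script.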
In my opinion B seems more natural, because the current issue originates on the
subprocess side, so it would be better to let the subprocess handle it.
> Multilang needs Overflow-Control mechanism and HeartBeat timeout problem
> ------------------------------------------------------------------------
>
> Key: STORM-738
> URL: https://issues.apache.org/jira/browse/STORM-738
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.10.0, 0.9.3-rc2, 0.9.4, 0.11.0
> Reporter: DashengJu
> Priority: Critical
> Attachments: storm_multilang.png
>
>
> hi, all
> we have a topology, which has 3 components (spout->parser->saver), and the
> parser is a Multilang bolt written in Python. We do not use the ACK mechanism.
> We found 2 problems with the Multilang Python script:
> 1) the parser Python script may hold too many tuples and consume too much
> memory;
> 2) with the MultiLang heartbeat mechanism described by
> https://issues.apache.org/jira/browse/STORM-513, the Python script always
> times out on heartbeat, even when the parser bolt is working normally, causing
> the supervisor to restart itself.
> !storm_multilang.png!
> ShellBolt process === Father-Process
> PythonScript process === Child-Process
> The reason is:
> 1) when the topology does not use the ACK mechanism, the spout has no
> overflow-control ability; if too many tuples arrive on the stream, the spout
> will send all of them to the parser's ShellBolt process (Father-Process);
> 2) the parser's ShellBolt process just puts the tuples into the _pendingWrites
> queue, if the _pendingWrites queue does not have a limit;
> 3) the parser's PythonScript process (Child-Process) calls readMsg() to read a
> tuple from STDIN, handles the tuple, emits a new tuple to its father
> process through STDOUT, and then calls readTaskIds() to read from STDIN. Because
> the Father-Process's queue already holds too many other tuples, the Child-Process
> will read all of those tuples into pending_commands until it receives the TaskIds.
> 4) so the Child-Process's pending_commands may contain too many tuples
> and consume too much memory.
> As for the heartbeat, there are too many pending_commands for the
> Child-Process to handle, and every emit operation in the Child-Process needs
> more I/O read operations from STDIN. It may take 10 seconds to handle one
> tuple, so the heartbeat tuple is not handled quickly, and a
> timeout will happen.
> Even if the Father-Process's _pendingWrites has a limit, for example 1000, the
> Child-Process may need 1000 x 1000 read operations before it can handle the
> heartbeat tuple.
> [~revans2] [~kabhwan] this is related to Multilang and heartbeat, please help to
> confirm the two problems.
> I think the Father-Process and Child-Process need an Overflow-Control Protocol to
> control the Python script's memory usage.
> And the heartbeat tuple needs a separate queue (pending_heartbeats), so that the
> Child-Process handles heartbeat tuples at high priority. [~kabhwan] I wish to
> hear your opinion.