[ 
https://issues.apache.org/jira/browse/STORM-738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482655#comment-14482655
 ] 

DashengJu commented on STORM-738:
---------------------------------

[~kabhwan] I have two questions about the heartbeat mechanism:
1) I think the heartbeat mechanism should not be tied to the ACK or NON-ACK 
mechanism. I mean the heartbeat should not rely on ACK, NON-ACK, or anything 
else.
2) The heartbeat design constraint (the subprocess is alive but cannot process 
the "heartbeat" tuple in time) is hit all the time. For example: a) if 
ShellBolt needs a long time to handle one tuple, it may exceed the timeout 
even though it handles other tuples quickly; b) currently, ShellBolt handles 
one tuple, emits the result, and then reads all the tuples from stdin to get 
the id. Reading all the tuples from stdin takes a long time, which causes the 
timeout to be exceeded. So I think we should remove the heartbeat design 
constraint.
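
To illustrate case b), here is a minimal sketch of the read pattern, not the 
real storm.py code. It assumes a simplified model: an in-memory deque stands 
in for stdin, commands are dicts, and a task-id reply is a list. The names 
read_msg, read_task_ids, emit and run are hypothetical stand-ins for the 
helpers mentioned in the issue (readMsg, readTaskIds).

```python
from collections import deque

stdin = deque()             # stands in for the pipe from ShellBolt to the script
pending_commands = deque()  # commands buffered while waiting for task ids


def read_msg():
    """Read the next message from the simulated pipe."""
    return stdin.popleft()


def read_task_ids():
    """Read until a task-id reply (a list) arrives, buffering every command."""
    msg = read_msg()
    while not isinstance(msg, list):
        pending_commands.append(msg)
        msg = read_msg()
    return msg


def read_command():
    """Prefer already-buffered commands, otherwise read from the pipe."""
    if pending_commands:
        return pending_commands.popleft()
    return read_msg()


def emit(values):
    """Assumption: the parent's reply lands behind everything already queued."""
    stdin.append([0])
    return read_task_ids()


def run(backlog):
    """Handle one tuple with `backlog` tuples queued; return the buffer size."""
    stdin.clear()
    pending_commands.clear()
    stdin.extend({"id": i, "tuple": [i]} for i in range(backlog))
    cmd = read_command()
    emit(cmd["tuple"])
    return len(pending_commands)
```

Under these assumptions, run(1000) leaves every remaining queued tuple in 
pending_commands after a single emit, which is the buffering described in 
case b).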


> Multilang needs Overflow-Control mechanism and HeartBeat timeout problem
> ------------------------------------------------------------------------
>
>                 Key: STORM-738
>                 URL: https://issues.apache.org/jira/browse/STORM-738
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 0.9.3-rc2, 0.9.4, 0.11.0
>            Reporter: DashengJu
>            Priority: Critical
>         Attachments: storm_multilang.png
>
>
> Hi all,
> we have a topology with 3 components (spout->parser->saver), and the 
> parser is a Multilang bolt written in Python. We do not use the ACK mechanism.
> We found 2 problems with the Multilang Python script.
> 1) the parser Python script may hold too many tuples and consume too much 
> memory;
> 2) with the MultiLang heartbeat mechanism described in 
> https://issues.apache.org/jira/browse/STORM-513, the Python script always 
> times out on the heartbeat, even when the parser bolt is healthy, causing the 
> supervisor to restart itself.
> !storm_multilang.png!
> ShellBolt process === Father-Process
> PythonScript process === Child-Process
> The reason is:
> 1) when the topology does not use the ACK mechanism, the spout has no 
> Overflow-control ability; if too many tuples arrive on the stream, the spout 
> will send all of them to the parser's ShellBolt process (Father-Process);
> 2) the parser's ShellBolt process just puts the tuples into the 
> _pendingWrites queue, if the _pendingWrites queue does not have a limit;
> 3) the parser's PythonScript process (Child-Process) calls readMsg() to read 
> a tuple from STDIN, handles the tuple, emits a new tuple to its father 
> process through STDOUT, and then calls readTaskIds() to read from STDIN. 
> Because the Father-Process's queue already holds too many other tuples, the 
> Child-Process will read all of those tuples into pending_commands until it 
> receives the TaskIds.
> 4) so the Child-Process's pending_commands may contain too many tuples 
> and consume too much memory.
> As for the heartbeat: because there are too many pending_commands for the 
> Child-Process to handle, and every emit operation by the Child-Process needs 
> more I/O read operations from STDIN, it may take 10 seconds to handle one 
> tuple. The heartbeat tuple is then not handled quickly enough, and a 
> timeout happens.
> Even if the Father-Process's _pendingWrites has a limit, for example 1000, 
> the Child-Process may need 1000 x 1000 read operations before it can handle 
> the heartbeat tuple.
> [~revans2] [~kabhwan] this is related to Multilang and the heartbeat; please 
> help confirm the two problems.
> I think the Father-Process and Child-Process need an Overflow-Control 
> protocol to control the Python script's memory usage.
> Also, the heartbeat tuple needs a separate queue (pending_heartbeats), and 
> the Child-Process should handle heartbeat tuples at high priority. [~kabhwan] 
> I would like to hear your opinion.
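
The separate-queue proposal at the end of the description could look roughly 
like the sketch below. This is only an illustration, not the storm.py 
implementation: buffer_msg, next_command and is_heartbeat are hypothetical 
names, and the check on the "__heartbeat" stream name is an assumption about 
how heartbeat tuples are marked.

```python
from collections import deque

pending_commands = deque()    # ordinary tuples waiting to be processed
pending_heartbeats = deque()  # heartbeat tuples, served before anything else


def is_heartbeat(msg):
    """Assumption: heartbeat tuples arrive on a stream named "__heartbeat"."""
    return isinstance(msg, dict) and msg.get("stream") == "__heartbeat"


def buffer_msg(msg):
    """Route a message read while waiting for task ids to the right queue."""
    if is_heartbeat(msg):
        pending_heartbeats.append(msg)
    else:
        pending_commands.append(msg)


def next_command():
    """Return a buffered heartbeat first, then ordinary commands, else None."""
    if pending_heartbeats:
        return pending_heartbeats.popleft()
    if pending_commands:
        return pending_commands.popleft()
    return None
```

With this split, a heartbeat that has already been read from STDIN is 
answered before the buffered tuples. It does not help while the heartbeat is 
still sitting unread in the pipe, which is why the description also asks for 
an Overflow-Control protocol.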



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
