> On June 21, 2012, 7:23 a.m., Hari Shreedharan wrote: > > +1. Looks good! I haven't review the code thoroughly, since Brock would be > > the best for that. I did test it out though - like Will mentioned, works > > well. Responds to Ctrl-C, phew! And restarts after kills also. I am not > > sure of edge cases or more subtle problems, but looks good in general. (I'm > > not going to mark Ship It - since I'd expect a full review from Brock to > > decide). > > > > Thanks Arvind! Great work! > > Hari Shreedharan wrote: > I have one comment though. Would it be possible to use the agent name, > pid and channel name to make sure multiple channels will work even if > multiple flume agents and/or multiple flows within the same agent work > properly? As of now it looks like only one agent can run using the default. > It might be a good idea to do something about it. > > Right now, if I use the exact same config on 2 different agents only one > of them will make any progress, not the other. I didn't check with multiple > flows within the same agent though. > > Arvind Prabhakar wrote: > Hari - thanks for taking the time to review and test the patch. Regarding > your comment - can you please elaborate more? Specifically - are you able to > run multiple agents when you specify different (non-default) data and > checkpoint directories? If so, then this is not really an issue as it was > discussed as part of FLUME-1085 patch and we decided against it. Please see > https://reviews.apache.org/r/4661/ for the discussion. > > Hari Shreedharan wrote: > Btw, the above situation also causes the logs to be filled with > IOException : > java.lang.RuntimeException: java.io.IOException: Cannot lock > /Users/hshreedharan/.flume/file-channel/checkpoint. The directory is already > locked. > 2012-06-21 00:43:27,426 INFO nodemanager.DefaultLogicalNodeManager: > Waiting for channel: fileChannel to start. Sleeping for 500 ms > 2012-06-21 00:43:27,928 INFO nodemanager.DefaultLogicalNodeManager: > Waiting for channel: fileChannel to start. Sleeping for 500 ms > > > At this point, the channel causes the agent to not respond to SIGINT and > a kill -9 is needed to kill the agent.
Yes, with non-default, it works fine. If that was a conscious decision, please document it. - Hari ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/5436/#review8444 ----------------------------------------------------------- On June 21, 2012, 1:29 a.m., Arvind Prabhakar wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/5436/ > ----------------------------------------------------------- > > (Updated June 21, 2012, 1:29 a.m.) > > > Review request for Flume and Brock Noland. > > > Description > ------- > > The file channel uses a disk-serialized in-memory checkpointing mechanism. > When the channel is full and the capacity is large, these checkpoints take a > long time to serialize and deserialize. For example, a channel with 1M > entries could take many minutes to boot up. Similarly, a boot up of a largely > full channel would require the replay of all log events to reconstruct the > correct state. Due to this latency issues and the failure interaction of the > channel with the LifeCycleSupervisor, the system could get into an unusable > state easily as evident from the FLUME-1232 issue. > > This patch modifies the checkpointing mechanism as follows: > * The FlumeEventQueue itself represents a checkpoint that is maintained as a > memory mapped file. > * During checkpointing, a marker is introduced in active logs which is used > to skip records during display. > > In order to ensure correctness, a reader/writer lock is used where the reader > lock is used by consumers operating against the channel while the writer lock > is used to facilitate checkpointing. Some limitations of this approach are: > > * The total number of active log files is now limited to a maximum of 1024. > * Dynamic resizing of the channel capacity is no longer allowed unless the > checkpoint is rebuild from scratch which can cause significant delay in > startup. > > > This addresses bug FLUME-1232. > https://issues.apache.org/jira/browse/FLUME-1232 > > > Diffs > ----- > > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Checkpoint.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannel.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannelConfiguration.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FlumeEventQueue.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Log.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/ReplayHandler.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestCheckpoint.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestFileChannel.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestFlumeEventQueue.java > 1351988 > > /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestLog.java > 1351988 > /trunk/flume-ng-core/src/main/java/org/apache/flume/sink/LoggerSink.java > 1351988 > > Diff: https://reviews.apache.org/r/5436/diff/ > > > Testing > ------- > > Ran all tests. Did some manual testing. Will be doing more manual testing and > cleanup as necessary while the review is underway. > > > Thanks, > > Arvind Prabhakar > >
