[ https://issues.apache.org/jira/browse/STORM-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695711#comment-14695711 ]
ASF GitHub Bot commented on STORM-837:
--------------------------------------
Github user d2r commented on a diff in the pull request:
https://github.com/apache/storm/pull/644#discussion_r37006689
--- Diff: external/storm-hdfs/src/main/java/org/apache/storm/hdfs/trident/HdfsState.java ---
@@ -136,44 +174,98 @@ public void run() {
private transient FSDataOutputStream out;
protected RecordFormat format;
private long offset = 0;
+ private int bufferSize = 131072; // default 128 K
- public HdfsFileOptions withFsUrl(String fsUrl){
+ public HdfsFileOptions withFsUrl(String fsUrl) {
this.fsUrl = fsUrl;
return this;
}
- public HdfsFileOptions withConfigKey(String configKey){
+ public HdfsFileOptions withConfigKey(String configKey) {
this.configKey = configKey;
return this;
}
- public HdfsFileOptions withFileNameFormat(FileNameFormat fileNameFormat){
+ public HdfsFileOptions withFileNameFormat(FileNameFormat fileNameFormat) {
this.fileNameFormat = fileNameFormat;
return this;
}
- public HdfsFileOptions withRecordFormat(RecordFormat format){
+ public HdfsFileOptions withRecordFormat(RecordFormat format) {
this.format = format;
return this;
}
- public HdfsFileOptions withRotationPolicy(FileRotationPolicy rotationPolicy){
+ public HdfsFileOptions withRotationPolicy(FileRotationPolicy rotationPolicy) {
this.rotationPolicy = rotationPolicy;
return this;
}
- public HdfsFileOptions addRotationAction(RotationAction action){
+ /**
+ * <p>Set the size of the buffer used for hdfs file copy in case of recovery. The default
+ * value is 131072.</p>
+ *
+ * <p> Note: The lower limit for the parameter is 4096, below which the
+ * option is ignored. </p>
+ *
+ * @param sizeInBytes the buffer size in bytes
+ * @return {@link HdfsFileOptions}
+ */
+ public HdfsFileOptions withBufferSize(int sizeInBytes) {
+ this.bufferSize = Math.max(4096, sizeInBytes); // at least 4K
+ return this;
+ }
+
+ @Deprecated
--- End diff --
Why do we deprecate this?
> HdfsState ignores commits
> -------------------------
>
> Key: STORM-837
> URL: https://issues.apache.org/jira/browse/STORM-837
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Robert Joseph Evans
> Assignee: Arun Mahadevan
> Priority: Critical
>
> HdfsState works with Trident, which is supposed to provide exactly-once
> processing. It does this in two ways: first by informing the state about
> commits so it can be sure the data is written out, and second by having a
> commit id, so that double commits can be handled.
> HdfsState ignores the beginCommit and commit calls, and with that ignores the
> ids. This means that if you use HdfsState and your worker crashes you may
> both lose data and get some data twice.
> At a minimum the flush and file rotation should be tied to the commit in some
> way. The commit ID should at a minimum be written out with the data so
> someone reading the data can have a hope of deduping it themselves.
> Also, with the rotationActions it is possible for a partially written file
> to be leaked and never moved to its final location, because it is not
> rotated. I personally think the actions are too generic for this case and
> need to be deprecated.
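
To make the commit handling described above concrete, here is a minimal sketch of a sink that ties durability to the commit and tags each record with the commit id. It is an in-memory stand-in only: the class CommitAwareSink and its methods are hypothetical names for illustration, not the HdfsState code from the pull request, and the "durable" list merely simulates data that has been flushed to a file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a file sink; NOT the Storm HdfsState API. */
class CommitAwareSink {
    private final List<String> durable = new ArrayList<>(); // simulates flushed file contents
    private final List<String> pending = new ArrayList<>(); // records of the batch in flight
    private long lastCommittedTxId = -1;                    // highest commit id made durable
    private boolean replay = false;                         // true if this batch was already committed

    /** Start of a batch: drop leftovers of a failed attempt and detect replays. */
    void beginCommit(long txId) {
        pending.clear();
        replay = txId <= lastCommittedTxId;
    }

    /** Buffer a record unless the batch is a replay of an already committed one. */
    void write(String record) {
        if (!replay) {
            pending.add(record);
        }
    }

    /** End of a batch: only now do records become durable, tagged with the commit id. */
    void commit(long txId) {
        if (replay) {
            return; // double commit: nothing to write
        }
        for (String record : pending) {
            durable.add(txId + "\t" + record);
        }
        pending.clear();
        lastCommittedTxId = txId;
    }

    /** Copy of what a reader of the output would see. */
    List<String> durableRecords() {
        return new ArrayList<>(durable);
    }
}
```

In this sketch a batch that is replayed after a crash either overwrites its own uncommitted leftovers (beginCommit clears them) or is skipped entirely if its id was already committed, and the id written next to each record gives a downstream reader a way to dedupe.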