[
https://issues.apache.org/jira/browse/STORM-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737495#comment-14737495
]
ASF GitHub Bot commented on STORM-828:
--------------------------------------
Github user redsanket commented on a diff in the pull request:
https://github.com/apache/storm/pull/668#discussion_r39089306
--- Diff:
external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/CSVFileBolt.java
---
@@ -0,0 +1,32 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.storm.hdfs.bolt;
+
+import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
+import org.apache.storm.hdfs.bolt.format.RecordFormat;
+import org.apache.storm.hdfs.bolt.rotation.TimedRotationPolicy;
+import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
+import org.apache.storm.hdfs.common.rotation.MoveFileAction;
+
+public class CSVFileBolt extends HdfsBolt {
+ private static String fileExtension = ".csv";
+
+ public CSVFileBolt(String sourceDir, String destDir) {
+ super(sourceDir, destDir, fileExtension);
+ }
+}
--- End diff --
For further clarification: I was thinking this issue over, and my
understanding is that the spout currently emits tuples as fields, and the CSV
or TSV bolt joins them with the specified record-format delimiter. If the
delimiter is a ",", the fields of each tuple are joined with commas. HdfsBolt
performs this formatting when execute is called on it; the TSV and CSV bolts
are just abstractions that supply the intended delimiter. Could you please
let me know, with an example, what exactly needs to be done? I thought I
understood your point, but right now I cannot clearly picture it. It would be
great if you could get back with a reply. Thanks a lot.
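To make the joining behavior described above concrete, here is a minimal
sketch in plain Java. It is a simplified stand-in, not the actual storm-hdfs
DelimitedRecordFormat class or Tuple type; the class and method names are
hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Simplified stand-in (assumed names, not the storm-hdfs API) for what a
 * delimited record format does: join a tuple's field values with a field
 * delimiter and terminate the record, once per execute() call.
 */
public class DelimitedRecordSketch {
    private final String fieldDelimiter;   // "," for CSV, "\t" for TSV
    private final String recordDelimiter;  // usually "\n"

    public DelimitedRecordSketch(String fieldDelimiter, String recordDelimiter) {
        this.fieldDelimiter = fieldDelimiter;
        this.recordDelimiter = recordDelimiter;
    }

    /** Formats one "tuple" (here just a list of values) into one record. */
    public String format(List<Object> tupleValues) {
        return tupleValues.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(fieldDelimiter)) + recordDelimiter;
    }

    public static void main(String[] args) {
        DelimitedRecordSketch csv = new DelimitedRecordSketch(",", "\n");
        // Joins the fields with the delimiter: user1,42,click
        System.out.print(csv.format(Arrays.asList("user1", 42, "click")));
    }
}
```

Swapping the field delimiter from "," to "\t" is the only difference between
the CSV and TSV cases, which is why the subclasses are thin wrappers.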
> HdfsBolt takes a lot of configuration, need good defaults
> ---------------------------------------------------------
>
> Key: STORM-828
> URL: https://issues.apache.org/jira/browse/STORM-828
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Robert Joseph Evans
> Assignee: Sanket Reddy
>
> The following is code from
> https://github.com/apache/storm/blob/master/external/storm-hdfs/src/test/java/org/apache/storm/hdfs/bolt/HdfsFileTopology.java
> representing the amount of configuration required to use the HdfsBolt.
> {code}
> // sync the filesystem after every 1k tuples
> SyncPolicy syncPolicy = new CountSyncPolicy(1000);
> // rotate files every 1 min
> FileRotationPolicy rotationPolicy =
>     new TimedRotationPolicy(1.0f, TimedRotationPolicy.TimeUnit.MINUTES);
> FileNameFormat fileNameFormat = new DefaultFileNameFormat()
>     .withPath("/tmp/foo/")
>     .withExtension(".txt");
> RecordFormat format = new DelimitedRecordFormat()
>     .withFieldDelimiter("|");
> Yaml yaml = new Yaml();
> InputStream in = new FileInputStream(args[1]);
> Map<String, Object> yamlConf = (Map<String, Object>) yaml.load(in);
> in.close();
> config.put("hdfs.config", yamlConf);
> HdfsBolt bolt = new HdfsBolt()
>     .withConfigKey("hdfs.config")
>     .withFsUrl(args[0])
>     .withFileNameFormat(fileNameFormat)
>     .withRecordFormat(format)
>     .withRotationPolicy(rotationPolicy)
>     .withSyncPolicy(syncPolicy)
>     .addRotationAction(new MoveFileAction().toDestination("/tmp/dest2/"));
> {code}
> This is way too much. If this were just an example showing all of the
> possibilities, that would be OK, but of the 8 lines used in the construction
> of the bolt, 5 are required or the bolt will blow up at run time. We
> should provide reasonable defaults for everything that can have a reasonable
> default, and required parameters should be passed in through the
> constructor, not as builder arguments. I realize we need to maintain
> backwards compatibility, so we may need some new bolt definitions.
> {code}
> HdfsTSVBolt bolt = new HdfsTSVBolt(outputDir);
> {code}
> If someone wanted to sync every 100 records instead of every 1000, we could do
> {code}
> TSVFileBolt bolt = new TSVFileBolt(outputDir)
>     .withSyncPolicy(new CountSyncPolicy(100));
> {code}
> I would like to see a base HdfsFileBolt that requires a record format and an
> output directory, with defaults for everything else. Then we could
> have TSVFileBolt and CSVFileBolt subclass it, and ideally SequenceFileBolt
> as well.
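The constructor-plus-defaults pattern proposed in the quoted description can
be sketched in plain Java as follows. This is a hypothetical illustration of
the pattern only; FileBoltConfig and its methods are invented names, not the
storm-hdfs API, and the default values are taken from the example topology
above.

```java
import java.util.Objects;

/**
 * Hypothetical sketch of the proposed pattern: required settings go through
 * the constructor, everything else carries a reasonable default that can be
 * overridden fluently. Not the storm-hdfs API.
 */
public class FileBoltConfig {
    // Required: no sane default exists for these, so the constructor takes them.
    private final String outputDir;
    private final String fieldDelimiter;

    // Defaults matching the example topology's values.
    private int syncEveryTuples = 1000;       // sync filesystem every 1k tuples
    private float rotateEveryMinutes = 1.0f;  // rotate files every minute

    public FileBoltConfig(String outputDir, String fieldDelimiter) {
        this.outputDir = Objects.requireNonNull(outputDir);
        this.fieldDelimiter = Objects.requireNonNull(fieldDelimiter);
    }

    public FileBoltConfig withSyncEvery(int tuples) {
        this.syncEveryTuples = tuples;
        return this;
    }

    public FileBoltConfig withRotateEvery(float minutes) {
        this.rotateEveryMinutes = minutes;
        return this;
    }

    public String outputDir() { return outputDir; }
    public String fieldDelimiter() { return fieldDelimiter; }
    public int syncEveryTuples() { return syncEveryTuples; }
    public float rotateEveryMinutes() { return rotateEveryMinutes; }

    public static void main(String[] args) {
        // Mirrors the proposed one-liner: only overrides what differs.
        FileBoltConfig cfg = new FileBoltConfig("/tmp/out", "\t")
                .withSyncEvery(100);
        System.out.println(cfg.syncEveryTuples()); // prints 100
    }
}
```

With this shape, `new FileBoltConfig(outputDir, "\t")` works out of the box,
and forgetting an optional setting can no longer blow up at run time.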
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)