[
https://issues.apache.org/jira/browse/FLUME-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872800#comment-13872800
]
Hari Shreedharan commented on FLUME-2294:
-----------------------------------------
This patch looks good. I have a few comments:
* There are a bunch of lines longer than 80 chars. Please make sure all lines
stay within the 80-char limit.
* There is at least 1 "FIXME". That probably should be fixed ;)
* Nit: In general, most Flume components put their configuration parameter
names and defaults in a separate class as static finals (see
HBaseSink/FileChannel etc.). Only the HDFS sink, being older and more
difficult to refactor, keeps them in the same class. Also, we usually name
them CONFIG_HDFS_URI etc.
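For example, something along these lines (just a sketch; the class and
constant names here are illustrative, not from the patch):
{code}
// Hypothetical constants class, following the HBaseSink/FileChannel pattern.
public class DatasetSinkConstants {
  public static final String CONFIG_KITE_DATASET_URI = "kite.dataset.uri";
  public static final String CONFIG_KITE_BATCH_SIZE = "kite.batchSize";
  public static final long DEFAULT_BATCH_SIZE = 100L;
}
{code}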
* super.start() should be called at the end of the start method, since this
tells the framework that the sink was started successfully and the framework
will not try to start it again. If the start method throws, the framework
must retry it; but if super.start() has already been called, it will not.
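Something like this (sketch only, not the patch's actual code):
{code}
@Override
public synchronized void start() {
  // Open writers, start timers, etc. Any exception here should
  // propagate so that the framework retries the start.
  super.start();  // called last: marks the sink as successfully started
}
{code}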
* Regarding:
{code}
new ThreadFactoryBuilder().setNameFormat(getName() + "")
{code}
Is the + "" needed? I think the thread should be named something like
getName() + "-timed-roll-thread".
* nit: Scheduling the roll timer with scheduleWithFixedDelay might be better
than using scheduleAtFixedRate.
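i.e., something like (names illustrative):
{code}
// Fixed delay avoids back-to-back executions if a roll runs long.
rollTimer.scheduleWithFixedDelay(rollRunnable,
    rollInterval, rollInterval, TimeUnit.SECONDS);
{code}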
* Looks like the rolling is not proactive. The rolling happens only if another
event arrives after the scheduled delay. This means the file may not be
rolled for quite a while (or even never, if no data is written to that
partition again). This affects data visibility, since other processes may not
see the data until the file is closed (or might skip it, since the file would
still be open for write). We should probably make it more proactive.
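A proactive version could close the writer from the timer thread itself, e.g.
(sketch only; writer and writerLock are illustrative names, not the patch's):
{code}
Runnable rollRunnable = new Runnable() {
  @Override
  public void run() {
    synchronized (writerLock) {   // same lock process() uses around the writer
      if (writer != null) {
        writer.close();           // closing makes the file visible to readers
        writer = null;            // the next event opens a fresh file
      }
    }
  }
};
{code}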
* channel.getTransaction() and transaction.begin() should also be inside the
try, so that a transaction that has been opened is always closed, even if
begin() throws.
* Unfortunately, even transaction.rollback() can throw (for example, in the
file channel when there is no disk space). It needs to be wrapped in a
try-catch with an error message logged in the catch block.
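Roughly this shape for process() (a sketch of the structure only, covering
both of the points above):
{code}
Transaction transaction = null;
try {
  transaction = channel.getTransaction();  // inside try so it is always closed
  transaction.begin();
  // ... take events from the channel and write them to the dataset ...
  transaction.commit();
  return Status.READY;
} catch (Throwable t) {
  if (transaction != null) {
    try {
      transaction.rollback();
    } catch (Exception ex) {
      LOG.error("Transaction rollback failed", ex);  // rollback can throw too
    }
  }
  throw new EventDeliveryException("Failed to process transaction", t);
} finally {
  if (transaction != null) {
    transaction.close();
  }
}
{code}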
* Could you also add a test where the data goes into different directories
using Kite's partitioning mechanisms?
* There seems to be no documentation on how to use the sink and what headers
are required. Could you please add detailed docs and also describe the config
params? Also, since this is a new component, please mark it as experimental
in the docs (see the Twitter source as an example).
Thanks Ryan!
> Add a sink for Kite Datasets
> ----------------------------
>
> Key: FLUME-2294
> URL: https://issues.apache.org/jira/browse/FLUME-2294
> Project: Flume
> Issue Type: New Feature
> Components: Sinks+Sources
> Affects Versions: v1.4.0
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Attachments: 0001-FLUME-2294-Add-sink-for-Kite-Datasets.patch,
> 0002-FLUME-2294-Add-sink-for-Kite-Datasets.patch
>
>
> I'd like to add a Flume sink for Kite (kitesdk.org) Datasets. This is an API
> for working with data in Hadoop, which can be backed by partitioned HDFS
> directories (with support that syncs with Hive) and HBase tables. This sink
> will depend on the Kite library and use it to write.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)