[ https://issues.apache.org/jira/browse/FLUME-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872800#comment-13872800 ]

Hari Shreedharan commented on FLUME-2294:
-----------------------------------------

This patch looks good. I have a few comments:

* A number of lines exceed 80 characters. Please wrap them so all lines stay 
within 80 chars.
* There is at least 1 "FIXME". That probably should be fixed ;)
* Nit: In general, most Flume components put their configuration parameter 
names and defaults in a separate class as static finals (see 
HBaseSink/FileChannel etc.). Only the HDFS sink, being older and more 
difficult to refactor, keeps them in the same class. We also usually name 
them CONFIG_HDFS_URI etc.
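A minimal sketch of that convention (class and constant names here are 
illustrative, not taken from the patch):
{code}
// Hypothetical constants class following the HBaseSink/FileChannel pattern.
public class DatasetSinkConstants {
  // Configuration parameter names.
  public static final String CONFIG_KITE_DATASET_URI = "kite.dataset.uri";
  public static final String CONFIG_KITE_BATCH_SIZE = "kite.batchSize";

  // Defaults.
  public static final long DEFAULT_BATCH_SIZE = 100;
}
{code}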
* super.start() should be called at the end of the start() method, since it 
tells the framework that the sink started successfully and should not be 
started again. If start() throws, the framework must retry it; if 
super.start() were called first, the framework would not.
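Roughly (sketch only; openWriter() and startRollTimer() are placeholder 
helpers, not names from the patch):
{code}
@Override
public synchronized void start() {
  // Acquire resources first; if anything here throws, the framework
  // sees the failure and will retry start().
  this.writer = openWriter();
  this.rollService = startRollTimer();

  // Signal success last. If this ran first and openWriter() threw,
  // the framework would consider the sink started and never retry.
  super.start();
}
{code}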
{code}
new ThreadFactoryBuilder().setNameFormat(getName() + "")
{code}
* Is the + "" needed? I think it should be called something like getName() + 
"-timed-roll-thread" or something like that.
* Nit: scheduling the roll timer with scheduleWithFixedDelay might be better 
than using scheduleAtFixedRate.
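Putting those two suggestions together, something like this (sketch; rollTask 
and rollInterval are placeholders):
{code}
ScheduledExecutorService rollService =
    Executors.newSingleThreadScheduledExecutor(
        new ThreadFactoryBuilder()
            .setNameFormat(getName() + "-timed-roll-thread")
            .build());
// scheduleWithFixedDelay spaces runs by a fixed delay after each
// completion, so a slow roll does not cause bunched-up executions.
rollService.scheduleWithFixedDelay(rollTask, rollInterval, rollInterval,
    TimeUnit.SECONDS);
{code}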
* Looks like the rolling is not proactive: a file is rolled only if another 
event arrives after the scheduled delay. That means a file may not be rolled 
for quite a while (or ever, if no more data is written to that partition). 
This affects data visibility, since other processes cannot see the data until 
the file is closed (or may skip the file because it is still open for write). 
We should probably make the rolling proactive.
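One way to make it proactive (sketch; roll() and writerLock are hypothetical, 
with writerLock shared with the event-writing path):
{code}
Runnable rollTask = new Runnable() {
  @Override
  public void run() {
    // Close the current file on schedule, even if no new events arrive,
    // so downstream readers see the data without waiting for more writes.
    synchronized (writerLock) {
      roll();
    }
  }
};
{code}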
* channel.getTransaction() and transaction.begin() should also be inside the 
try block (since a transaction that may have been opened must be closed).
* Unfortunately, even transaction.rollback() can throw (for example, in the 
file channel when there is no disk space). It needs to be wrapped in a 
try-catch with an error message logged in the catch block.
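The usual pattern looks something like this (sketch; LOG is a placeholder 
logger):
{code}
Transaction transaction = null;
try {
  transaction = channel.getTransaction();
  transaction.begin();
  // ... take events, write them, then commit ...
  transaction.commit();
} catch (Throwable t) {
  if (transaction != null) {
    try {
      transaction.rollback();
    } catch (Exception ex) {
      // rollback() itself can throw (e.g. file channel out of disk
      // space); log it so the original failure is not masked.
      LOG.error("Transaction rollback failed", ex);
    }
  }
  // rethrow or convert t as appropriate ...
} finally {
  if (transaction != null) {
    transaction.close();
  }
}
{code}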
* Could you also add a test where the data goes into different directories 
using Kite's partitioning mechanisms?
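For example, the test could build a strategy with Kite's PartitionStrategy 
builder (sketch; the field name is made up and the exact builder methods may 
differ by Kite version):
{code}
// Hypothetical strategy: events land in year/month/day subdirectories
// derived from a "timestamp" field of the records.
PartitionStrategy strategy = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp")
    .build();
{code}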
* There seems to be no documentation on how to use the sink or which headers 
are required. Could you please add detailed docs and describe the config 
params? Also, since this is a new component, please mark it as experimental 
in the docs (see the Twitter source as an example).

Thanks Ryan!



> Add a sink for Kite Datasets
> ----------------------------
>
>                 Key: FLUME-2294
>                 URL: https://issues.apache.org/jira/browse/FLUME-2294
>             Project: Flume
>          Issue Type: New Feature
>          Components: Sinks+Sources
>    Affects Versions: v1.4.0
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>         Attachments: 0001-FLUME-2294-Add-sink-for-Kite-Datasets.patch, 
> 0002-FLUME-2294-Add-sink-for-Kite-Datasets.patch
>
>
> I'd like to add a flume sink for Kite (kitesdk.org) Datasets. This is an API 
> for working with data in Hadoop, which can be backed by partitioned HDFS 
> directories (with support that syncs with Hive) and HBase tables. This sink 
> will depend on the Kite library and use it to write.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
