> On Sept. 6, 2012, 7:30 a.m., Mike Percy wrote:
> > Hey, thanks for this contribution! I agree with the overall direction of
> > the approach here. However since by doing this, we are exposing these APIs
> > as public APIs, I think now would be a great time to take a good hard look
> > at FlumeFormatter itself...
> >
> > FlumeFormatter is not a very good interface, in my opinion. It is actually
> > two interfaces merged into one... someone would either call the
> > SequenceFile-related stuff, or would call getBytes(). I don't really think
> > that makes much sense and I think getBytes() should be removed from
> > FlumeFormatter. The only class that calls it is HDFSCompressedDataStream,
> > which should really just be using EventSerializer (which is what
> > HDFSDataStream uses).
> >
> > One additional thing we should consider is whether it would be doable to
> > allow a FlumeFormatter (which, sans getBytes(), is pretty targeted for
> > SequenceFiles) to write multiple key/value pairs from a single event. The
> > desire for that feature came up on the list a couple weeks ago, and if we
> > are committing to a public API then I'd like to give it a shot. It's easy
> > with EventSerializer, for example.
> >
> > Any thoughts on this? I think it's acceptable to expose the
> > SequenceFile.Writer in the interface if we need to do that to make this
> > possible. How about changing it to something like this?
> >
> > interface FlumeFormatter {
> > Class getKeyClass();
> > Class getValueClass();
> > void write(Event e, SequenceFile.Writer out);
> >
> > interface Builder {
> > FlumeFormatter build(Context context);
> > }
> > }
> >
Mike,
Thanks for the review. I agree with your suggested interface with
FlumeFormatter. I think it's reasonable to assume that anybody writing their
own FlumeFormatter will know how to handle a SequenceFile.Writer directly. This
gives users a lot of flexibility (e.g. to call sync() to finish a compression
block when they want to, or to call syncFs() in order to ensure their data has
been replicated).
With these changes FlumeFormatter will become dependent on Hadoop classes, so I
guess it should be moved from flume-ng-core to flume-hdfs-sink. I am also
tempted to change the name, as the interface is now SequenceFile-specific and
is now in charge of writing, rather than just extracting a key and value. Maybe
HDFSSeqFileWriter?
I will have a play around and attach another patch, probably next week.
- Chris
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6918/#review11085
-----------------------------------------------------------
On Sept. 5, 2012, 5:51 a.m., Chris Birchall wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6918/
> -----------------------------------------------------------
>
> (Updated Sept. 5, 2012, 5:51 a.m.)
>
>
> Review request for Flume.
>
>
> Description
> -------
>
> This patch allows users to customise the format of HDFS SequenceFiles by
> providing a custom FlumeFormatter implementation.
>
> Currently, the user can set hdfs.writeFormat to either "Text" or "Writable",
> corresponding to HDFSTextFormatter and HDFSWritableFormatter respectively.
>
> With this patch, hdfs.writeFormat can also be set to the full class name of a
> Builder implementation, e.g.:
>
> agent_foo.sinks.hdfs-sink.writeFormat=com.mycompany.flume.MyCustomFormatter$Builder
>
> They can also pass custom configuration params to the builder, e.g.:
> agent_foo.sinks.hdfs-sink.writeFormat.ignoreHeaders=foo,bar
>
> These params will be passed to the Builder's build() method as a Context
> object.
>
> I've tried to be as consistent as possible with the design of
> EventSerializerFactory:
> * Use an enum for the different formatter types, rather than static strings.
> * Use a Builder, rather than constructing a FlumeFormatter directly.
>
>
> This addresses bug FLUME-1100.
> https://issues.apache.org/jira/browse/FLUME-1100
>
>
> Diffs
> -----
>
> flume-ng-core/src/main/java/org/apache/flume/sink/FlumeFormatter.java
> 05d60b7
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java
> 9a76ecb
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSFormatterFactory.java
> 7b47942
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSFormatterType.java
> PRE-CREATION
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSTextFormatter.java
> 5839dbb
>
> flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSWritableFormatter.java
> 9f03389
>
> flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MyCustomFormatter.java
> PRE-CREATION
>
> flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSFormatterFactory.java
> PRE-CREATION
>
> Diff: https://reviews.apache.org/r/6918/diff/
>
>
> Testing
> -------
>
> Unit tests included in patch.
>
> Using a patched build of Flume in an internal project (not in production).
>
>
> Thanks,
>
> Chris Birchall
>
>