[
https://issues.apache.org/jira/browse/FLUME-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13393556#comment-13393556
]
Leslin (Hong Xiang Lin) edited comment on FLUME-1200 at 6/18/12 2:19 PM:
-------------------------------------------------------------------------
Below is the final implementation:
1). If the user sets hdfs.codeC while hdfs.fileType = DataStream, the sink
writes the file as DataStream, so the output has no compression extension such
as .snappy. A warning message is logged to show that the codeC is ignored.
2). A pre-check makes sure that a codec is set when fileType is
CompressedStream.
After careful consideration, the case from my last comment, "(2) if user set
hdfs.codeC while fileType is CompressedStream, but codec class is unavailable",
is a separate problem and should be tracked in an independent JIRA. BTW, there
is another problem with the codec. I will address both in a new JIRA.
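To make points 1) and 2) concrete, here is a minimal sketch of the two checks. It is not the code from the patch: the class CodecPrecheck, the Result holder, and the plain property map are hypothetical names used for illustration only. It also assumes that fileType defaults to SequenceFile and is compared case-insensitively. The stack trace in scenario 1 shows that the real sink uses Guava's Preconditions.checkNotNull; the sketch uses java.util.Objects.requireNonNull instead, which likewise throws a NullPointerException carrying the message.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for the configure-time checks; not the actual HDFSEventSink code. */
final class CodecPrecheck {

    /** Outcome of the check: the codec to use (null means uncompressed) and an optional warning. */
    static final class Result {
        final String codecName;
        final String warning;

        Result(String codecName, String warning) {
            this.codecName = codecName;
            this.warning = warning;
        }
    }

    static Result check(Map<String, String> props) {
        // Assumption: SequenceFile is the default when hdfs.fileType is not set.
        String fileType = props.getOrDefault("hdfs.fileType", "SequenceFile");
        String codecName = props.get("hdfs.codeC");

        if (fileType.equalsIgnoreCase("DataStream")) {
            // Point 1): DataStream is uncompressed, so a configured codec is dropped with a warning.
            if (codecName != null) {
                return new Result(null, "CodeC: " + codecName
                        + " is ignored as fileType: " + fileType + " is uncompressed.");
            }
            return new Result(null, null);
        }

        if (fileType.equalsIgnoreCase("CompressedStream")) {
            // Point 2): CompressedStream without a codec fails fast with a NullPointerException.
            Objects.requireNonNull(codecName,
                    "It's essential to set compress codec when fileType is: " + fileType);
        }

        // Any other combination passes the codec through unchanged (possibly null).
        return new Result(codecName, null);
    }
}
```

Returning the warning instead of logging it keeps the sketch free of any logging dependency; a real sink would pass it to its logger at WARN level.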
I tested the following scenarios:
1. CompressedStream without a codec --> an exception is thrown.
agent.sinks.k1.hdfs.fileType = CompressedStream
#agent.sinks.k1.hdfs.codeC = DefaultCodec
12/06/17 22:46:35 INFO sink.DefaultSinkFactory: Creating instance of sink k1 typeHDFS
12/06/17 22:46:35 ERROR properties.PropertiesFileConfigurationProvider: Failed to load configuration data. Exception follows.
java.lang.NullPointerException: It's essential to set compress codec when fileType is: CompressedStream
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
    at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:221)
2. Works fine, output file with .deflate extension.
agent.sinks.k1.hdfs.fileType = CompressedStream
agent.sinks.k1.hdfs.codeC = DefaultCodec
3. Works fine, output file without a compression extension.
agent.sinks.k1.hdfs.fileType = CompressedStream
agent.sinks.k1.hdfs.codeC = snappyCodec
4. A warning is logged, output file without a compression extension.
agent.sinks.k1.hdfs.fileType = DataStream
#agent.sinks.k1.hdfs.codeC = snappyCodec
12/06/17 23:08:44 INFO snappy.LoadSnappy: Snappy native library loaded
12/06/17 23:08:44 WARN hdfs.HDFSEventSink: CodeC: snappyCodec is ignored as fileType: DataStream is uncompressed. To change fileType if want output compressed.
12/06/17 23:08:44 INFO hdfs.HDFSEventSink: Hadoop Security enabled: false
5. Works fine, output file with .snappy extension.
agent.sinks.k1.hdfs.fileType = SequenceFile
agent.sinks.k1.hdfs.codeC = snappyCodec
6. Works fine, output file without .snappy extension.
agent.sinks.k1.hdfs.fileType = SequenceFile
#agent.sinks.k1.hdfs.codeC = snappyCodec
> HDFSEventSink causes *.snappy file to be created in HDFS even when snappy
> isn't used (due to missing lib)
> ---------------------------------------------------------------------------------------------------------
>
> Key: FLUME-1200
> URL: https://issues.apache.org/jira/browse/FLUME-1200
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.2.0
> Environment: RHEL 6.2 64-bit
> Reporter: Will McQueen
> Assignee: Leslin (Hong Xiang Lin)
> Fix For: v1.2.0
>
> Attachments: FLUME-1200.patch
>
>
> If I use HDFSEventSink and specify the codec to be snappy, then the sink
> writes data to HDFS with the ".snappy" extension... but the content of those
> HDFS files is not in snappy format when the snappy libs aren't found. The log
> files mention this:
> 2012-05-11 19:38:49,868 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2012-05-11 19:38:49,868 WARN snappy.LoadSnappy: Snappy native library not loaded
> ...and I think it should be an error rather than a warning... the sink
> shouldn't write data at all to HDFS if it's not in the format expected by the
> config file (ie, not compressed with snappy). The config file I used is:
> agent.channels = c1
> agent.sources = r1
> agent.sinks = k1
> #
> agent.channels.c1.type = MEMORY
> #
> agent.sources.r1.channels = c1
> agent.sources.r1.type = SEQ
> #
> agent.sinks.k1.channel = c1
> agent.sinks.k1.type = LOGGER
> #
> agent.sinks.k1.channel = c1
> agent.sinks.k1.type = HDFS
> agent.sinks.k1.hdfs.path = hdfs://<host>:<port>:<path>
> agent.sinks.k1.hdfs.fileType = DataStream
> agent.sinks.k1.hdfs.codeC = SnappyCodec