[
https://issues.apache.org/jira/browse/FLUME-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393824#comment-15393824
]
Attila Simon commented on FLUME-2954:
-------------------------------------
After checking each of the log statements, I collected the suspicious ones and
started triaging them. Thanks [~denes] for helping with the latter. To find the
log statements I ran grep -i logger on the whole project; wherever the private
variable name was not logger, I checked that variable individually within those
classes (declaration type is Logger). My recommendation is the following (sorry
for the length). I would like to provide a patch soon, which might help the
discussion further. I think these changes won't lose focus (which is keeping
logging of sensitive data available while requiring clear and explicit changes
to turn it on), but again I hope this will be an open discussion, so every
comment is welcome.
The issues can be grouped into these categories:
# deliberately print out whole configuration at startup as part of the
validation
#* we can either drop this completely or log it in full, controlled by a
command-line option, environment variable, or JVM argument
#* or we can use heuristics to find and filter out private information such as
passwords and keys.
# redundantly print property configuration
#* in LifecycleAware components it is not needed at all, since validation
already logs it.
# log data on error (safe data)
#* since an error is not expected as part of a production workflow, we
should leave these in place. (The data is partially or wholly broken, so it
should be considered garbage anyway.)
# log data in dedicated components LoggerSink
#* keep it there
# log data in non dedicated Source (fail data):
#* since Sources are responsible for converting an InputStream to Events, a
print option is needed. For this I would introduce a new, consistently named
property on Sources to log the raw ByteInputStream. Also trace-log the fact of
Event creation (no data), and remove everything else related to sensitive
data.
# log data in Interceptors, Processors, Handlers
#* remove these statements
# log data in non dedicated Channels (fail data):
#* channels don't change data, so treat them the same as Sinks
# log data in non dedicated Sinks (fail data):
#* remove existing log statements; for debugging, one can configure an
additional MemoryChannel ending in a LoggerSink
# log potentially private info as part of a URL or URI
#* provide a safe toString for URL and URI
# AsyncHBaseSink#641
#* further investigation is needed
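
Two of the items above (heuristic filtering of configuration properties, and a safe toString for URL/URI) could look roughly like the following. This is only a sketch to help the discussion, not the proposed patch: the class name LogRedactor, the key pattern, and the mask string are all assumptions of mine, and the key heuristic will over-match (for example any property containing "key").

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Flume class.
final class LogRedactor {
  // Assumed heuristic: property names that look like they carry secrets.
  private static final Pattern SENSITIVE_KEY =
      Pattern.compile("password|passwd|secret|token|key", Pattern.CASE_INSENSITIVE);
  private static final String MASK = "****";

  private LogRedactor() {}

  /** Returns a sorted copy of the properties with suspicious values masked. */
  static Map<String, String> redact(Map<String, String> props) {
    Map<String, String> out = new TreeMap<>();
    for (Map.Entry<String, String> e : props.entrySet()) {
      boolean sensitive = SENSITIVE_KEY.matcher(e.getKey()).find();
      out.put(e.getKey(), sensitive ? MASK : e.getValue());
    }
    return out;
  }

  /** Renders a URI without user-info, query, or fragment. */
  static String safeToString(URI uri) {
    if (uri.isOpaque()) {
      // Opaque URIs (e.g. jdbc:...) cannot be split into components,
      // so only the scheme is kept.
      return uri.getScheme() + ":" + MASK;
    }
    try {
      // Rebuild from scheme, host, port, and path only. If the authority is
      // not server-based, getHost() is null and the authority is dropped.
      return new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
          uri.getPath(), null, null).toString();
    } catch (URISyntaxException e) {
      return MASK;
    }
  }
}
```

With this sketch, a URI such as http://user:pw@host:8080/path?token=x would be rendered as http://host:8080/path, and a JDBC URL, which java.net.URI parses as opaque, would be reduced to its scheme plus the mask.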
Essentially, LoggerSink would remain the way to log customer data (so enabling
it would be explicit). Besides this, there would be a configuration option
(default false) on Sources (only those that currently log raw data) to log the
raw byte stream to a separately named logger at trace level. Other components
would not log raw data; they may only log that an event passed through. I would
also update the documentation to make it clear that anyone who would like to
see what goes through should use LoggerSink. Configuration should be
logged at validation time during startup.
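
To make the Source-side idea concrete, here is a minimal sketch of the double opt-in: raw bytes are logged only when the component's flag is set and the dedicated logger is at trace level, so raising the root logger alone is not enough. The class name, the logger name, and the flag are hypothetical. Flume itself logs through SLF4J; the sketch uses java.util.logging (with FINEST standing in for trace) only to stay dependency-free.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a Source; not an existing Flume class.
final class RawDataLogging {
  // Assumed name for the dedicated raw-data logger.
  static final String RAW_LOGGER_NAME = "org.apache.flume.rawdata.ExampleSource";

  private static final Logger LOG = Logger.getLogger(RawDataLogging.class.getName());
  private static final Logger RAW = Logger.getLogger(RAW_LOGGER_NAME);

  // Would come from the component's configuration; defaults to false there.
  private final boolean logRawData;

  RawDataLogging(boolean logRawData) {
    this.logRawData = logRawData;
  }

  /** True only if the flag is set AND the dedicated logger is at trace level. */
  boolean shouldLogRaw() {
    return logRawData && RAW.isLoggable(Level.FINEST);
  }

  /** Called once per created event. */
  void onEvent(byte[] body) {
    // The component's own logger records only the fact, never the payload.
    LOG.finest("Event created");
    if (shouldLogRaw()) {
      RAW.finest(() -> "Raw bytes: " + toHex(body));
    }
  }

  /** Lower-case hex rendering of the payload. */
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }
}
```

The point of the separate logger name is that an operator has to touch two places, the component property and that specific logger's level, before any payload reaches the logs.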
{noformat}
--------------------------------------------------------------------------------
flume-ng-auth ---
KerberosAuthenticator#167 <- safe
--------------------------------------------------------------------------------
flume-ng-channel ---
flume-file-channel ---
JCEFileKeyProvider#111 <- safe
Log#335 <- safe
FileChannel#276 #324 <- safe
flume-jdbc-channel ---
JdbcChannelProviderImpl#98 <- fail properties
JdbcChannelProviderImpl#261 #431 <- fail properties: jdbc url might include password
DerbySchemaHandler#584 #770 <- safe
flume-kafka-channel ---
KafkaChannel#230 #253 <- fail properties
KafkaChannel#367 #383 <- safe
KafkaChannel#578 <- safe
flume-spillable-memory-channel ---
SpillableMemoryChannel#420 #425 <- safe
--------------------------------------------------------------------------------
flume-ng-clients ---
--------------------------------------------------------------------------------
flume-ng-configuration ---
FlumeConfiguration#315 #372 <- fail properties
FlumeConfiguration#671 <- safe
FlumeConfiguration#927 <- safe
--------------------------------------------------------------------------------
flume-ng-core ---
SyslogAvroEventSerializer#150 <- fail data: SyslogEvent.message gets logged
SyslogAvroEventSerializer#171 #179 <- safe data: error logs only if date is malformed
GangliaServer#224 #245 <- fail data: although this might be only flume internal data
LifecycleController#56 <- safe
LifecycleSupervisor#212 219 228 231 241 251 258 282 296 188 135 163 169 <- safe
RegexExtractorInterceptor#144 <- safe
AbstractRpcSink#287 <- safe
FailoverSinkProcessor#149 <- safe
LoadBalancingSinkProcessor#131 <- safe
LoggerSink#95 <- fail data: on purpose
AvroSource#347 <- fail data: log whole message
ExecSource#457 <- safe data: if execution has stderr then it will be error logged
MultiportSyslogTCPSource#360 <- fail data: log whole message
MultiportSyslogTCPSource#253 #264 #269 <- safe
PollableSourceRunner#127 <- safe
ChannelProcessor#196 #226 #271 #298 <- safe
BLOBHandler#70 <- fail data: logs http request headers
--------------------------------------------------------------------------------
flume-ng-dist ---
--------------------------------------------------------------------------------
flume-ng-embedded-agent ---
EmbeddedAgent#155 <- fail properties: printing all config
EmbeddedAgent#249 <- safe
--------------------------------------------------------------------------------
flume-ng-legacy-sources ---
--------------------------------------------------------------------------------
flume-ng-node ---
Application.java#100 <- safe
Application.java #107 #117 #127 #148 #175 #186 <- safe
AbstractConfigurationProvider #116 <- safe
--------------------------------------------------------------------------------
flume-ng-sdk ---
LoadBalancingRpcClient#203 <- safe
FailoverRpcClient#268 #280 <- safe
--------------------------------------------------------------------------------
flume-ng-sinks ---
flume-dataset-sink ---
DatasetSink#483 (URI) <- safe, Kite URIs don’t contain sensitive information
flume-hdfs-sink ---
HDFSEventSink#163 #165 <- safe
flume-hive-sink ---
HiveEndPoint has a URI field <- fail properties
  Unfortunately it can contain private data (URI string may contain password),
  and it is excessively logged within this module
  (appears in HiveSink#298 #342 #400 #403 #428,
  HiveWriter#210 #319 #330 #337 #353 #365 #368 #407...).
  HiveEndPoint is also attached to exception logs.
HiveWriter#160 <- safe data: log whole on parse error
flume-irc-sink ---
IRCSink#73 #77 <- safe data: log whole on error
flume-ng-elasticsearch-sink ---
ElasticSearchRestClient#136 <- safe data: only status response
flume-ng-hbase-sink ---
AsyncHBaseSink#641 ?? async callback chain, exception gets logged. further investigation is needed
flume-ng-kafka-sink ---
KafkaSink#179 <- fail data: log whole message
KafkaSink#304 <- fail properties
flume-ng-morphline-solr-sink ---
MorphlineHandlerImpl#132 <- safe data: log whole on process error
BlobHandler#98 #113 <- fail data: log http request headers
MorphlineSink#88 <- safe
MorphlineSink#139 <- fail data: logs event
--------------------------------------------------------------------------------
flume-ng-sources ---
flume-jms-source ---
JMSMessageConsumer#114 <- safe
flume-kafka-source ---
KafkaSource#247 <- fail data: log whole
KafkaSource#392 #416 <- safe
flume-scribe-source ---
flume-taildir-source ---
flume-twitter-source ---
TwitterSource#132 <- safe
TwitterSource#110-113 <- fail properties
--------------------------------------------------------------------------------
flume-ng-tests ---
--------------------------------------------------------------------------------
{noformat}
> make raw data appearing in log messages explicit
> ------------------------------------------------
>
> Key: FLUME-2954
> URL: https://issues.apache.org/jira/browse/FLUME-2954
> Project: Flume
> Issue Type: Improvement
> Components: Channel, Configuration, Sinks+Sources
> Affects Versions: v1.6.0
> Reporter: Attila Simon
> Assignee: Attila Simon
> Priority: Critical
>
> Flume has built-in functionality to log out data flowing through,
> mainly for debugging purposes. This functionality appears in several
> places of the codebase. I think such functionality raises security
> concerns in production environments where sensitive information might
> be ingested, so it is crucial that enabling such functionality is
> as explicit as possible (avoiding setup by implicit side effect).
> E.g., setting the level of the root logger to debug/trace causes every
> other logger to start logging at debug/trace, including the ones
> logging raw data.
> In this jira I would like to provide a patch capturing how I imagined solving
> this issue. It can be refined iteratively or used as a basis for a broader
> discussion.