[ 
https://issues.apache.org/jira/browse/FLUME-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393824#comment-15393824
 ] 

Attila Simon commented on FLUME-2954:
-------------------------------------

After checking each log statement, I collected the suspicious ones and then 
started triaging them. Thanks [~denes] for helping with the latter. To find 
the log statements I ran grep -i logger on the whole project; wherever the 
private variable was not named logger, I checked that variable individually 
within the class (its declared type is Logger). My recommendation is below 
(sorry for the length). I would like to provide a patch soon, which might help 
the discussion further. I think these changes won't lose focus (the goal is to 
keep logging of sensitive data available, while turning it on requires clear 
and explicit changes). Again, I hope this will be an open discussion, so every 
comment is welcome.

The issues can be grouped into these categories:
# deliberately printing the whole configuration at startup as part of 
validation
#* we can decide whether to drop this completely or log it in full, 
controlled by a command line option, environment variable, or JVM argument
#* or we use some heuristics to find and filter out private information 
like passwords and keys.
# redundantly printing property configuration
#* in LifecycleAware components this is not needed at all, since validation 
already logs it.
# log data on error (safe data)
#* since an error is not expected as part of a production workflow, we 
should leave these in place. (The data is partially or wholly broken, so it 
should be considered garbage anyway.)
# log data in dedicated components LoggerSink
#* keep it there
# log data in non-dedicated Sources (fail data):
#* since Sources are responsible for converting an InputStream to Events, a 
print option is needed. For this I would introduce a new, consistently named 
property on Sources to log the raw ByteInputStream. Also trace-log the fact of 
Event creation (no data), and remove every other log statement related to 
sensitive data.
# log data in Interceptors, Processors, Handlers
#* remove these statements
# log data in non-dedicated Channels (fail data):
#* channels don't change data, so treat them identically to Sinks
# log data in non-dedicated Sinks (fail data):
#* remove existing log statements; one can specify an additional MemoryChannel 
ending in a LoggerSink for debugging purposes
# log potentially private info as part of a URL or URI
#* provide a safe toString for URL and URI
# AsyncHBaseSink#641 
#* further investigation is needed
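
As a rough illustration of the heuristic option in the first category, the 
sketch below masks values whose property keys look sensitive before the 
configuration is logged. ConfigMasker, the key fragments, and the mask string 
are assumptions made for this example, not existing Flume code; a real patch 
would need to agree on the list of fragments.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Flume): masks sensitive-looking values. */
final class ConfigMasker {
  // Assumed heuristic: keys containing one of these fragments hold secrets.
  private static final Pattern SENSITIVE_KEY =
      Pattern.compile("password|passwd|secret|token|key", Pattern.CASE_INSENSITIVE);
  private static final String MASK = "********";

  private ConfigMasker() {}

  /** Returns a copy that is safe to log; the input map is left untouched. */
  static Map<String, String> mask(Map<String, String> properties) {
    Map<String, String> masked = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : properties.entrySet()) {
      boolean sensitive = SENSITIVE_KEY.matcher(e.getKey()).find();
      masked.put(e.getKey(), sensitive ? MASK : e.getValue());
    }
    return masked;
  }
}
```

Matching on fragments such as key will also mask harmless entries (for example 
a serializer class name), which seems an acceptable trade-off for a startup 
log.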

Essentially, LoggerSink would remain the way to log customer data (so 
specifying it is explicit). Besides this, there would be a configuration option 
(defaulting to false) on Sources (only those which currently log raw data) to 
log the raw byte stream to a separately named logger at trace level. Other 
components would not log raw data; they may only log that an event passed 
through. I would also update the documentation to make it clear that anyone 
who would like to see what goes through should use LoggerSink. Configuration 
should be logged at validation time during startup.
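
To make the Source-side idea more concrete, here is a minimal sketch of an 
opt-in raw data logger. It uses java.util.logging (with FINEST standing in for 
trace) only to stay self-contained; Flume itself logs through SLF4J. The class 
name RawDataLogger, the logger name org.apache.flume.rawdata, and the boolean 
flag are hypothetical and only illustrate the two-step opt-in.

```java
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical source-side helper; not an existing Flume class. */
final class RawDataLogger {
  // Assumed name of the dedicated logger, separate from the component logger.
  static final String RAW_LOGGER_NAME = "org.apache.flume.rawdata";
  private static final Logger RAW = Logger.getLogger(RAW_LOGGER_NAME);
  private static final Logger LOG = Logger.getLogger(RawDataLogger.class.getName());

  // Would be read from a Source property; the proposal defaults it to false.
  private final boolean logRawData;

  RawDataLogger(boolean logRawData) {
    this.logRawData = logRawData;
  }

  void onEvent(byte[] body) {
    // Component logger: only the fact of event creation, never the payload.
    LOG.finest("event created, " + body.length + " bytes");
    // Payload is logged only if the flag is set AND the raw logger is at trace.
    if (logRawData && RAW.isLoggable(Level.FINEST)) {
      RAW.finest("raw bytes: " + new String(body, StandardCharsets.UTF_8));
    }
  }
}
```

With this shape, lowering the root logger to trace is not enough on its own to 
expose payloads, because the Source property also has to be switched on.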

{noformat}
--------------------------------------------------------------------------------
flume-ng-auth                                 ---
  KerberosAuthenticator#167                   <- safe
--------------------------------------------------------------------------------
flume-ng-channel                              ---
  flume-file-channel                          ---
    JCEFileKeyProvider#111                    <- safe
    Log#335                                   <- safe
    FileChannel#276 #324                      <- safe
  flume-jdbc-channel                          ---
    JdbcChannelProviderImpl#98                <- fail properties
    JdbcChannelProviderImpl#261 #431          <- fail properties: jdbc url might include password
    DerbySchemaHandler#584 #770               <- safe
  flume-kafka-channel                         ---
    KafkaChannel#230 #253                     <- fail properties
    KafkaChannel#367 #383                     <- safe
    KafkaChannel#578                          <- safe
  flume-spillable-memory-channel              ---
    SpillableMemoryChannel#420 #425           <- safe
--------------------------------------------------------------------------------
flume-ng-clients                              ---
--------------------------------------------------------------------------------
flume-ng-configuration                        ---
  FlumeConfiguration#315 #372                 <- fail properties
  FlumeConfiguration#671                      <- safe
  FlumeConfiguration#927                      <- safe
--------------------------------------------------------------------------------
flume-ng-core                                 ---
  SyslogAvroEventSerializer#150               <- fail data: SyslogEvent.message gets logged
  SyslogAvroEventSerializer#171 #179          <- safe data: error logs only if date is malformed
  GangliaServer#224 #245                      <- fail data: although this might be only flume internal data
  LifecycleController#56                      <- safe
  LifecycleSupervisor#212 219 228 231 241 251 258 282 296 188 135 163 169 <- safe
  RegexExtractorInterceptor#144               <- safe
  AbstractRpcSink#287                         <- safe
  FailoverSinkProcessor#149                   <- safe
  LoadBalancingSinkProcessor#131              <- safe
  LoggerSink#95                               <- fail data: on purpose
  AvroSource#347                              <- fail data: log whole message
  ExecSource#457                              <- safe data: if execution has stderr then it will be error logged
  MultiportSyslogTCPSource#360                <- fail data: log whole message
  MultiportSyslogTCPSource#253 #264 #269      <- safe
  PollableSourceRunner#127                    <- safe
  ChannelProcessor#196 #226 #271 #298         <- safe
  BLOBHandler#70                              <- fail data: logs http request headers
--------------------------------------------------------------------------------
flume-ng-dist                                 ---
--------------------------------------------------------------------------------
flume-ng-embedded-agent                       ---
  EmbeddedAgent#155                           <- fail properties: printing all config
  EmbeddedAgent#249                           <- safe
--------------------------------------------------------------------------------
flume-ng-legacy-sources                       ---
--------------------------------------------------------------------------------
flume-ng-node                                     ---
  Application.java#100                            <- safe
  Application.java #107 #117 #127 #148 #175 #186  <- safe
  AbstractConfigurationProvider #116              <- safe
--------------------------------------------------------------------------------
flume-ng-sdk                                  ---
  LoadBalancingRpcClient#203                  <- safe
  FailoverRpcClient#268 #280                  <- safe
--------------------------------------------------------------------------------
flume-ng-sinks                                ---
  flume-dataset-sink                          ---
    DatasetSink#483 (URI)                     <- safe, Kite URIs don’t contain sensitive information
  flume-hdfs-sink                             ---
    HDFSEventSink#163 #165                    <- safe
  flume-hive-sink                             ---
    HiveEndPoint has a URI field.             <- fail properties
        Unfortunately it can contain private data 
        (the URI string may contain a password), and it is 
        logged extensively within this module 
        (appears in HiveSink#298 #342 #400 #403 #428, 
        HiveWriter#210 #319 #330 #337 #353 #365 #368 #407...). 
        HiveEndPoint is also attached to exception logs.
    HiveWriter#160                            <- safe data: log whole on parse error
  flume-irc-sink                              ---
    IRCSink#73 #77                            <- safe data: log whole on error
  flume-ng-elasticsearch-sink                 ---
    ElasticSearchRestClient#136               <- safe data: only status response
  flume-ng-hbase-sink                         ---
    AsyncHBaseSink#641                        ?? async callback chain, exception gets logged. further investigation is needed
  flume-ng-kafka-sink                         ---
    KafkaSink#179                             <- fail data: log whole message
    KafkaSink#304                             <- fail properties
  flume-ng-morphline-solr-sink                ---
    MorphlineHandlerImpl#132                  <- safe data: log whole on process error
    BlobHandler#98 #113                       <- fail data: log http request headers
    MorphlineSink#88                          <- safe
    MorphlineSink#139                         <- fail data: logs event
--------------------------------------------------------------------------------
flume-ng-sources                              ---
  flume-jms-source                            ---
    JMSMessageConsumer#114                    <- safe
  flume-kafka-source                          ---
    KafkaSource#247                           <- fail data: log whole
    KafkaSource#392 #416                      <- safe
  flume-scribe-source                         ---
  flume-taildir-source                        ---
  flume-twitter-source                        ---
    TwitterSource#132                         <- safe
    TwitterSource#110-113                     <- fail properties
--------------------------------------------------------------------------------
flume-ng-tests                                ---
--------------------------------------------------------------------------------
{noformat}
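
Several entries above flag a URL or URI that may embed credentials (the jdbc 
url, the HiveEndPoint URI). One possible shape for the "safe toString" is 
sketched below using only java.net.URI. SafeUri is a hypothetical helper, and 
dropping the query and fragment wholesale and redacting opaque URIs such as 
jdbc: URLs are assumptions of this example rather than agreed behaviour.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper (not part of Flume): renders a URI without secrets. */
final class SafeUri {
  private SafeUri() {}

  /** Keeps scheme, host, port and path; drops user-info, query and fragment. */
  static String toSafeString(URI uri) {
    if (uri.isOpaque()) {
      // e.g. jdbc:derby:...;password=... has no parsable structure to filter.
      return uri.getScheme() + ":<redacted>";
    }
    try {
      return new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
          uri.getPath(), null, null).toString();
    } catch (URISyntaxException e) {
      // Never fall back to the original string, which may hold the secret.
      return "<unprintable URI>";
    }
  }
}
```

Log statements would then print SafeUri.toSafeString(uri) instead of the URI 
itself, for example when HiveEndPoint is attached to an exception message.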

> make raw data appearing in log messages explicit
> ------------------------------------------------
>
>                 Key: FLUME-2954
>                 URL: https://issues.apache.org/jira/browse/FLUME-2954
>             Project: Flume
>          Issue Type: Improvement
>          Components: Channel, Configuration, Sinks+Sources
>    Affects Versions: v1.6.0
>            Reporter: Attila Simon
>            Assignee: Attila Simon
>            Priority: Critical
>
> Flume has built-in functionality to log the data flowing through it,
> mainly for debugging purposes. This functionality appears in several
> places of the codebase. I think such functionality raises security
> concerns in production environments where sensitive information might
> be ingested, so it is crucial that enabling it is
> as explicit as possible (avoiding implicit side-effect setup).
> E.g.: setting the level of the root logger to debug/trace causes every
> other logger to start logging at debug/trace, including the ones
> logging raw data.
> In this jira I would like to provide a patch capturing how I imagine solving 
> this issue. It can be refined iteratively or used as a basis for a broader 
> discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
