[jira] [Updated] (OOZIE-3249) [tools] Instrumentation log parser

Andras Piros (JIRA) Thu, 10 May 2018 02:47:23 -0700

     [ https://issues.apache.org/jira/browse/OOZIE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andras Piros updated OOZIE-3249:
--------------------------------
    Description: 
Oozie instrumentation logs contain a lot of information but are difficult to 
parse: each instrumentation log entry consists of one header line in plain 
text (which carries the timestamp) followed by multiple lines of JSON (which 
do not). The header line and its JSON lines belong together and have to be 
parsed as one entry.
{noformat}
2018-05-02 02:48:13,426  INFO oozieinstrumentation:520 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] 
{
...
  "counters" : {
...
    "callablequeue.executed" : {
      "count" : 5954144
    },
...
    "callablequeue.queued" : {
      "count" : 10596129
    },
...
  },
...
}
{noformat}

There should be a simple script in {{tools/bin}} that takes as parameters:
* input file name ({{-i}}), e.g. {{-i /path/to/oozie-instrumentation.log}}
* output file name ({{-o}}), e.g. {{-o /path/to/oozie-instrumentation.log.out}}
* parameters to extract ({{-p}}) in the format of 
{{path/to/json/value1,path/to/json/value2}}, in this case 
{{-p counters/callablequeue.executed/count,counters/callablequeue.queued/count}}
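
A minimal sketch of how one {{-p}} path could be resolved against a parsed JSON entry. The helper name {{extract_value}} is hypothetical and not taken from any attached patch. It assumes that {{/}} is the only path separator, so a key that itself contains dots, such as {{callablequeue.executed}}, is addressed as a single segment:

```python
import json


def extract_value(document, path):
    """Walk a parsed JSON document along a '/'-separated path.

    Returns None when any segment of the path is missing.
    """
    node = document
    for segment in path.split("/"):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


# Sample entry reduced to the one counter shown in the log excerpt above.
sample = json.loads(
    '{"counters": {"callablequeue.executed": {"count": 5954144}}}'
)

# The dotted key is one path segment, so this resolves to 5954144.
print(extract_value(sample, "counters/callablequeue.executed/count"))

# Splitting the dotted key into two segments does not resolve (None).
print(extract_value(sample, "counters/callablequeue/executed/count"))
```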

The output file should be in CSV format and contain:
* a header line containing the column names
* one line per parsed log entry (header line plus its JSON lines), containing:
** first cell: the minutes part of the timestamp
** subsequent cells: the JSON value extracted for each parameter
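
The input and output described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: the names {{HEADER_RE}}, {{lookup}}, {{parse_entries}} and {{write_csv}} are hypothetical, the header line is assumed to sit on a single line starting with the timestamp, "minutes part" is interpreted as the timestamp truncated to minute precision, and the first column name {{time}} is made up. Command-line handling of {{-i}}, {{-o}} and {{-p}} is left out.

```python
import csv
import io
import json
import re

# Assumed header shape: "YYYY-MM-DD HH:MM:SS,mmm  INFO oozieinstrumentation:..."
# Group 1 keeps the timestamp truncated to the minute.
HEADER_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\s")


def lookup(node, path):
    """Resolve a '/'-separated path; return '' when a segment is missing."""
    for segment in path.split("/"):
        if not isinstance(node, dict) or segment not in node:
            return ""
        node = node[segment]
    return node


def parse_entries(lines):
    """Group each header line with the JSON lines that follow it."""
    entries = []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            entries.append((match.group(1), []))
        elif entries:
            entries[-1][1].append(line)
    for stamp, body in entries:
        text = "".join(body).strip()
        if text:  # skip headers that have no JSON body
            yield stamp, json.loads(text)


def write_csv(lines, paths, out):
    """Write one CSV header row, then one row per parsed log entry."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time"] + paths)
    for stamp, doc in parse_entries(lines):
        writer.writerow([stamp] + [lookup(doc, p) for p in paths])


sample = io.StringIO(
    "2018-05-02 02:48:13,426  INFO oozieinstrumentation:520 - "
    "USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] \n"
    "{\n"
    '  "counters" : {\n'
    '    "callablequeue.executed" : { "count" : 5954144 },\n'
    '    "callablequeue.queued" : { "count" : 10596129 }\n'
    "  }\n"
    "}\n"
)
paths = [
    "counters/callablequeue.executed/count",
    "counters/callablequeue.queued/count",
]
out = io.StringIO()
write_csv(sample, paths, out)
print(out.getvalue())
```

The sketch collects all entries in memory before writing, which keeps it short; a real script for large log files would likely stream entries instead.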

  was:
Oozie instrumentation logs contain a lot of information, but are difficult to 
parse, because per instrumentation log entry there is always one header line in 
plain text format (containing timestamp), and multiple other lines in JSON 
format (not containing timestamp). Those lines of course belong together.
{noformat}
2018-05-02 02:48:13,426  INFO oozieinstrumentation:520 - USER[-] GROUP[-] 
TOKEN[-] APP[-] JOB[-] ACTION[-] 
{
...
  "counters" : {
...
    "callablequeue.executed" : {
      "count" : 5954144
    },
...
    "callablequeue.queued" : {
      "count" : 10596129
    },
...
  },
...
}
{noformat}

There should be a simple script in {{tools/bin}} that takes as parameters:
* input file name ({{-i}}), e.g. {{-i /path/to/oozie-instrumentation.log}}
* output file name ({{-o}}), e.g. {{-o /path/to/oozie-instrumentation.log.out}}
* parameters to extract ({{-p}}) in the format of 
{{path/to/json/value1,path/to/json/value2}}, in this case {{-p 
counters/callablequeue/executed/count,counters/callablequeue/queued/count}}

The output file should contain in CSV format:
* a header line containing column names for
* one line per parsed input header / JSON lines, containing:
** first cell is the minutes part of the timestamp
** consecutive cells are parsed JSON values given each parameter to extract


> [tools] Instrumentation log parser
> ----------------------------------
>
>                 Key: OOZIE-3249
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3249
>             Project: Oozie
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 5.0.0
>            Reporter: Andras Piros
>            Assignee: Andras Piros
>            Priority: Major
>         Attachments: OOZIE-3249.001.patch, 
> oozie-instrumentation-localhost.log.2018-05-09, 
> oozie-instrumentation-localhost.log.2018-05-09.out
>
>
> Oozie instrumentation logs contain a lot of information but are difficult to 
> parse: each instrumentation log entry consists of one header line in plain 
> text (which carries the timestamp) followed by multiple lines of JSON (which 
> do not). The header line and its JSON lines belong together and have to be 
> parsed as one entry.
> {noformat}
> 2018-05-02 02:48:13,426  INFO oozieinstrumentation:520 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] 
> {
> ...
>   "counters" : {
> ...
>     "callablequeue.executed" : {
>       "count" : 5954144
>     },
> ...
>     "callablequeue.queued" : {
>       "count" : 10596129
>     },
> ...
>   },
> ...
> }
> {noformat}
> There should be a simple script in {{tools/bin}} that takes as parameters:
> * input file name ({{-i}}), e.g. {{-i /path/to/oozie-instrumentation.log}}
> * output file name ({{-o}}), e.g. {{-o /path/to/oozie-instrumentation.log.out}}
> * parameters to extract ({{-p}}) in the format of 
> {{path/to/json/value1,path/to/json/value2}}, in this case 
> {{-p counters/callablequeue.executed/count,counters/callablequeue.queued/count}}
> The output file should be in CSV format and contain:
> * a header line containing the column names
> * one line per parsed log entry (header line plus its JSON lines), containing:
> ** first cell: the minutes part of the timestamp
> ** subsequent cells: the JSON value extracted for each parameter



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
