[
https://issues.apache.org/jira/browse/METRON-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643644#comment-16643644
]
ASF GitHub Bot commented on METRON-1809:
----------------------------------------
Github user justinleet commented on a diff in the pull request:
https://github.com/apache/metron/pull/1229#discussion_r223754681
--- Diff: metron-analytics/metron-profiler-spark/README.md ---
@@ -265,6 +290,18 @@ The path to the input data read by the Batch Profiler.
The format of the input data read by the Batch Profiler.
+### `profiler.batch.input.reader`
--- End diff --
The main thing I'm getting at here is what happens if instead of
```
profiler.batch.input.reader=COLUMNAR
profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
```
I say
```
profiler.batch.input.reader=TEXT
profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
```
Correct me if I'm wrong, but I believe it'll instantiate a
`TextEncodedTelemetryReader` instead of a `ColumnEncodedTelemetryReader`, then
fail to read the file.
As it stands, this is a very easy misconfiguration to make. Would it be
reasonable to keep both fields, but let you shortcut known formats
(e.g. ORC and Parquet)? Or log a warning that a known misconfiguration was
detected and then proceed with the COLUMNAR option anyway?
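
To make the two options concrete, here is a rough sketch in plain Java. Everything in it is hypothetical rather than taken from the PR: the `ReaderResolver` class, the `resolveReader` method, the set of "known columnar" format names, and the fallback format of `text` are all assumptions for illustration. It infers the reader when none is set, and warns and overrides when a known columnar format is paired with a different reader.

```java
import java.util.Locale;
import java.util.Properties;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical helper, not part of Metron: decides which reader to use.
class ReaderResolver {
  private static final Logger LOG = Logger.getLogger(ReaderResolver.class.getName());

  // Assumption: format names treated as "known columnar" for this sketch.
  private static final Set<String> COLUMNAR_FORMATS = Set.of(
      "orc",
      "parquet",
      "org.apache.spark.sql.execution.datasources.orc");

  static boolean isKnownColumnar(String format) {
    return COLUMNAR_FORMATS.contains(format.trim().toLowerCase(Locale.ROOT));
  }

  // Returns the reader name to use, e.g. "TEXT" or "COLUMNAR".
  static String resolveReader(Properties props) {
    // Assumption: fall back to "text" when no format is configured.
    String format = props.getProperty("profiler.batch.input.format", "text");
    String reader = props.getProperty("profiler.batch.input.reader");
    boolean columnar = isKnownColumnar(format);

    if (reader == null) {
      // Shortcut: infer the reader from a known format.
      return columnar ? "COLUMNAR" : "TEXT";
    }

    String normalized = reader.trim().toUpperCase(Locale.ROOT);
    if (columnar && !"COLUMNAR".equals(normalized)) {
      // Known misconfiguration: warn, then proceed with COLUMNAR anyway.
      LOG.warning("Input format '" + format + "' is columnar but reader is '"
          + normalized + "'; using COLUMNAR instead.");
      return "COLUMNAR";
    }
    return normalized;
  }
}
```

With the second configuration above (`TEXT` plus the ORC format), `resolveReader` would log a warning and return `COLUMNAR` rather than failing later when the file is read. Formats not in the known set are left alone, so an explicit reader setting still wins there.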
> Support Column Oriented Input with Batch Profiler
> -------------------------------------------------
>
> Key: METRON-1809
> URL: https://issues.apache.org/jira/browse/METRON-1809
> Project: Metron
> Issue Type: Bug
> Reporter: Nick Allen
> Assignee: Nick Allen
> Priority: Major
>
> The Batch Profiler currently only accepts input formats that can be directly
> serialized to JSON. This should be enhanced to accept a wider variety of
> input formats.