GitHub user justinleet commented on a diff in the pull request:
https://github.com/apache/metron/pull/1229#discussion_r223754681
--- Diff: metron-analytics/metron-profiler-spark/README.md ---
@@ -265,6 +290,18 @@ The path to the input data read by the Batch Profiler.
The format of the input data read by the Batch Profiler.
+### `profiler.batch.input.reader`
--- End diff --
The main thing I'm getting at here is what happens if instead of
```
profiler.batch.input.reader=COLUMNAR
profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
```
I say
```
profiler.batch.input.reader=TEXT
profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
```
Correct me if I'm wrong, but I believe it'll instantiate a
`TextEncodedTelemetryReader` instead of a `ColumnEncodedTelemetryReader`, then
fail to read the file.
This is a very easy misconfiguration to make as it stands right now. Would it
be reasonable to keep both fields, but let users shortcut known formats
(e.g. ORC and Parquet)? Or log a warning that a known misconfiguration was
detected and then proceed with the COLUMNAR option anyway?
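
To make the second option concrete, here is a rough sketch of what the check could look like. Everything in it is hypothetical and not taken from the PR: the `ReaderConfigCheck` class, the `resolveReader` method, and the assumption that a format whose name ends in `orc` or `parquet` is columnar. Only the `TEXT` and `COLUMNAR` reader names and the ORC format string come from the examples above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.logging.Logger;

// Hypothetical helper; not part of the Batch Profiler as it exists in the PR.
final class ReaderConfigCheck {
    private static final Logger LOG =
        Logger.getLogger(ReaderConfigCheck.class.getName());

    // Assumption: a format name ending in one of these suffixes is columnar.
    private static final List<String> COLUMNAR_SUFFIXES =
        Arrays.asList("orc", "parquet");

    // Returns true if the configured format looks like a known columnar format.
    public static boolean isKnownColumnar(String format) {
        if (format == null) {
            return false;
        }
        String normalized = format.trim().toLowerCase(Locale.ROOT);
        for (String suffix : COLUMNAR_SUFFIXES) {
            if (normalized.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Returns the reader name to use. If the format is a known columnar
    // format but a different reader was requested, logs a warning and
    // falls back to COLUMNAR; otherwise returns the requested reader,
    // trimmed and upper-cased.
    public static String resolveReader(String reader, String format) {
        String requested =
            reader == null ? "" : reader.trim().toUpperCase(Locale.ROOT);
        if (isKnownColumnar(format) && !"COLUMNAR".equals(requested)) {
            LOG.warning("profiler.batch.input.reader=" + requested
                + " does not match columnar format '" + format
                + "'; using COLUMNAR instead");
            return "COLUMNAR";
        }
        return requested;
    }
}
```

With this, the `TEXT` plus ORC combination above would resolve to `COLUMNAR` with a warning, while `TEXT` with a non-columnar format would be left alone. Whether silently overriding the user's setting is preferable to failing fast is a separate question.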
---