[GitHub] spark pull request #21878: [SPARK-24924][SQL] Add mapping for built-in Avro ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21878

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205365847

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Thanks @gengliangwang for clarifying this.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205360960

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Looks like the same thing could already happen with Kafka too.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205360211

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Ah, I see. Okay. Then how about this: I assume the Avro documentation will be added to Spark for 2.4.0 soon. Once it's done, we add a message here for Avro along the lines of "please see the documentation on how to use Spark's Avro package"?
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205353836

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

I totally agree with the mapping; we should do it. My comment here is that when Spark can't find any Avro package, we should show a message about loading the spark-avro jar (org.apache.spark.sql.avro). Unlike CSV, the spark-avro package is not loaded by default within Spark (at least when I tried spark-shell).
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205350947

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Or is the Avro jar meant to be distributed separately? I thought it'd be included within Spark.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205350125

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Eh, if users were using the external Avro package, they will likely hit this error when they upgrade Spark directly. Otherwise, users will see in the release notes that the Avro package is included in 2.4.0, and they will not provide this jar. If users miss the release notes, they will try to explicitly provide the third-party jar, which will give the error message above. FWIW, if a fully qualified path is used, the third-party jar will still be used in theory. Did I misunderstand or miss something, maybe?
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205348403

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

No, I mean that by default the Avro package is not loaded. E.g. if we start spark-shell without loading the jar, it will show the error "Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html".
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205346891

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Actually, I think it would be okay. If users provide `avro`, it will show a warning like:

```
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), defaulting to the internal datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
```

in most cases (see https://github.com/apache/spark/pull/17916).
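The "multiple sources found" warning quoted above reflects Spark preferring its internal implementation when an external jar registers the same short name. A minimal, self-contained sketch of that disambiguation rule (`pickDataSource` is a hypothetical simplification of the real logic in `DataSource.lookupDataSource`; only the class names in the warning come from the thread):

```scala
object SourceDisambiguation {
  // Internal Spark file-format sources live under this package prefix.
  private val internalPrefix = "org.apache.spark.sql.execution.datasources."

  // Hypothetical simplification: when several classes register the same
  // short name, prefer the single internal one and warn; otherwise it is
  // genuinely ambiguous and we fail.
  def pickDataSource(shortName: String, candidates: Seq[String]): String =
    candidates match {
      case Seq(single) => single
      case multiple =>
        multiple.filter(_.startsWith(internalPrefix)) match {
          case Seq(internal) =>
            println(s"WARN: Multiple sources found for $shortName " +
              s"(${multiple.mkString(", ")}), defaulting to the internal " +
              s"datasource ($internal).")
            internal
          case _ =>
            throw new IllegalArgumentException(
              s"Multiple sources found for $shortName")
        }
    }
}
```

Under this sketch, the CSV case from the warning above resolves to the internal `CSVFileFormat` while still surfacing the conflict to the user.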
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21878#discussion_r205345025

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---

```diff
@@ -635,12 +637,6 @@ object DataSource extends Logging {
           "Hive built-in ORC data source must be used with Hive support enabled. " +
             "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " +
             "'native'")
-      } else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
-        provider1 == "com.databricks.spark.avro") {
-        throw new AnalysisException(
```

--- End diff ---

Should we show a message to users about loading the built-in spark-avro jar?
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/21878

[SPARK-24924][SQL] Add mapping for built-in Avro data source

## What changes were proposed in this pull request?

This PR aims at the following:
1. Like the `com.databricks.spark.csv` mapping, we had better map `com.databricks.spark.avro` to the built-in Avro data source.
2. Remove the incorrect error message, `Please find an Avro package at ...`.

## How was this patch tested?

Pass the newly added tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-24924

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21878.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21878

commit d95ba4081ac1188515b7e6363640700d56f2c93f
Author: Dongjoon Hyun
Date: 2018-07-25T22:51:56Z

    [SPARK-24924][SQL] Add mapping for built-in Avro data source
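The mapping proposed above works like Spark's existing backward-compatibility aliases for provider names. A minimal, self-contained sketch of the idea (the map is modeled on `DataSource.backwardCompatibilityMap`; `lookupProvider` is a hypothetical simplification of the real method, which also scans the classpath for registered sources):

```scala
import java.util.Locale

object DataSourceAliases {
  // Simplified alias table: legacy third-party provider names map to
  // the built-in implementations, mirroring the existing CSV mapping.
  private val backwardCompatibilityMap: Map[String, String] = Map(
    "com.databricks.spark.csv" ->
      "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "com.databricks.spark.avro" ->
      "org.apache.spark.sql.avro.AvroFileFormat"
  )

  // Hypothetical simplification: resolve a provider name to a class name.
  // Short names like "avro" and legacy names both land on the built-in
  // source; anything else falls through unchanged.
  def lookupProvider(provider: String): String = {
    val normalized = provider.toLowerCase(Locale.ROOT) match {
      case "avro" => "org.apache.spark.sql.avro.AvroFileFormat"
      case _      => provider
    }
    backwardCompatibilityMap.getOrElse(provider, normalized)
  }
}
```

With this alias in place, `spark.read.format("com.databricks.spark.avro")` and `format("avro")` would both resolve to the built-in data source, so the old `Please find an Avro package at ...` error branch becomes unnecessary.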