[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568348#comment-16568348 ]
Hyukjin Kwon edited comment on SPARK-24924 at 8/3/18 3:29 PM: -------------------------------------------------------------- {quote} but at the same time we aren't adding the spark.read.avro syntax so it break in that case or they get a different implementation by default? {quote} If users call this, it will still use the built-in implementation (https://github.com/databricks/spark-avro/blob/branch-4.0/src/main/scala/com/databricks/spark/avro/package.scala#L26), since it is just a short name for {{format("com.databricks.spark.avro")}}. {quote} our internal implementation which could very well be different. {quote} It shouldn't be very different for 2.4.0. It could diverge later, but I expect incremental improvements without behaviour changes. {quote} I would rather just plain error out saying these conflict, either update or change your external package to use a different name. {quote} IIRC, we did this for the CSV data source in the past, and many users complained about it: {code} java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. {code} In practice, I am fairly confident in the current approach: users complained about that error a lot, and so far I have not seen complaints about the current behaviour. {quote} There is also the case one might be able to argue its breaking api compatilibity since .avro option went away, buts it a third party library so you can probably get away with that. {quote} It went away, but if the external jar is still provided, its implicit import should keep working as usual and, in theory, delegate to the internal implementation. If the jar is not given, the .avro API is not supported and the internal implementation will be used.
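The resolution rule discussed above (prefer the built-in source instead of raising the old "Multiple sources found" error) can be sketched as follows. This is an illustrative simplification, not Spark's actual lookup code (which lives in {{DataSource.lookupDataSource}}); the prefix check and function names here are assumptions for the sketch.

```python
# Hypothetical marker for providers that ship with Spark itself.
INTERNAL_PREFIX = "org.apache.spark.sql."


def resolve_provider(short_name, matches):
    """Pick a single provider class for a short name like 'csv' or 'avro'.

    `matches` is the list of fully qualified provider classes that
    registered themselves under `short_name`.
    """
    if len(matches) == 1:
        return matches[0]
    internal = [m for m in matches if m.startswith(INTERNAL_PREFIX)]
    if len(internal) == 1:
        # A built-in and an external provider collide: silently prefer
        # the built-in one rather than failing, per the discussion above.
        return internal[0]
    # No built-in to fall back on: surface the old ambiguity error.
    raise RuntimeError(
        "Multiple sources found for %s (%s), please specify the fully "
        "qualified class name." % (short_name, ", ".join(matches)))
```

With this rule, {{format("avro")}} resolves to the internal class even when the Databricks jar is on the classpath, while a clash between two external providers still errors out.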
> Add mapping for built-in Avro data source
> -----------------------------------------
>
>                 Key: SPARK-24924
>                 URL: https://issues.apache.org/jira/browse/SPARK-24924
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.4.0
>
> This issue aims to do the following:
> # Like the existing `com.databricks.spark.csv` mapping, map `com.databricks.spark.avro` to the built-in Avro data source.
> # Remove the now-incorrect error message, `Please find an Avro package at ...`.
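The legacy-name redirect the issue proposes can be sketched as a simple lookup table, modeled on the existing `com.databricks.spark.csv` mapping. The table contents below are illustrative only; Spark's real map is maintained inside {{DataSource.lookupDataSource}}, and the target class names are assumptions of this sketch.

```python
# Illustrative backward-compatibility table: legacy external provider
# names redirect to the built-in implementations.
BACKWARD_COMPAT_MAP = {
    "com.databricks.spark.csv":
        "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "com.databricks.spark.avro":
        "org.apache.spark.sql.avro.AvroFileFormat",
}


def map_legacy_name(provider):
    # Unknown names pass through unchanged and are resolved normally.
    return BACKWARD_COMPAT_MAP.get(provider, provider)
```

Under this scheme, {{format("com.databricks.spark.avro")}} transparently resolves to the built-in Avro source, so existing pipelines keep working without code changes.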