[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

Hyukjin Kwon (JIRA) Fri, 03 Aug 2018 22:42:44 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569071#comment-16569071
 ]


Hyukjin Kwon commented on SPARK-24924:
--------------------------------------

If it already throws an error in the CSV case too, I would of course prefer to 
have the improved error message.

{quote}
I don't buy this argument, the code has been restructured a lot and you could 
have introduced bugs, behavior changes, etc.
{quote}

I have followed the Avro changes and I don't think there are big 
differences. We should preserve the existing behaviours, in particular within 
2.4.0. If I missed something and this introduced a bug or a behaviour change, I 
personally think we should fix it within 2.4.0. That was one of the key things 
I took into account when I merged some of the changes.

{quote}
Users could have also made their own modified version of the databricks 
spark-avro package (which we actually have to support primitive types) and thus 
the implementation is not the same and yet you are assuming it is.  
{quote}

In this case, users should provide their own short name for the package. I 
would say it is discouraged to use the same name as one of Spark's builtin 
datasources, or a name reserved by another package. I wonder whether users 
would actually try to use the same name in practice.

{quote}
 I'm worried about other users who didn't happen to see this jira.
{quote}

We will mention this in the release notes. I think I listed the possible 
stories about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708

{quote}
I also realize these are 3rd party packages but I think we are making the 
assumption here based on this being a databricks package, which in my opinion 
we shouldn't.   What if this was companyX package which we didn't know about, 
what would/should be the expected behavior? 
{quote}

I think the main reason for this is that the code is actually ported from the 
Avro package {{com.databricks.*}}. The worry here is that {{com.databricks.*}} 
would indicate the builtin Avro, right? 

{quote}
How many users complained about the csv thing? 
{quote}

So far, I see some issues as below:

https://github.com/databricks/spark-csv/issues/367
https://github.com/databricks/spark-csv/issues/373
https://github.com/apache/spark/pull/17916#issuecomment-301898567

For clarification, it's not related to me in any way, but I thought we had 
better keep it consistent with CSV's.
To sum up, I get your position, but I think the current approach makes a 
coherent point too. In that case, I think we had better follow what we have 
done with CSV.



> Add mapping for built-in Avro data source
> -----------------------------------------
>
>                 Key: SPARK-24924
>                 URL: https://issues.apache.org/jira/browse/SPARK-24924
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.
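
For illustration only, the kind of mapping described in the first item above 
can be sketched as a plain lookup table. This is a minimal sketch, not Spark's 
actual implementation: the names LEGACY_PROVIDER_MAP and resolve_provider are 
hypothetical, and the only facts taken from this thread are the two 
com.databricks.spark.* provider names and the built-in sources they map to.

```python
# Hypothetical sketch of a backward-compatibility mapping for data source
# names. None of these identifiers come from Spark's code base.

# Legacy third-party provider names mapped to built-in short names.
LEGACY_PROVIDER_MAP = {
    "com.databricks.spark.csv": "csv",
    "com.databricks.spark.avro": "avro",
}


def resolve_provider(name: str) -> str:
    """Return the built-in short name for a known legacy provider.

    Any name that is not in the map is returned unchanged, so a package
    from another vendor keeps resolving to itself.
    """
    return LEGACY_PROVIDER_MAP.get(name, name)


if __name__ == "__main__":
    for provider in ("com.databricks.spark.avro", "com.companyx.avro"):
        print(provider, "->", resolve_provider(provider))
```

With a table like this, only the listed legacy names are redirected. A 
modified or third-party package published under a different name is left 
alone, which is the case raised in the quoted "companyX" question.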


