[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

Thomas Graves (JIRA) Mon, 06 Aug 2018 06:48:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570220#comment-16570220
 ]


Thomas Graves commented on SPARK-24924:
---------------------------------------

{quote}I have followed the changes in Avro and I don't think there are big 
differences. We should keep the behaviours in particular within 2.4.0. If I 
missed some and this introduced a bug or behaviour changes, I personally think 
we should fix them within 2.4.0. That was one of key things I took into account 
when I merged some changes.
{quote}
Sorry, I didn't mean to claim that anyone introduced bugs by merging this in.
{quote}In this case, users should provide their own short name of the package. 
I would say it's discouraged to use the same name with Spark's builtin 
datasources, or other packages name reserved - I wonder if users would actually 
try to have the same name in practice.
{quote}
I disagree with this. It's already a third-party package, not named 
org.apache.spark, and they are providing their own short name, which worked 
before this change. It's one thing to reference just "avro", but when they 
spell out the entire com.databricks name we should not be remapping it.
{quote}We will make this in the release note - I think I listed up the possible 
stories about this in 
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708
{quote}
Yes, but like anything else, that requires users to read it. Many users just 
get new versions deployed on their cluster, and if their jobs continue to run 
they don't notice or pay attention.
{quote}I also realize these are 3rd party packages but I think we are making 
the assumption here based on this being a databricks package, which in my 
opinion we shouldn't. What if this was companyX package which we didn't know 
about, what would/should be the expected behavior?

I think the main reason for this is that the code is actually ported from Avro 
{{com.databricks.*}}. The problem here is a worry that {{com.databricks.*}} 
indicates the builtin Avro, right?
{quote}
Yes, personally I don't think we should be remapping any third-party libraries 
to Apache Spark. In my opinion this case is even worse: we don't support 
spark.read.avro, but it happens to work if you include the Databricks package. 
Even then it doesn't really call into the Databricks code; it calls into the 
Spark code. If I remove the Databricks jar, spark.read.avro stops working. 
Really confusing to users IMHO.
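
To make the concern concrete, here is a rough sketch of name-based remapping. 
It is a simplified model in plain Python, not Spark's actual lookup code: the 
function name, the classpath set, and the built-in class name are all 
hypothetical, and it assumes the built-in source is always available.

```python
# Simplified model of provider remapping; NOT Spark's real implementation.
# BUILTIN_AVRO is a placeholder name invented for this illustration.
BUILTIN_AVRO = "builtin.AvroFileFormat"

# Legacy third-party name -> built-in implementation (the mapping under debate).
REMAP = {"com.databricks.spark.avro": BUILTIN_AVRO}


def resolve_with_remap(provider, classpath):
    """Return the implementation that ends up serving `provider`."""
    # The remap is applied before the classpath is consulted.
    target = REMAP.get(provider, provider)
    if target == BUILTIN_AVRO or target in classpath:
        return target
    raise LookupError(f"Failed to find data source: {provider}")


# The third-party package is on the classpath and is named in full,
# yet the built-in implementation is the one that gets selected.
third_party_on_classpath = {"com.databricks.spark.avro"}
print(resolve_with_remap("com.databricks.spark.avro", third_party_on_classpath))
```

In this model the user's fully qualified name never reaches the third-party 
code, which is the behavior being objected to.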
{quote}For clarification, it's not personally related to me in any way at all 
but I thought we better keep it consistent with CSV's.
To sum up, I get your position but I think the current approach makes a 
coherent point too. In that case, I think we better follow what we have done 
with CSV.
{quote}
I understand, and that definitely makes sense, but I don't agree that we should 
have done it even for CSV. Unfortunately I didn't see that go in, so I couldn't 
disagree at the time. I think we should have made the message more user 
friendly and told users to switch to Spark's implementation or rename theirs. 
We can't be responsible for keeping compatibility with all third-party 
libraries like that. We can't control what names they use.

I'm fine with the short name "avro" being mapped to our implementation, but if 
they use the full com.databricks name we should respect it, or throw an error 
if we can't.  If everyone agrees, I can file a separate Jira to revert.
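
The alternative proposed here could look roughly like the sketch below. Again 
this is plain Python with hypothetical names (the function, the short-name 
table, the placeholder class name, and the error text are all assumptions), 
meant only to illustrate the rule, not to show how Spark would implement it.

```python
# Sketch of the proposed rule; all names here are hypothetical.
BUILTIN_AVRO = "builtin.AvroFileFormat"  # placeholder for the built-in source

# Only the bare short name is claimed by the built-in implementation.
SHORT_NAMES = {"avro": BUILTIN_AVRO}


def resolve_proposed(provider, classpath):
    """Map the short name to the built-in source; respect full names."""
    if provider in SHORT_NAMES:
        return SHORT_NAMES[provider]
    if provider in classpath:
        # A fully qualified third-party name is used exactly as written.
        return provider
    # No silent remapping: fail with a message that tells the user what to do.
    raise LookupError(
        f"Data source '{provider}' not found. Use the built-in 'avro' "
        "source or add the package that provides it."
    )


print(resolve_proposed("avro", set()))
print(resolve_proposed("com.databricks.spark.avro", {"com.databricks.spark.avro"}))
```

Under this rule, a job that names the third-party package in full either runs 
that package or fails loudly, rather than silently switching implementations.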

 

> Add mapping for built-in Avro data source
> -----------------------------------------
>
>                 Key: SPARK-24924
>                 URL: https://issues.apache.org/jira/browse/SPARK-24924
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> This issue aims to do the following:
>  # Like `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to built-in Avro data source.
>  # Remove incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)