[
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570220#comment-16570220
]
Thomas Graves commented on SPARK-24924:
---------------------------------------
{quote}I have followed the changes in Avro and I don't think there are big
differences. We should keep the behaviours in particular within 2.4.0. If I
missed some and this introduced a bug or behaviour changes, I personally think
we should fix them within 2.4.0. That was one of key things I took into account
when I merged some changes.
{quote}
Sorry, I didn't mean to claim that anyone introduced bugs by merging this in.
{quote}In this case, users should provide their own short name of the package.
I would say it's discouraged to use the same name with Spark's builtin
datasources, or other packages name reserved - I wonder if users would actually
try to have the same name in practice.
{quote}
I disagree with this. It's already a 3rd-party package, not called
org.apache.spark, and they are providing their own short name that used to work
before this. It's one thing to just reference "avro", but when they write out
the entire com.databricks. name, we should not be remapping it.
{quote}We will make this in the release note - I think I listed up the possible
stories about this in
https://issues.apache.org/jira/browse/SPARK-24924?focusedCommentId=16567708&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16567708
{quote}
Yes, but like anything else, that requires a user to read it. Many users just
get new versions deployed on their cluster, and if their job continues to run
they don't notice or pay attention.
{quote}I also realize these are 3rd party packages but I think we are making
the assumption here based on this being a databricks package, which in my
opinion we shouldn't. What if this was companyX package which we didn't know
about, what would/should be the expected behavior?
I think the main reason for this is that the code is actually ported from Avro
{{com.databricks.*}}. The problem here is a worry that {{com.databricks.*}}
indicates the builtin Avro, right?
{quote}
Yes. Personally, I don't think we should be remapping any third-party libraries
to Apache Spark. In my opinion this case is even worse: we don't support
spark.read.avro ourselves, but it happens to work if you include the databricks
package. Even then it doesn't really call into the databricks code; it calls
into the Spark code. If I remove the databricks jar, then spark.read.avro
doesn't work. Really confusing to users IMHO.
{quote}For clarification, it's not personally related to me in any way at all
but I thought we better keep it consistent with CSV's.
To sum up, I get your position but I think the current approach makes a
coherent point too. In that case, I think we better follow what we have done
with CSV.
{quote}
I understand, and that definitely makes sense, but I don't agree that we should
have even done it for CSV. Unfortunately I didn't see that change go in, so I
couldn't disagree at the time. I think we should have made the error message
more user-friendly and told users to either switch to Spark's built-in source
or rename theirs. We can't be responsible for keeping compatibility with all
3rd-party libraries like that. We can't control what names they use.
I'm fine with mapping the short name "avro" to our implementation when that is
all they specify, but if they use the full com.databricks name we should
respect it, or throw an error if we can't. If everyone agrees, I can file a
separate Jira to revert.
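To make the two policies under discussion concrete, here is a minimal Python
sketch. It is a toy model and not Spark's actual lookup code: the function
names, the `classpath` set, and the dictionaries are all hypothetical, and the
built-in class name is used only as an illustrative target.

```python
# Toy model of data source name resolution. NOT Spark's real implementation;
# every name below is an assumption made for illustration.
BUILTIN_AVRO = "org.apache.spark.sql.avro.AvroFileFormat"  # illustrative target

# Short names that are reserved for built-in sources.
SHORT_NAMES = {"avro": BUILTIN_AVRO}

# Third-party fully qualified names that get redirected to the built-in source.
REMAPPED = {"com.databricks.spark.avro": BUILTIN_AVRO}


def resolve_with_remap(name, classpath):
    """Remapping policy: the short name AND the third-party name go to the built-in."""
    if name in SHORT_NAMES:
        return SHORT_NAMES[name]
    if name in REMAPPED:
        # The user's explicitly named package is silently ignored here.
        return REMAPPED[name]
    if name in classpath:
        return name
    raise LookupError(f"Data source {name!r} not found")


def resolve_strict(name, classpath):
    """Policy argued for in this comment: only the short name maps to the built-in."""
    if name in SHORT_NAMES:
        return SHORT_NAMES[name]
    if name in classpath:
        # A fully qualified third-party name is respected as written.
        return name
    raise LookupError(
        f"Data source {name!r} not found; use the short name 'avro' for the "
        "built-in source or add the package that provides it"
    )
```

With the third-party package available, `resolve_with_remap` returns the
built-in class for `com.databricks.spark.avro`, while `resolve_strict` returns
the name the user asked for and raises an error only when that package is
missing.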
> Add mapping for built-in Avro data source
> -----------------------------------------
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
> # Like `com.databricks.spark.csv` mapping, we had better map
> `com.databricks.spark.avro` to built-in Avro data source.
> # Remove incorrect error message, `Please find an Avro package at ...`.