Hi Hyukjin, can you open a PR to revert it from 2.4? Now I'm kind of convinced this is too breaking and we need more discussion.
+ Ryan Blue

Hi Ryan, I think we need to look back at the new write API design and consider data sources that don't have a table concept. We should make the schema validation of the append operator opt-in.

On Thu, Oct 11, 2018 at 8:12 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> That's why I initially suggested to revert this part out of Spark 2.4 and
> have more discussion for 3.0, since one of the design goals of Data Source V2
> is no behaviour changes for end users.
>
> On Thu, Oct 11, 2018 at 7:11 PM Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>
>> Actually, it is not just a question of a write-only data source. The
>> issue is that in my case (and I imagine this is true for others), the
>> schema is not read from the database but is inferred from the options.
>> This means that I have no way of determining the schema without supplying
>> the read options. On the other hand, when writing, I have the schema from
>> the dataframe.
>>
>> I know the Data Source V2 API is considered experimental and I have no
>> problem with that; however, this change means the end user has to work
>> differently (they suddenly need to add schema information which they did
>> not need before), not to mention that it is a regression.
>>
>> As to the pull request, it only handles the cases where the save mode is
>> not Append. The original example (a non-existent path with Append) will
>> still fail, even though according to the documentation of Append, if the
>> path does not exist it should be created.
>>
>> I am currently having trouble compiling everything, so I can't test it
>> myself, but wouldn't changing the relation definition in "save":
>>
>>     val relation = DataSourceV2Relation.create(
>>       source, options, None, Option(df.schema))
>>
>> and changing create to look like this:
>>
>>     def create(
>>         source: DataSourceV2,
>>         options: Map[String, String],
>>         tableIdent: Option[TableIdentifier] = None,
>>         userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation = {
>>       val schema = userSpecifiedSchema.getOrElse(
>>         source.createReader(options, userSpecifiedSchema).readSchema())
>>       val ident = tableIdent.orElse(tableFromOptions(options))
>>       DataSourceV2Relation(
>>         source, schema.toAttributes, options, ident, userSpecifiedSchema)
>>     }
>>
>> correct this?
>>
>> Or even creating a new create which simply takes the schema as non-optional?
>>
>> Thanks,
>> Assaf
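As a sketch of that last alternative -- untested, and assuming the tableFromOptions helper and the DataSourceV2Relation constructor quoted below in the thread -- a non-optional-schema overload could look like:

    // Sketch only (untested): an overload of DataSourceV2Relation.create
    // that takes the schema directly -- e.g. df.schema on the write path --
    // so that no reader ever has to be created for a write-only source.
    def create(
        source: DataSourceV2,
        options: Map[String, String],
        schema: StructType,
        tableIdent: Option[TableIdentifier]): DataSourceV2Relation = {
      val ident = tableIdent.orElse(tableFromOptions(options))
      DataSourceV2Relation(
        source, schema.toAttributes, options, ident, Some(schema))
    }

The write path in save could then call it as DataSourceV2Relation.create(source, options, df.schema, None); leaving it without default arguments keeps it from clashing with the existing overload.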
>> From: Hyukjin Kwon [mailto:gurwls...@gmail.com]
>> Sent: Thursday, October 11, 2018 10:24 AM
>> To: Mendelson, Assaf; Wenchen Fan
>> Cc: dev
>> Subject: Re: Possible bug in DatasourceV2
>>
>> See https://github.com/apache/spark/pull/22688
>>
>> +Wenchen, here is where the problem was raised. This might have to be
>> considered a blocker ...
>>
>> On Thu, 11 Oct 2018, 2:48 pm assaf.mendelson <assaf.mendel...@rsa.com> wrote:
>>
>> Hi,
>>
>> I created a data source writer WITHOUT a reader. When I do, I get an
>> exception:
>>
>>     org.apache.spark.sql.AnalysisException: Data source is not readable: DefaultSource
>>
>> The reason is that when save is called, inside the match on WriteSupport
>> we have the following code (the arrow marks the problematic line):
>>
>>     val source = cls.newInstance().asInstanceOf[DataSourceV2]
>>     source match {
>>       case ws: WriteSupport =>
>>         val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
>>           source, df.sparkSession.sessionState.conf)
>>         val options = sessionOptions ++ extraOptions
>>     --> val relation = DataSourceV2Relation.create(source, options)
>>
>>         if (mode == SaveMode.Append) {
>>           runCommand(df.sparkSession, "save") {
>>             AppendData.byName(relation, df.logicalPlan)
>>           }
>>         } else {
>>           val writer = ws.createWriter(
>>             UUID.randomUUID.toString,
>>             df.logicalPlan.output.toStructType,
>>             mode,
>>             new DataSourceOptions(options.asJava))
>>
>>           if (writer.isPresent) {
>>             runCommand(df.sparkSession, "save") {
>>               WriteToDataSourceV2(writer.get, df.logicalPlan)
>>             }
>>           }
>>         }
>>
>> but DataSourceV2Relation.create actively creates a reader
>> (source.createReader) in order to extract the schema:
>>
>>     def create(
>>         source: DataSourceV2,
>>         options: Map[String, String],
>>         tableIdent: Option[TableIdentifier] = None,
>>         userSpecifiedSchema: Option[StructType] = None): DataSourceV2Relation = {
>>       val reader = source.createReader(options, userSpecifiedSchema)
>>       val ident = tableIdent.orElse(tableFromOptions(options))
>>       DataSourceV2Relation(
>>         source, reader.readSchema().toAttributes, options, ident,
>>         userSpecifiedSchema)
>>     }
>>
>> This makes me a little confused.
>>
>> First, the schema is defined by the dataframe itself, not by the data
>> source; i.e., it should be taken from df.schema rather than from
>> source.createReader.
>>
>> Second, I see that the relation is actually only used if the mode is
>> SaveMode.Append (by the way, this means that if it is only needed there,
>> it should be defined inside the "if"). I am not sure I understand the
>> AppendData part, but why should reading from the source be involved at all?
>>
>> Am I missing something here?
>>
>> Thanks,
>> Assaf
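For concreteness, a minimal write-only source of the kind described above might look like the following sketch against the Spark 2.4-era V2 API (class names are illustrative, signatures may differ slightly between 2.3 and 2.4, and the per-partition writer factory is elided):

    import java.util.Optional

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
    import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}
    import org.apache.spark.sql.types.StructType

    // A source that implements WriteSupport but NOT ReadSupport. The schema
    // arrives from the DataFrame being written, so no reader is needed --
    // yet DataSourceV2Relation.create still calls source.createReader on
    // save(), which raises "Data source is not readable".
    class DefaultSource extends DataSourceV2 with WriteSupport {
      override def createWriter(
          writeUUID: String,
          schema: StructType, // supplied by the write path (df.schema)
          mode: SaveMode,
          options: DataSourceOptions): Optional[DataSourceWriter] =
        Optional.of(new NoopWriter(schema))
    }

    // Minimal committing writer; a real implementation would return a
    // working DataWriterFactory from createWriterFactory.
    class NoopWriter(schema: StructType) extends DataSourceWriter {
      override def createWriterFactory(): DataWriterFactory[InternalRow] =
        throw new UnsupportedOperationException("elided in this sketch")
      override def commit(messages: Array[WriterCommitMessage]): Unit = ()
      override def abort(messages: Array[WriterCommitMessage]): Unit = ()
    }

A write such as spark.range(5).write.format(classOf[DefaultSource].getName).save() then fails with the exception quoted above before createWriter is ever reached, because the relation (and hence the reader) is built before the save mode is even checked.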