I'll go ahead and make the changes to represent the schema name as the
database name for the purposes of the Spark catalog.

If anyone knows of an existing way to list all available schemata within an
Ignite instance, please let me know; otherwise the first task will be
creating that mechanism.
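
In the meantime, here is a rough sketch of the mechanism I have in mind:
derive the schema list from the cache configurations, since as far as I
know there is no direct "list schemas" API. The helper name and the
fallback to the cache name are my assumptions:

    import org.apache.ignite.Ignite
    import org.apache.ignite.configuration.CacheConfiguration

    import scala.collection.JavaConverters._

    // Collect the distinct SQL schemata by inspecting each cache's
    // configuration; assumes a cache without an explicit SQL schema
    // falls back to a schema named after the cache itself.
    def listSchemata(ignite: Ignite): Set[String] =
      ignite.cacheNames().asScala.map { name =>
        val cfg = ignite.cache[Any, Any](name)
          .getConfiguration(classOf[CacheConfiguration[Any, Any]])
        Option(cfg.getSqlSchema).getOrElse(name)
      }.toSet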

Stuart.

On Fri, Aug 24, 2018 at 6:23 PM Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:

> Nikolay,
>
> If there are multiple configurations in the XML, IgniteContext will always
> use only one of them. Looks like the current approach simply doesn't work.
> I propose to report the schema name as the 'database' in Spark. If there
> are multiple clients, you would create multiple sessions and multiple
> catalogs.
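>
> Something like this, perhaps (just a sketch of the proposal; the config
> paths and app names are made up):
>
>     import org.apache.ignite.spark.IgniteSparkSession
>
>     // One session -- and therefore one catalog -- per Ignite cluster.
>     val sessionA = IgniteSparkSession.builder()
>       .appName("cluster-a")
>       .igniteConfig("/path/to/cluster-a-config.xml")
>       .getOrCreate()
>
>     val sessionB = IgniteSparkSession.builder()
>       .appName("cluster-b")
>       .igniteConfig("/path/to/cluster-b-config.xml")
>       .getOrCreate()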
>
> Makes sense?
>
> -Val
>
> On Fri, Aug 24, 2018 at 12:33 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
>
> > Hello, Valentin.
> >
> > > catalog exist in scope of a single IgniteSparkSession (and therefore
> > > single IgniteContext and single Ignite instance)?
> >
> > Yes.
> > Actually, I was thinking about the use case where we have several Ignite
> > configurations in one XML file.
> > Now I see that this may be too rare a use case to support.
> >
> > Stuart, Valentin, what is your proposal?
> >
> > On Wed, 22/08/2018 at 08:56 -0700, Valentin Kulichenko wrote:
> > > Nikolay,
> > >
> > > Whatever we decide on would be right :) Basically, we need to answer this
> > > question: does the catalog exist in the scope of a single
> > > IgniteSparkSession (and therefore a single IgniteContext and a single
> > > Ignite instance)? In other words, in the rare case where a single Spark
> > > application connects to multiple Ignite clusters, would there be a
> > > catalog created per cluster?
> > >
> > > If the answer is yes, the current logic doesn't make sense.
> > >
> > > -Val
> > >
> > >
> > > On Wed, Aug 22, 2018 at 1:44 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > >
> > > > Hello, Valentin.
> > > >
> > > > > I believe we should get rid of this logic and use Ignite schema name
> > > > > as database name in Spark's catalog.
> > > >
> > > > When I developed the Ignite integration with Spark Data Frames, I used
> > > > the following abstraction described by Vladimir Ozerov:
> > > >
> > > > "1) Let's consider Ignite cluster as a single database ("catalog" in
> > > > ANSI SQL'92 terms)." [1]
> > > >
> > > > Was I wrong? If yes, let's fix it.
> > > >
> > > > [1]
> > > > http://apache-ignite-developers.2346864.n4.nabble.com/SQL-usability-catalogs-schemas-and-tables-td17148.html
> > > >
> > > > On Wed, 22/08/2018 at 09:26 +0100, Stuart Macdonald wrote:
> > > > > Hi Val, yes that's correct. I'd be happy to make the change to have
> > > > > the database reference the schema if Nikolay agrees. (I'll first need
> > > > > to do a bit of research into how to obtain the list of all available
> > > > > schemata...)
> > > > >
> > > > > Thanks,
> > > > > Stuart.
> > > > >
> > > > > On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
> > > > >
> > > > > > Stuart,
> > > > > >
> > > > > > Thanks for pointing this out, I was not aware that we use the Spark
> > > > > > database concept this way. Actually, this confuses me a lot. As far
> > > > > > as I understand, the catalog is created in the scope of a particular
> > > > > > IgniteSparkSession, which in turn is assigned to a particular
> > > > > > IgniteContext and therefore a single Ignite client. If that's the
> > > > > > case, I don't think it should be aware of other Ignite clients that
> > > > > > are connected to other clusters. This doesn't look like correct
> > > > > > behavior to me, not to mention that with this approach having
> > > > > > multiple databases would be a very rare case. I believe we should
> > > > > > get rid of this logic and use the Ignite schema name as the database
> > > > > > name in Spark's catalog.
> > > > > >
> > > > > > Nikolay, what do you think?
> > > > > >
> > > > > > -Val
> > > > > >
> > > > > > On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <stu...@stuwee.org> wrote:
> > > > > >
> > > > > > > Nikolay, Val,
> > > > > > >
> > > > > > > The JDBC Spark datasource [1] -- as far as I can tell -- has no
> > > > > > > ExternalCatalog implementation; it just uses the database specified
> > > > > > > in the JDBC URL. So I don't believe there is any way to call
> > > > > > > listTables() or listDatabases() for the JDBC provider.
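> > > > > > >
> > > > > > > For illustration, this is roughly how the JDBC provider is used;
> > > > > > > the schema is simply baked into the dbtable option rather than
> > > > > > > surfaced through any catalog (connection details are made up):
> > > > > > >
> > > > > > >     val df = spark.read
> > > > > > >       .format("jdbc")
> > > > > > >       .option("url", "jdbc:postgresql://example-host/mydb")
> > > > > > >       .option("dbtable", "mySchema.myTable") // schema-qualified
> > > > > > >       .load()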
> > > > > > >
> > > > > > > The Hive ExternalCatalog [2] makes the distinction between
> > > > > > > database and table using the actual database and table mechanisms
> > > > > > > built into the catalog, which is fine because Hive has the clear
> > > > > > > distinction and hierarchy of databases and tables.
> > > > > > >
> > > > > > > *However*, Ignite already uses the "database" concept in the
> > > > > > > Ignite ExternalCatalog [3] to mean the name of an Ignite instance.
> > > > > > > So in Ignite we have instances containing schemas containing
> > > > > > > tables, while Spark only has the concept of databases and tables,
> > > > > > > so it seems we must either ignore one of the three Ignite concepts
> > > > > > > or combine two of them into database or table. The current
> > > > > > > implementation in the pull request combines the Ignite schema and
> > > > > > > table attributes into the Spark table attribute.
> > > > > > >
> > > > > > > Stuart.
> > > > > > >
> > > > > > > [1]
> > > > > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
> > > > > > > [2]
> > > > > > > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
> > > > > > > [3]
> > > > > > > https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala
> > > > > > >
> > > > > > > On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hello, Stuart.
> > > > > > > >
> > > > > > > > Can you do some research and find out how the schema is handled
> > > > > > > > in Data Frames for a regular RDBMS such as Oracle, MySQL, etc.?
> > > > > > > >
> > > > > > > > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
> > > > > > > > > Stuart, Nikolay,
> > > > > > > > >
> > > > > > > > > I see that the 'Table' class (returned by the listTables
> > > > > > > > > method) has a 'database' field. Can we use this one to report
> > > > > > > > > the schema name?
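> > > > > > > > >
> > > > > > > > > For example, a sketch of what that could look like (the
> > > > > > > > > session variable and schema name are made up):
> > > > > > > > >
> > > > > > > > >     // If the catalog reported the Ignite schema as the
> > > > > > > > >     // database, users could disambiguate like this:
> > > > > > > > >     igniteSession.catalog.listTables("MYSCHEMA").collect()
> > > > > > > > >       .foreach(t => println(s"${t.database}.${t.name}"))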
> > > > > > > > >
> > > > > > > > > In any case, I think we should look into how this is done in
> > > > > > > > > data source implementations for other databases. Any relational
> > > > > > > > > database has a notion of schema, and I'm sure Spark integrations
> > > > > > > > > take this into account somehow.
> > > > > > > > >
> > > > > > > > > -Val
> > > > > > > > >
> > > > > > > > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > Hello, Stuart.
> > > > > > > > > >
> > > > > > > > > > Personally, I think we should change the current table
> > > > > > > > > > naming and return tables in the form `schema.table`.
> > > > > > > > > >
> > > > > > > > > > Valentin, could you share your opinion?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
> > > > > > > > > > > Igniters,
> > > > > > > > > > >
> > > > > > > > > > > While reviewing the changes for IGNITE-9228 [1,2], Nikolay
> > > > > > > > > > > and I are discussing whether to introduce a change which may
> > > > > > > > > > > impact backwards compatibility; Nikolay suggested we take
> > > > > > > > > > > the discussion to this list.
> > > > > > > > > > >
> > > > > > > > > > > Ignite implements a custom Spark catalog which provides
> > > > > > > > > > > an API by which Spark users can list the tables available
> > > > > > > > > > > in Ignite that can be queried via Spark SQL. Currently
> > > > > > > > > > > that list includes just the names of the tables, but
> > > > > > > > > > > IGNITE-9228 introduces a change which allows optional
> > > > > > > > > > > prefixing of schema names to table names to disambiguate
> > > > > > > > > > > multiple tables with the same name in different schemas.
> > > > > > > > > > > For the "list tables" API we therefore have two options:
> > > > > > > > > > >
> > > > > > > > > > > 1. List the tables under both their plain table names and
> > > > > > > > > > > their schema-qualified names (e.g. [ "myTable",
> > > > > > > > > > > "mySchema.myTable" ]), even though they are the same
> > > > > > > > > > > underlying table. This retains backwards compatibility
> > > > > > > > > > > with users who expect "myTable" to appear in the catalog.
> > > > > > > > > > > 2. List the tables using only their schema-qualified names.
> > > > > > > > > > > This eliminates duplication of names in the catalog but
> > > > > > > > > > > will potentially break compatibility with users who expect
> > > > > > > > > > > the plain table name in the catalog.
> > > > > > > > > > >
> > > > > > > > > > > With either option we will allow Spark SQL SELECT
> > > > > > > > > > > statements to use either plain or schema-qualified table
> > > > > > > > > > > names; this change would purely impact the API which is
> > > > > > > > > > > used to list available tables.
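> > > > > > > > > > >
> > > > > > > > > > > To make that concrete (illustrative names only), under
> > > > > > > > > > > either option both of these would work:
> > > > > > > > > > >
> > > > > > > > > > >     igniteSession.sql("SELECT * FROM myTable")
> > > > > > > > > > >     igniteSession.sql("SELECT * FROM mySchema.myTable")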
> > > > > > > > > > >
> > > > > > > > > > > Any opinions would be welcome.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stuart.
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9228
> > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4551