Re: Spark DataFrames With Cache Key and Value Objects

Nikolay Izhikov Tue, 31 Jul 2018 02:50:18 -0700

Hello, Igniters.

Valentin,


> We never recommend to use these fields

Actually, we did:

        * Documentation [1]. Please, see "Predefined Fields" section.
        * Java Example [2]
        * DotNet Example [3]
        * Scala Example [4]

> ...hopefully will be removed altogether one day

This is news to me.

Do we have specific plans for it?

[1] https://apacheignite-sql.readme.io/docs/schema-and-indexes
[2] https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/sql/SqlDmlExample.java#L88
[3] https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/examples/Apache.Ignite.Examples/Sql/SqlDmlExample.cs#L91
[4] https://github.com/apache/ignite/blob/master/examples/src/main/scala/org/apache/ignite/scalar/examples/ScalarCachePopularNumbersExample.scala#L124

On Fri, 27/07/2018 at 15:22 -0700, Valentin Kulichenko wrote:
> Stuart,
> 
> The _key and _val fields are quite a dirty hack that was added years ago and are
> virtually never used now. We never recommend to use these fields and I
> would definitely avoid building new features based on them.
> 
> Having said that, I'm not arguing the use case, but we need better
> implementation approach here. I suggest we think it over and come back to
> this next week :) I'm sure Nikolay will also chime in and share his
> thoughts.
> 
> -Val
> 
> On Fri, Jul 27, 2018 at 12:39 PM Stuart Macdonald <[email protected]> wrote:
> 
> > If your predicates and joins are expressed in Spark SQL, you cannot
> > currently optimise those and also gain access to the key/val objects. If
> > you went without the Ignite Spark SQL optimisations and expressed your
> > query in Ignite SQL, you still need to use the _key/_val columns. The
> > Ignite documentation has this specific example using the _val column (right
> > at the end):
> > https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd
> > 
> > Stuart.
> > 
> > On 27 Jul 2018, at 20:05, Valentin Kulichenko <[email protected]> wrote:
> > 
> > Well, the second approach would use the optimizations, no?
> > 
> > -Val
> > 
> > 
> > On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald <[email protected]> wrote:
> > 
> > Val,
> > 
> > 
> > Yes, you can already get access to the cache objects as an RDD or
> > Dataset, but you can’t use the Ignite-optimised DataFrames with these
> > mechanisms. Optimised DataFrames have to be passed through Spark SQL’s
> > Catalyst engine to allow for predicate pushdown to Ignite. So the
> > use case we’re talking about here is when we want to be able to push
> > Spark filters/joins to Ignite to optimise, but still have access to
> > the underlying cache objects, which is not possible currently.
> > 
> > 
> > Can you elaborate on the reason _key and _val columns in Ignite SQL
> > will be removed?
> > 
> > 
> > Stuart.
> > 
> > 
> > On 27 Jul 2018, at 19:39, Valentin Kulichenko <[email protected]> wrote:
> > 
> > 
> > Stuart, Nikolay,
> > 
> > 
> > I really don't like the idea of exposing '_key' and '_val' fields. This is
> > legacy stuff that hopefully will be removed altogether one day. Let's not
> > use it in new features.
> > 
> > 
> > Actually, I don't think it's even needed. Spark docs [1] suggest two
> > ways of creating a typed dataset:
> > 
> > 1. Based on RDD. This should be supported using IgniteRDD, I believe.
> > 2. Based on DataFrame, providing a class. This would just work out of the
> > box, I guess.
> > 
> > 
> > Of course, this needs to be tested and verified, and there might be certain
> > pieces missing to fully support the use case. But generally I like these
> > approaches much more.
> > 
> > 
> > 
> > 
> > [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
> > 
> > 
> > -Val
> > 
> > 
> > On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <[email protected]> wrote:
> > 
> > 
> > Here’s the ticket:
> > 
> > 
> > https://issues.apache.org/jira/browse/IGNITE-9108
> > 
> > 
> > Stuart.
> > 
> > 
> > 
> > On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
> > 
> > 
> > Sure.
> > 
> > 
> > Please, send ticket number in this thread.
> > 
> > 
> > On Fri, 27 July 2018 at 16:16, Stuart Macdonald <[email protected]> wrote:
> > 
> > 
> > Thanks Nikolay. For both options, if the cache object isn’t a simple type,
> > we’d probably do something like this in our Ignite SQL statement:
> > 
> > select cast(_key as binary), cast(_val as binary), ...
> > 
> > which would give us the BinaryObject’s byte[]. Then for option 1 we keep
> > the Ignite format and introduce a new Spark Encoder for Ignite binary types
> > (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
> > so that the end user interface would be something like:
> > 
> > 
> > IgniteSparkSession session = ...
> > Dataset<Row> dataFrame = ...
> > Dataset<MyValClass> valDataSet =
> >     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
> > 
> > 
> > Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so that the
> > user interface would be standard Spark:
> > 
> > Dataset<Row> dataFrame = ...
> > Dataset<MyValClass> dataSet =
> >     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
> > 
> > 
> > I’ll create a ticket and maybe put together a test case for further
> > discussion?
> > 
> > 
> > Stuart.
> > 
> > 
> > On 27 Jul 2018, at 09:50, Nikolay Izhikov <[email protected]> wrote:
> > 
> > 
> > Hello, Stuart.
> > 
> > 
> > I like your idea.
> > 
> > 
> > 1. Ignite BinaryObjects, in which case we’d need to supply a Spark Encoder
> > implementation for BinaryObjects
> > 
> > 2. Kryo-serialised versions of the objects.
> > 
> > 
> > 
> > It seems the first option is a simple adapter. Am I right?
> > If so, I think it is more efficient than transforming each object
> > into some other (Kryo) format.
> > 
> > Can you provide some additional links for both options?
> > Where can I find the API and/or examples?
> > 
> > 
> > As a second step, we can apply the same approach to regular key-value
> > caches.
> > 
> > 
> > Feel free to create a ticket.
> > 
> > 
> > On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
> > 
> > 
> > Ignite Dev Community,
> > 
> > 
> > 
> > Within Ignite-supplied Spark DataFrames, I’d like to propose adding support
> > for _key and _val columns which represent the cache key and value objects,
> > similar to the current _key/_val column semantics in Ignite SQL.
> > 
> > If the cache key or value objects are standard SQL types (e.g. String, Int,
> > etc.) they will be represented as such in the DataFrame schema; otherwise
> > they are represented as Binary types encoded as either: 1. Ignite
> > BinaryObjects, in which case we’d need to supply a Spark Encoder
> > implementation for BinaryObjects, or 2. Kryo-serialised versions of the
> > objects. Option 1 would probably be more efficient but option 2 would be
> > more idiomatic Spark.
> > 
> > This feature would be controlled with an optional parameter in the Ignite
> > data source, defaulting to the current implementation which doesn’t supply
> > _key or _val columns. The rationale behind this is the same as the Ignite
> > SQL _key and _val columns: to allow access to the full cache objects from a
> > SQL context.
> > 
> > 
> > 
> > Can I ask for feedback on this proposal please?
> > 
> > 
> > 
> > I’d be happy to contribute this feature if we agree on the concept.
> > 
> > 
> > 
> > Stuart.
> >
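
As a rough illustration of the schema rule in Stuart's original proposal (simple SQL types keep their own column type, anything else travels as a Binary column and is decoded back into a typed object on the reader's side), here is a minimal plain-Java sketch. Everything in it is an assumption made for illustration: the class CacheColumnSketch, the helper names columnTypeFor/encode/decode, and the type-name strings are hypothetical, and Java's built-in serialization merely stands in for the Ignite BinaryObject or Kryo encodings discussed above. It uses neither the Ignite nor the Spark API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CacheColumnSketch {

    /** Hypothetical value class standing in for a user's cache value type. */
    public static class MyValClass implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final int age;

        public MyValClass(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /**
     * Picks the column type a _key/_val column would get under the proposed
     * rule: simple types map to a SQL type, everything else becomes BINARY.
     * The type-name strings are illustrative, not Spark or Ignite constants.
     */
    public static String columnTypeFor(Class<?> cls) {
        if (cls == String.class) return "STRING";
        if (cls == Integer.class) return "INT";
        if (cls == Long.class) return "BIGINT";
        if (cls == Double.class) return "DOUBLE";
        if (cls == Boolean.class) return "BOOLEAN";
        return "BINARY";
    }

    /** Encodes a value to bytes (Java serialization as a stand-in encoding). */
    public static byte[] encode(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Decodes bytes from a binary column back into a typed object. */
    public static <T> T decode(byte[] data, Class<T> cls)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return cls.cast(in.readObject());
        }
    }

    public static void main(String[] args) throws Exception {
        // A String key keeps a SQL type; a custom value class falls back to BINARY.
        System.out.println("_key: " + columnTypeFor(String.class));
        System.out.println("_val: " + columnTypeFor(MyValClass.class));

        // Round trip: what a typed Dataset encoder would do per row, in spirit.
        byte[] valColumn = encode(new MyValClass("Alice", 30));
        MyValClass restored = decode(valColumn, MyValClass.class);
        System.out.println(restored.name + ", " + restored.age);
    }
}
```

The decode step is where the two options in the thread differ: option 1 would read Ignite's own binary format through a new Spark Encoder, while option 2 would re-encode values with Kryo so that the standard Encoders.kryo(...) can read them.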
