Well, the second approach would use the optimizations, no?
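
Something like this, as an untested sketch ('Person' is just an illustrative
bean class whose properties match the DataFrame columns):

    Dataset<Row> df = ...       // loaded via the Ignite data source
    Dataset<Person> people = df
        .filter(col("age").gt(30))        // still pushed down to Ignite by Catalyst
        .as(Encoders.bean(Person.class)); // typed view over the same optimized plan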

-Val


On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald <stu...@stuwee.org> wrote:

> Val,
>
> Yes, you can already get access to the cache objects as an RDD or
> Dataset, but you can’t use the Ignite-optimised DataFrames with these
> mechanisms. Optimised DataFrames have to be passed through Spark SQL’s
> Catalyst engine to allow for predicate pushdown to Ignite. So the use
> case we’re talking about here is when we want to be able to push Spark
> filters/joins to Ignite to optimise, but still have access to the
> underlying cache objects, which is not possible currently.
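>
> For example, the two mechanisms available today (rough sketch using the
> ignite-spark APIs; 'Person' is an illustrative value class):
>
> // IgniteRDD: full cache objects, but no Catalyst pushdown:
> JavaIgniteContext<Long, Person> ic = ...
> JavaIgniteRDD<Long, Person> rdd = ic.fromCache("person");
>
> // Data source: Catalyst pushdown, but only the SQL-visible fields,
> // not the underlying cache objects:
> Dataset<Row> df = spark.read()
>     .format(IgniteDataFrameSettings.FORMAT_IGNITE())
>     .option(IgniteDataFrameSettings.OPTION_TABLE(), "person")
>     .load();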
>
> Can you elaborate on the reason _key and _val columns in Ignite SQL
> will be removed?
>
> Stuart.
>
> > On 27 Jul 2018, at 19:39, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
> >
> > Stuart, Nikolay,
> >
> > I really don't like the idea of exposing '_key' and '_val' fields. This
> > is legacy stuff that hopefully will be removed altogether one day. Let's
> > not use it in new features.
> >
> > Actually, I don't think it's even needed. Spark docs [1] suggest two
> > ways of creating a typed dataset:
> > 1. Based on RDD. This should be supported using IgniteRDD I believe.
> > 2. Based on DataFrame providing a class. This would just work out of the
> > box I guess.
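> >
> > Rough sketches of both, untested ('Person' here is an illustrative bean
> > class whose properties match the DataFrame columns):
> >
> > // 1. Typed Dataset from an IgniteRDD of cache values:
> > JavaIgniteContext<Long, Person> ic = ...
> > Dataset<Person> fromRdd = spark.createDataset(
> >     ic.fromCache("person").values().rdd(), Encoders.bean(Person.class));
> >
> > // 2. Typed Dataset directly from the Ignite DataFrame:
> > Dataset<Person> fromDf = igniteDataFrame.as(Encoders.bean(Person.class));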
> >
> > Of course, this needs to be tested and verified, and there might be
> > certain pieces missing to fully support the use case. But generally I
> > like these approaches much more.
> >
> >
> > [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
> >
> > -Val
> >
> >> On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <stu...@stuwee.org> wrote:
> >>
> >> Here’s the ticket:
> >>
> >> https://issues.apache.org/jira/browse/IGNITE-9108
> >>
> >> Stuart.
> >>
> >>
> >>> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
> >>>
> >>> Sure.
> >>>
> >>> Please send the ticket number in this thread.
> >>>
> >>> Fri, 27 Jul 2018, 16:16 Stuart Macdonald <stu...@stuwee.org>:
> >>>
> >>>> Thanks Nikolay. For both options, if the cache object isn’t a simple
> >>>> type, we’d probably do something like this in our Ignite SQL statement:
> >>>>
> >>>> select cast(_key as binary), cast(_val as binary), ...
> >>>>
> >>>> Which would give us the BinaryObject’s byte[], then for option 1 we
> >>>> keep the Ignite format and introduce a new Spark Encoder for Ignite
> >>>> binary types
> >>>> (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
> >>>> so that the end user interface would be something like:
> >>>>
> >>>> IgniteSparkSession session = ...
> >>>> Dataset<Row> dataFrame = ...
> >>>> Dataset<MyValClass> valDataSet =
> >>>>     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
> >>>>
> >>>> Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so that
> >>>> the user interface would be standard Spark:
> >>>>
> >>>> Dataset<Row> dataFrame = ...
> >>>> Dataset<MyValClass> dataSet =
> >>>>     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
> >>>>
> >>>> I’ll create a ticket and maybe put together a test case for further
> >>>> discussion?
> >>>>
> >>>> Stuart.
> >>>>
> >>>> On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:
> >>>>
> >>>> Hello, Stuart.
> >>>>
> >>>> I like your idea.
> >>>>
> >>>> 1. Ignite BinaryObjects, in which case we’d need to supply a Spark
> >>>> Encoder implementation for BinaryObjects
> >>>>
> >>>> 2. Kryo-serialised versions of the objects.
> >>>>
> >>>>
> >>>> Seems like the first option is a simple adapter. Am I right?
> >>>> If yes, I think it's a more efficient way compared with transforming
> >>>> each object to some other (Kryo) format.
> >>>>
> >>>> Can you provide some additional links for both options?
> >>>> Where can I find the API and/or examples?
> >>>>
> >>>> As a second step, we can apply the same approach to regular key-value
> >>>> caches.
> >>>>
> >>>> Feel free to create a ticket.
> >>>>
> >>>> On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
> >>>>
> >>>> Ignite Dev Community,
> >>>>
> >>>> Within Ignite-supplied Spark DataFrames, I’d like to propose adding
> >>>> support for _key and _val columns which represent the cache key and
> >>>> value objects similar to the current _key/_val column semantics in
> >>>> Ignite SQL.
> >>>>
> >>>> If the cache key or value objects are standard SQL types (eg. String,
> >>>> Int, etc.) they will be represented as such in the DataFrame schema;
> >>>> otherwise they are represented as Binary types encoded as either:
> >>>> 1. Ignite BinaryObjects, in which case we’d need to supply a Spark
> >>>> Encoder implementation for BinaryObjects, or 2. Kryo-serialised
> >>>> versions of the objects. Option 1 would probably be more efficient but
> >>>> option 2 would be more idiomatic Spark.
> >>>>
> >>>> This feature would be controlled with an optional parameter in the
> >>>> Ignite data source, defaulting to the current implementation which
> >>>> doesn’t supply _key or _val columns. The rationale behind this is the
> >>>> same as the Ignite SQL _key and _val columns: to allow access to the
> >>>> full cache objects from a SQL context.
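> >>>>
> >>>> Usage might look something like this (the option name below is
> >>>> illustrative only, to be agreed in review):
> >>>>
> >>>> Dataset<Row> df = spark.read()
> >>>>     .format(IgniteDataFrameSettings.FORMAT_IGNITE())
> >>>>     .option(IgniteDataFrameSettings.OPTION_TABLE(), "person")
> >>>>     .option("withKeyValColumns", "true") // proposed flag, name TBD
> >>>>     .load(); // schema would then also include _key and _val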
> >>>>
> >>>> Can I ask for feedback on this proposal please?
> >>>>
> >>>> I’d be happy to contribute this feature if we agree on the concept.
> >>>>
> >>>> Stuart.
> >>
> >>
>
