Re: Spark DataFrames With Cache Key and Value Objects

Stuart Macdonald Fri, 27 Jul 2018 11:50:52 -0700

Val,

Yes, you can already get access to the cache objects as an RDD or
Dataset, but you can’t use the Ignite-optimised DataFrames with these
mechanisms. Optimised DataFrames have to be passed through Spark SQL’s
Catalyst engine to allow for predicate pushdown to Ignite. So the
use case we’re talking about here is when we want to push Spark
filters/joins down to Ignite for optimisation but still have access to
the underlying cache objects, which is not currently possible.
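
To make the column mapping from my original proposal (quoted further down) concrete, here is a minimal plain-Java sketch. It is not Ignite or Spark code: the class CacheColumnSketch, the Person value class and both helper methods are hypothetical names, and standard Java serialization is only a stand-in for the BinaryObject or Kryo encodings under discussion. It just shows the rule "simple SQL types pass through, everything else becomes a binary column".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in Ignite or Spark.
final class CacheColumnSketch {

    // Stand-in for a user-defined cache value class.
    static final class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Simple SQL-compatible values are returned unchanged; any other object
    // is serialized to byte[] (assumption: Java serialization stands in for
    // the BinaryObject or Kryo format).
    static Object toColumnValue(Object cacheObject) {
        if (cacheObject == null || cacheObject instanceof String
                || cacheObject instanceof Number || cacheObject instanceof Boolean) {
            return cacheObject;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(cacheObject);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reverses toColumnValue: byte[] columns are deserialized, simple values
    // are cast to the requested type.
    static <T> T fromColumnValue(Object columnValue, Class<T> type) {
        if (columnValue instanceof byte[]) {
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream((byte[]) columnValue))) {
                return type.cast(in.readObject());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }
        return type.cast(columnValue);
    }

    public static void main(String[] args) {
        Object keyColumn = toColumnValue("person-1");            // stays a String
        Object valColumn = toColumnValue(new Person("Ada", 36)); // becomes byte[]
        Person decoded = fromColumnValue(valColumn, Person.class);
        System.out.println(keyColumn + " -> " + decoded.name + ", " + decoded.age);
    }
}
```

In the real feature the decoding step would be the job of a Spark Encoder (option 1) or Encoders.kryo (option 2), so the user would never call anything like fromColumnValue directly.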


Can you elaborate on why the _key and _val columns in Ignite SQL
will be removed?

Stuart.

> On 27 Jul 2018, at 19:39, Valentin Kulichenko <[email protected]> wrote:
>
> Stuart, Nikolay,
>
> I really don't like the idea of exposing '_key' and '_val' fields. This is
> legacy stuff that hopefully will be removed altogether one day. Let's not
> use it in new features.
>
> Actually, I don't think it's even needed. Spark docs [1] suggest two
> ways of creating a typed dataset:
> 1. Based on RDD. This should be supported using IgniteRDD I believe.
> 2. Based on DataFrame providing a class. This would just work out of the
> box I guess.
>
> Of course, this needs to be tested and verified, and there might be certain
> pieces missing to fully support the use case. But generally I like these
> approaches much more.
>
> [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
>
> -Val
>
>> On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <[email protected]> wrote:
>>
>> Here’s the ticket:
>>
>> https://issues.apache.org/jira/browse/IGNITE-9108
>>
>> Stuart.
>>
>>
>>> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
>>>
>>> Sure.
>>>
>>> Please, send ticket number in this thread.
>>>
>>> Fri, 27 Jul 2018, 16:16 Stuart Macdonald <[email protected] (mailto:[email protected])>:
>>>
>>>> Thanks Nikolay. For both options, if the cache object isn’t a simple type,
>>>> we’d probably do something like this in our Ignite SQL statement:
>>>>
>>>> select cast(_key as binary), cast(_val as binary), ...
>>>>
>>>> Which would give us the BinaryObject’s byte[], then for option 1 we keep
>>>> the Ignite format and introduce a new Spark Encoder for Ignite binary types
>>>> (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
>>>> so that the end user interface would be something like:
>>>>
>>>> IgniteSparkSession session = ...
>>>> Dataset<Row> dataFrame = ...
>>>> Dataset<MyValClass> valDataSet =
>>>>     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
>>>>
>>>> Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so that the
>>>> user interface would be standard Spark:
>>>>
>>>> Dataset<Row> dataFrame = ...
>>>> Dataset<MyValClass> dataSet =
>>>>     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
>>>>
>>>> I’ll create a ticket and maybe put together a test case for further
>>>> discussion?
>>>>
>>>> Stuart.
>>>>
>>>> On 27 Jul 2018, at 09:50, Nikolay Izhikov <[email protected] (mailto:[email protected])> wrote:
>>>>
>>>> Hello, Stuart.
>>>>
>>>> I like your idea.
>>>>
>>>> 1. Ignite BinaryObjects, in which case we’d need to supply a Spark
>> Encoder
>>>> implementation for BinaryObjects
>>>>
>>>> 2. Kryo-serialised versions of the objects.
>>>>
>>>>
>>>> It seems the first option is a simple adapter. Am I right?
>>>> If so, I think it's more efficient than transforming each object
>>>> into some other (Kryo) format.
>>>>
>>>> Can you provide some additional links for both options?
>>>> Where can I find the API and/or examples?
>>>>
>>>> As a second step, we can apply the same approach to regular key-value
>>>> caches.
>>>>
>>>> Feel free to create a ticket.
>>>>
>>>> On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
>>>>
>>>> Ignite Dev Community,
>>>>
>>>>
>>>> Within Ignite-supplied Spark DataFrames, I’d like to propose adding support
>>>> for _key and _val columns which represent the cache key and value objects,
>>>> similar to the current _key/_val column semantics in Ignite SQL.
>>>>
>>>> If the cache key or value objects are standard SQL types (e.g. String, Int,
>>>> etc.) they will be represented as such in the DataFrame schema, otherwise
>>>> they are represented as Binary types encoded as either: 1. Ignite
>>>> BinaryObjects, in which case we’d need to supply a Spark Encoder
>>>> implementation for BinaryObjects, or 2. Kryo-serialised versions of the
>>>> objects. Option 1 would probably be more efficient but option 2 would be
>>>> more idiomatic Spark.
>>>>
>>>> This feature would be controlled with an optional parameter in the Ignite
>>>> data source, defaulting to the current implementation which doesn’t supply
>>>> _key or _val columns. The rationale behind this is the same as the Ignite
>>>> SQL _key and _val columns: to allow access to the full cache objects from a
>>>> SQL context.
>>>>
>>>>
>>>> Can I ask for feedback on this proposal please?
>>>>
>>>>
>>>> I’d be happy to contribute this feature if we agree on the concept.
>>>>
>>>>
>>>> Stuart.
>>
>>
