Well, the second approach would use the optimizations, no?

-Val
On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald <stu...@stuwee.org> wrote:

> Val,
>
> Yes, you can already get access to the cache objects as an RDD or
> Dataset, but you can't use the Ignite-optimised DataFrames with these
> mechanisms. Optimised DataFrames have to be passed through Spark SQL's
> Catalyst engine to allow for predicate pushdown to Ignite. So the use
> case we're talking about here is when we want to push Spark
> filters/joins down to Ignite to optimise, but still have access to the
> underlying cache objects, which is not possible currently.
>
> Can you elaborate on the reason the _key and _val columns in Ignite
> SQL will be removed?
>
> Stuart.
>
> On 27 Jul 2018, at 19:39, Valentin Kulichenko
> <valentin.kuliche...@gmail.com> wrote:
>
>> Stuart, Nikolay,
>>
>> I really don't like the idea of exposing '_key' and '_val' fields.
>> This is legacy stuff that will hopefully be removed altogether one
>> day. Let's not use it in new features.
>>
>> Actually, I don't think it's even needed. The Spark docs [1] suggest
>> two ways of creating a typed dataset:
>> 1. Based on an RDD. This should be supported using IgniteRDD, I
>> believe.
>> 2. Based on a DataFrame, providing a class. This would just work out
>> of the box, I guess.
>>
>> Of course, this needs to be tested and verified, and there might be
>> certain pieces missing to fully support the use case. But generally
>> I like these approaches much more.
>>
>> [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
>>
>> -Val
>>
>> On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <stu...@stuwee.org> wrote:
>>
>>> Here's the ticket:
>>>
>>> https://issues.apache.org/jira/browse/IGNITE-9108
>>>
>>> Stuart.
>>>
>>> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
>>>
>>>> Sure.
>>>>
>>>> Please send the ticket number in this thread.
>>>>
>>>> On Fri, 27 Jul 2018 at 16:16, Stuart Macdonald <stu...@stuwee.org> wrote:
>>>>
>>>>> Thanks Nikolay. For both options, if the cache object isn't a
>>>>> simple type, we'd probably do something like this in our Ignite
>>>>> SQL statement:
>>>>>
>>>>> select cast(_key as binary), cast(_val as binary), ...
>>>>>
>>>>> which would give us the BinaryObject's byte[]. Then for option 1
>>>>> we keep the Ignite format and introduce a new Spark Encoder for
>>>>> Ignite binary types
>>>>> (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
>>>>> so that the end-user interface would be something like:
>>>>>
>>>>> IgniteSparkSession session = ...
>>>>> Dataset<Row> dataFrame = ...
>>>>> Dataset<MyValClass> valDataSet =
>>>>>     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
>>>>>
>>>>> Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so
>>>>> that the user interface would be standard Spark:
>>>>>
>>>>> Dataset<Row> dataFrame = ...
>>>>> Dataset<MyValClass> dataSet =
>>>>>     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
>>>>>
>>>>> I'll create a ticket and maybe put together a test case for
>>>>> further discussion?
>>>>>
>>>>> Stuart.
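The Spark half of option 2 can be sanity-checked without an Ignite
cluster, since Encoders.kryo stores objects as a single binary column,
which is the same shape a Kryo-serialised _val column would have. Below
is a minimal, self-contained sketch using only stock Spark APIs;
MyValClass is an illustrative stand-in for a cache value class, and
note that option 1's session.binaryObjectEncoder above is a proposed
API, not an existing one.

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KryoEncoderSketch {

    // Illustrative stand-in for a cache value class that has no
    // built-in Spark encoder.
    public static class MyValClass implements Serializable {
        public String name;
        public int count;

        public MyValClass() { }

        public MyValClass(String name, int count) {
            this.name = name;
            this.count = count;
        }

        @Override
        public String toString() {
            return name + ":" + count;
        }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("kryo-encoder-sketch")
                .getOrCreate();

        // A Kryo-encoded dataset is physically one BinaryType column,
        // mirroring what a Kryo-serialised _val column would look like.
        Dataset<MyValClass> ds = spark.createDataset(
                Arrays.asList(new MyValClass("a", 1), new MyValClass("b", 2)),
                Encoders.kryo(MyValClass.class));

        Dataset<Row> asBinary = ds.toDF();
        asBinary.printSchema(); // root |-- value: binary (nullable = true)

        // The deserialisation path option 2 relies on: a binary column
        // mapped back to typed objects via the stock Kryo encoder.
        Dataset<MyValClass> roundTripped =
                asBinary.as(Encoders.kryo(MyValClass.class));
        roundTripped.collectAsList().forEach(System.out::println);

        spark.stop();
    }
}

The open Ignite-side work in option 2 would then be the behind-the-scenes
conversion of cache values into Kryo bytes before they reach the _val
column.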
>>>>>
>>>>> On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>>>
>>>>> Hello, Stuart.
>>>>>
>>>>> I like your idea.
>>>>>
>>>>> 1. Ignite BinaryObjects, in which case we'd need to supply a Spark
>>>>> Encoder implementation for BinaryObjects
>>>>> 2. Kryo-serialised versions of the objects.
>>>>>
>>>>> It seems like the first option is a simple adapter. Am I right?
>>>>> If yes, I think it's a more efficient way compared with
>>>>> transforming each object into some other (Kryo) format.
>>>>>
>>>>> Can you provide some additional links for both options?
>>>>> Where can I find the API and/or examples?
>>>>>
>>>>> As a second step, we can apply the same approach to regular
>>>>> key-value caches.
>>>>>
>>>>> Feel free to create a ticket.
>>>>>
>>>>> On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
>>>>>
>>>>> Ignite Dev Community,
>>>>>
>>>>> Within Ignite-supplied Spark DataFrames, I'd like to propose adding
>>>>> support for _key and _val columns which represent the cache key and
>>>>> value objects, similar to the current _key/_val column semantics in
>>>>> Ignite SQL.
>>>>>
>>>>> If the cache key or value objects are standard SQL types (e.g.
>>>>> String, Int, etc.) they will be represented as such in the
>>>>> DataFrame schema; otherwise they are represented as Binary types
>>>>> encoded as either:
>>>>> 1. Ignite BinaryObjects, in which case we'd need to supply a Spark
>>>>> Encoder implementation for BinaryObjects, or
>>>>> 2. Kryo-serialised versions of the objects.
>>>>> Option 1 would probably be more efficient, but option 2 would be
>>>>> more idiomatic Spark.
>>>>>
>>>>> This feature would be controlled with an optional parameter in the
>>>>> Ignite data source, defaulting to the current implementation, which
>>>>> doesn't supply _key or _val columns. The rationale behind this is
>>>>> the same as for the Ignite SQL _key and _val columns: to allow
>>>>> access to the full cache objects from a SQL context.
>>>>>
>>>>> Can I ask for feedback on this proposal, please?
>>>>>
>>>>> I'd be happy to contribute this feature if we agree on the concept.
>>>>>
>>>>> Stuart.
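For reference, here is a sketch of the two typed-dataset routes from the
Spark guide that Val refers to above, using plain Spark types only.
Person is illustrative, and the Ignite pieces (an IgniteRDD for route 1,
the Ignite data source for route 2) are assumed to slot in where the
plain RDD and DataFrame are constructed here.

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedDatasetRoutes {

    // Illustrative JavaBean; Encoders.bean needs getters/setters and a
    // no-arg constructor.
    public static class Person implements Serializable {
        private String name;
        private int age;

        public Person() { }

        public Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("typed-dataset-routes")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Route 1: RDD plus encoder. An IgniteRDD over a cache would
        // replace this local RDD; note this path bypasses Catalyst, so
        // Ignite filter pushdown would not apply.
        JavaRDD<Person> rdd = jsc.parallelize(
                Arrays.asList(new Person("a", 1), new Person("b", 2)));
        Dataset<Person> fromRdd =
                spark.createDataset(rdd.rdd(), Encoders.bean(Person.class));

        // Route 2: DataFrame plus a class. A DataFrame read from the
        // Ignite data source would replace this one; this path stays in
        // Catalyst, so pushed-down filters still reach Ignite.
        Dataset<Row> df = fromRdd.toDF();
        Dataset<Person> fromDf = df.as(Encoders.bean(Person.class));

        fromDf.show();
        spark.stop();
    }
}

Route 2 is the "second approach" in Val's last message: because the
resulting Dataset is still planned by Catalyst, the Ignite data source's
predicate pushdown keeps working, which speaks to Stuart's concern about
losing the optimisations.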