[Arrow][Dremio]

2018-05-13 Thread xmehaut
Hello,
I have some questions about Spark and Apache Arrow. Up to now, Arrow is only
used for sharing data between Python and Spark executors instead of
transmitting it through sockets. I am currently studying Dremio as an
interesting way to access multiple data sources, and as a potential
replacement for ETL tools, including Spark SQL.
It seems, if the promises actually hold, that Arrow and Dremio may be game
changers for these two purposes (data source abstraction, ETL tasks),
leaving Spark with the two remaining goals, i.e. ML/DL and graph
processing, which could endanger Spark in the medium term given the rise of
multiple frameworks in those areas.
My questions are then:
- Is there a means to use Arrow more broadly in Spark itself, and not only
for sharing data?
- What are the strengths and weaknesses of Spark with respect to Arrow, and
consequently Dremio?
- What is the difference, finally, between Databricks DBIO and Dremio/Arrow?
- How do you see the future of Spark regarding these assumptions?
Regards






Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-13 Thread Jacek Laskowski
Hi,

The exception message should be self-explanatory: you cannot join two
streaming Datasets on your Spark version. Stream-stream joins were added in
Spark 2.3, if I'm not mistaken.

Just to be sure that you work with two streaming Datasets, can you show the
query plan of the join query?
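
For reference, a minimal sketch of a stream-stream inner join that runs on
Spark 2.3+ (the broker address, topic names and column names here are made
up, not taken from your pipeline). Watermarks plus a time-range condition
let Spark bound the join state:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("StreamStreamJoin").getOrCreate()

// First streaming Dataset, read from Kafka, with an event-time watermark.
val impressions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "impressions")
  .load()
  .selectExpr("CAST(key AS STRING) AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "2 hours")

// Second streaming Dataset, same source style, different topic.
val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "clicks")
  .load()
  .selectExpr("CAST(key AS STRING) AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "3 hours")

// Inner join constrained in time so old state can eventually be dropped.
val joined = impressions.join(clicks, expr(
  "adId = clickAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))

joined.explain(true)  // prints the query plan asked about above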

Jacek

On Sat, 12 May 2018, 16:57 ThomasThomas wrote:

> Hi There,
>
> Our use case is like this.
>
> We have nested (multi-level) JSON messages flowing through a Kafka queue.
> We read the messages from Kafka using Spark Structured Streaming (SSS),
> explode the data and flatten everything into a single record using
> DataFrame joins, and land it in a relational database table (DB2).
>
> But we are getting the following error when we write into the DB using
> JDBC.
>
> “org.apache.spark.sql.AnalysisException: Inner join between two streaming
> DataFrames/Datasets is not supported;”
>
> Any help would be greatly appreciated.
>
> Thanks,
> Thomas Thomas
> Mastermind Solutions LLC.
>
>
>


Re: Measure performance time in some spark transformations.

2018-05-13 Thread Jörn Franke
Can’t you find this in the Spark UI or timeline server?
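
If the UI is not enough: keep in mind that transformations are lazy, so any
timing has to wrap an action, otherwise you only measure plan construction.
For per-stage wall-clock times you could also register a SparkListener; a
minimal sketch (assuming a running SparkSession named spark):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Prints how long each completed stage took, as reported by the scheduler.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    for (start <- info.submissionTime; end <- info.completionTime)
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
  }
})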

> On 13. May 2018, at 00:31, Guillermo Ortiz Fernández wrote:
> 
> I want to measure how long some different transformations take in Spark,
> such as map, joinWithCassandraTable and so on. Which is the best
> approximation to do it?
> 
> def time[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + (t1 - t0) + "ns")
>   result
> }
> 
> Could I use something like this? I guess that System.nanoTime will be
> executed in the driver before and after the workers execute the maps/joins
> and so on. Is that right? Any other idea?