Re: Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-23 Thread kant kodali
Quick question regarding the same topic. If I construct the data frame *df* in the following way:

key  | value  | topic  | offset | partition | timestamp
foo1 | value1 | topic1 | ...
foo2 | value2 | topic2 | ...

what
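
For context, the column names above match the schema produced by Spark's Kafka source. A minimal sketch of building such a streaming frame (broker address and topic name are placeholder values):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .load()
    // resulting columns: key, value, topic, partition, offset, timestamp, timestampType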

Re: strange behavior of joining dataframes

2018-03-23 Thread Shiyuan
Here is a simple example that reproduces the problem. This code fails with a missing attribute ('kk') error. Is it a bug? Note that if the `select` in line B is removed, the code runs. import pyspark.sql.functions as F df =

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Takeshi Yamamuro
Can you file a jira if this is a bug? Thanks!

Re: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Case class

2018-03-23 Thread Yong Zhang
I am still stuck on this. Does anyone know the correct way to use a custom Aggregator on a case class via agg? I'd like to use the Dataset API, but it looks like in aggregation Spark loses the type and falls back to GenericRowWithSchema instead of my case class. Is that right? Thanks
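
For reference, a hedged sketch of the typed route (the case class and aggregator below are made up, not the poster's code): an org.apache.spark.sql.expressions.Aggregator used through groupByKey(...).agg(agg.toColumn) keeps the case-class encoder, whereas the untyped groupBy(...).agg path hands the aggregator Rows (GenericRowWithSchema), which matches the symptom described.

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Point(id: String, x: Double)  // hypothetical case class

    object SumX extends Aggregator[Point, Double, Double] {
      def zero: Double = 0.0
      def reduce(acc: Double, p: Point): Double = acc + p.x
      def merge(a: Double, b: Double): Double = a + b
      def finish(acc: Double): Double = acc
      def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // typed path, keeps the Point type:
    // ds.groupByKey(_.id).agg(SumX.toColumn)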

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-23 Thread Fawze Abujaber
Quick question: how do I add --jars /path/to/sparklens_2.11-0.1.0.jar to the spark-defaults conf? Should it be spark.driver.extraClassPath /path/to/sparklens_2.11-0.1.0.jar, or should I use the spark.jars option? Anyone who could give an example of how it should be, and if the path for the
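
A sketch of possible spark-defaults.conf entries (the jar path is the example from the question; the listener class is the one documented in Sparklens's README, so verify it against the version in use). Note the semantics differ: spark.jars ships the jar to the driver and executors, while spark.driver.extraClassPath only prepends a path to the driver's classpath.

    spark.jars            /path/to/sparklens_2.11-0.1.0.jar
    spark.extraListeners  com.qubole.sparklens.QuboleJobListener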

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
Hi Maropu, the problem seems to be in FilterEstimation.scala on lines 50 and 52: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala?utf8=✓#L50-L52 val filterSelectivity =

Apache Spark Structured Streaming - How to keep executor alive.

2018-03-23 Thread M Singh
Hi: I am working on Spark Structured Streaming (2.2.1) with Kafka and want 100 executors to stay alive. I set spark.executor.instances to 100. The process starts running with 100 executors, but after some time only a few remain, which causes a backlog of events from Kafka. I thought I saw a
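
The message is truncated, but one common cause of this behavior, offered here as an assumption, is dynamic allocation releasing idle executors regardless of spark.executor.instances. A spark-defaults.conf sketch along those lines:

    spark.executor.instances               100
    spark.dynamicAllocation.enabled        false
    # or, if dynamic allocation must stay enabled:
    # spark.dynamicAllocation.minExecutors 100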

[Spark Core] details of persisting RDDs

2018-03-23 Thread Stefano Pettini
Hi, a couple of questions about the internals of the persist mechanism (RDD, but maybe also applicable to DS/DF). Data is processed stage by stage, so what actually runs on the worker nodes is the calculation of the partitions of a stage's result, not the individual RDDs. Operation of all the RDDs
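
As a small illustration of the mechanism being asked about (spark-shell style; the file path is a placeholder): persist() only marks the RDD, and its partitions are materialized and cached the first time an action computes the stage containing them.

    val rdd = sc.textFile("hdfs:///tmp/data.txt").map(_.length).persist()
    rdd.count()  // first action: computes the stage and caches the partitions
    rdd.sum()    // second action: reuses the cached partitions instead of recomputing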

ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-23 Thread Eirik Thorsnes
Hi all, I'm trying the new native ORC support in Spark 2.3 (org.apache.spark.sql.execution.datasources.orc). I've compiled Spark 2.3 from the git branch-2.3 as of March 20th. I also get the same error with Spark 2.2 from Hortonworks HDP 2.6.4. *NOTE*: the error only occurs with zlib compression, and
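
A hedged sketch of the setup described (output path and data are placeholders): selecting the native ORC implementation and writing with zlib, with the read back being the step reported to fail.

    spark.conf.set("spark.sql.orc.impl", "native")

    val df = spark.range(10).toDF("id")
    df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_zlib")
    spark.read.orc("/tmp/orc_zlib").show()  // reportedly throws java.nio.BufferUnderflowException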

Re: how to use lit() in spark-java

2018-03-23 Thread Anthony, Olufemi
You can use a static import to import it directly: import static org.apache.spark.sql.functions.lit; Femi

Re: how to use lit() in spark-java

2018-03-23 Thread Anil Langote
You have to import functions: dataset.withColumn(columnName, functions.lit("constant")) Thank you Anil Langote

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Takeshi Yamamuro
hi, What's a query to reproduce this? It seems the exception is thrown when casting a double to BigDecimal. // maropu
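
A small illustration of the failure mode being pointed at: java.math.BigDecimal's double constructor rejects NaN and infinity, so a NaN or infinite selectivity estimate would produce exactly this exception (that the estimate is NaN here is an assumption, not something established in the thread).

    val ok  = new java.math.BigDecimal(0.5)         // fine
    val bad = new java.math.BigDecimal(Double.NaN)  // throws java.lang.NumberFormatException: Infinite or NaN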

how to use lit() in spark-java

2018-03-23 Thread 崔苗
Hi guys, I want to add a constant column to a dataset with the lit function in Java, like this: dataset.withColumn(columnName, lit("constant")) but it seems that IDEA couldn't find the lit() function, so how do I use the lit() function in Java? Thanks for any reply

Re: Spark and Accumulo Delegation tokens

2018-03-23 Thread Saisai Shao
It is in the yarn module: "org.apache.spark.deploy.yarn.security.ServiceCredentialProvider".
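
A skeleton of how such a provider might look (class name and service name are hypothetical; the signature matches the Spark 2.3 yarn module, but verify against the exact version in use). Implementations are discovered via Java's ServiceLoader, i.e. a META-INF/services entry named org.apache.spark.deploy.yarn.security.ServiceCredentialProvider.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.Credentials
    import org.apache.spark.SparkConf
    import org.apache.spark.deploy.yarn.security.ServiceCredentialProvider

    class AccumuloCredentialProvider extends ServiceCredentialProvider {
      override def serviceName: String = "accumulo"

      override def obtainCredentials(
          hadoopConf: Configuration,
          sparkConf: SparkConf,
          creds: Credentials): Option[Long] = {
        // fetch an Accumulo delegation token and add it to `creds`;
        // return the next renewal time, or None if the token is not renewable
        None
      }
    }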

[Spark Core] Getting the number of stages a job is made of

2018-03-23 Thread Stefano Pettini
Hi everybody, this is my first message to the mailing list. In a world of DataFrames and Structured Streaming my use cases may be considered kind of corner cases, but still I think it's important to address such problems and go deep in understanding how Spark RDDs work. We have an application
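
The message is truncated before the actual question, but one concrete way to read off the number of stages per job is a SparkListener, sketched here: SparkListenerJobStart carries the stage infos of the submitted job.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

    class StageCountListener extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
        println(s"Job ${jobStart.jobId} consists of ${jobStart.stageInfos.size} stages")
      }
    }

    // sc.addSparkListener(new StageCountListener())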

Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
Hi all, I am using Spark 2.3 with the cost-based optimizer activated and a couple of Hive tables that were analyzed previously. I am getting the following exception for different queries: java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:494) at
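
For context, a sketch of the setup being described (database, table, and column names are placeholders): CBO is switched on via configuration and the tables carry column statistics from ANALYZE.

    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.sql("ANALYZE TABLE mydb.mytab COMPUTE STATISTICS FOR COLUMNS col1, col2")
    spark.sql("SELECT * FROM mydb.mytab WHERE col1 > 0").explain(true)  // queries like this reportedly fail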

Calculate co-occurring terms

2018-03-23 Thread Donni Khan
Hi, I have a collection of text documents, and I extracted the list of significant terms from that collection. I want to calculate a co-occurrence matrix for the extracted terms using Spark. I actually stored the collection of text documents in a DataFrame, StructType schema = *new*
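
The message is cut off before the approach itself; as a hypothetical sketch (column and term values are made up), pairwise co-occurrence counts can be derived with a double explode over a terms column:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val docs = Seq(
      Seq("spark", "sql", "stream"),
      Seq("spark", "sql")
    ).toDF("terms")

    val cooccur = docs
      .select(explode($"terms").as("a"), $"terms")
      .select($"a", explode($"terms").as("b"))
      .filter($"a" < $"b")    // keep each unordered pair once
      .groupBy("a", "b")
      .count()                // co-occurrence count per term pair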

Re: Spark and Accumulo Delegation tokens

2018-03-23 Thread Jorge Machado
Hi Jerry, where do you see that class in Spark? I only found HadoopDelegationTokenManager and I don’t see any way to add my provider into it. private def getDelegationTokenProviders: Map[String, HadoopDelegationTokenProvider] = { val providers = List(new

Re: Spark and Accumulo Delegation tokens

2018-03-23 Thread Saisai Shao
I think you can build your own Accumulo credential provider, similar to HadoopDelegationTokenProvider, outside of Spark; Spark already provides an interface, "ServiceCredentialProvider", for users to plug in a customized credential provider. Thanks Jerry

Spark and Accumulo Delegation tokens

2018-03-23 Thread Jorge Machado
Hi guys, I’m in the middle of writing a DataSource connector for Apache Spark to connect to Accumulo tablets. Because we have Kerberos it gets a little tricky, since Spark only handles the delegation tokens for HBase, Hive and HDFS. Would be a PR for an implementation of

Re: Structured Streaming Spark 2.3 Query

2018-03-23 Thread Bowden, Chris
Use a streaming query listener that tracks repetitive progress events for the same batch id. If x amount of time has elapsed with repetitive progress events for the same batch id, the source is not providing new offsets and stream execution is not scheduling new micro-batches. See also:
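
A skeleton of the listener being described (the idle threshold and the reporting are placeholder choices):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    class IdleSourceListener(idleMillis: Long = 60000L) extends StreamingQueryListener {
      private var lastBatchId = -1L
      private var lastChangeAt = System.currentTimeMillis()

      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val batchId = event.progress.batchId
        if (batchId != lastBatchId) {
          lastBatchId = batchId
          lastChangeAt = System.currentTimeMillis()
        } else if (System.currentTimeMillis() - lastChangeAt > idleMillis) {
          // the same batch id keeps repeating: no new offsets, no new micro-batches
          println(s"Query ${event.progress.id} appears idle")
        }
      }
    }

    // spark.streams.addListener(new IdleSourceListener())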