Does Spark Structured Streaming have a JDBC sink, or do I need to use ForeachWriter?

2018-06-20 Thread kant kodali
Hi All, Does Spark Structured Streaming have a JDBC sink, or do I need to use ForeachWriter? I see the following code in this link and I can see that the database name can be passed in the connection string; however, I
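[Editor's note: a minimal sketch of the ForeachWriter route, since there is no built-in streaming JDBC sink in this Spark version. The JDBC URL, the events(key, value) table, and the two-string row layout are illustrative assumptions, not from the thread.]

    import java.sql.{Connection, DriverManager, PreparedStatement}
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Writes each streaming row to a JDBC table; one connection per partition and epoch.
    class JdbcSinkWriter(url: String, user: String, password: String)
        extends ForeachWriter[Row] {
      private var conn: Connection = _
      private var stmt: PreparedStatement = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        conn = DriverManager.getConnection(url, user, password)
        stmt = conn.prepareStatement("INSERT INTO events (key, value) VALUES (?, ?)")
        true // accept this partition/epoch
      }

      override def process(row: Row): Unit = {
        stmt.setString(1, row.getString(0))
        stmt.setString(2, row.getString(1))
        stmt.executeUpdate()
      }

      override def close(errorOrNull: Throwable): Unit = {
        if (stmt != null) stmt.close()
        if (conn != null) conn.close()
      }
    }

    // Usage: streamingDf.writeStream.foreach(new JdbcSinkWriter(jdbcUrl, user, pass)).start()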

Re: Lag and queued up batches info in Structured Streaming UI

2018-06-20 Thread Tathagata Das
Also, you can get information about the last progress made (input rates, etc.) from StreamingQuery.lastProgress, StreamingQuery.recentProgress, and using StreamingQueryListener. It's all documented -
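[Editor's note: a short sketch of what that looks like, assuming query is a running StreamingQuery handle.]

    // Most recent micro-batch: row counts plus input and processing rates.
    val p = query.lastProgress
    println(s"batch ${p.batchId}: ${p.numInputRows} rows, " +
      s"${p.inputRowsPerSecond} in/s, ${p.processedRowsPerSecond} processed/s")

    // The last few progress reports kept in memory, as JSON.
    query.recentProgress.foreach(r => println(r.json))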

Re: Lag and queued up batches info in Structured Streaming UI

2018-06-20 Thread Tathagata Das
Structured Streaming does not maintain a queue of batches like DStreams do. DStreams used to cut off batches at a fixed interval and put them in a queue, and a different thread processed the queued batches. In contrast, Structured Streaming simply cuts off and immediately processes a batch after the previous

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Thanks for coming back with the solution! Sorry my suggestion did not help. Daniel On Wed, 20 Jun 2018, 21:46 mattl156, wrote: > Alright so I figured it out. > > When reading from and writing to Hive metastore Parquet tables, Spark SQL > will try to use its own Parquet support instead of Hive

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread mattl156
Alright, so I figured it out. When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of the Hive SerDe for better performance. And so settings like the ones below have no impact.
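[Editor's note: a hedged sketch of the workaround usually paired with those settings, assuming the problem is that Spark's native Parquet reader ignores them: fall back to the Hive SerDe path and enable recursive input directories. Whether the performance trade-off is acceptable depends on the table.]

    // Disable Spark's native Parquet conversion so the Hive SerDe reader
    // (which honors the recursive-directory settings) is used instead.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    spark.sql("SELECT count(*) FROM my_db.my_table").show()  // table name is illustrative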

Lag and queued up batches info in Structured Streaming UI

2018-06-20 Thread SRK
Hi, how do we get information like lag and queued-up batches in Structured Streaming? The following API does not seem to give any info about lag and queued-up batches similar to DStreams. https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/streaming/scheduler/BatchInfo.html Thanks!

Re: Blockmgr directories intermittently not being cleaned up

2018-06-20 Thread tBoyle
I'm experiencing the same behaviour with shuffle data being orphaned on disk (Spark 2.0.1 with Spark Streaming). We are using AWS R4 EC2 instances with 300GB EBS volumes attached; most spilled shuffle data is eventually cleaned up by the ContextCleaner within 10 minutes. We do not use the

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
If it is difficult to create the small stand-alone program, another approach would be to attach everything (i.e. configuration, data, program, console output, log, history server data, etc.). As for the log, the community would recommend the INFO log with "spark.sql.codegen.logging.maxLines=2147483647".
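[Editor's note: a tiny sketch of applying the quoted setting in a session before re-running the failing query, assuming it is runtime-settable; it can also be passed to spark-submit with --conf.]

    // Capture the full generated code in the INFO log instead of truncating it.
    spark.conf.set("spark.sql.codegen.logging.maxLines", "2147483647")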

Re: spark kudu issues

2018-06-20 Thread William Berkeley
What you're seeing is the expected behavior in both cases. One way to achieve the semantics you want in both situations is to read the Kudu table into a DataFrame, filter it in Spark SQL to contain just the rows you want to delete, and then use that DataFrame to do the deletion. There
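[Editor's note: a rough sketch of that read-filter-delete pattern with the kudu-spark integration. The master address, table name, primary-key column "id", and delete predicate are illustrative assumptions.]

    import org.apache.kudu.spark.kudu.KuduContext

    val kuduMaster  = "kudu-master:7051"
    val tableName   = "impala::db.events"
    val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

    // Read the table, keep only rows that actually exist and match the predicate,
    // then hand exactly those keys to deleteRows so no missing key is ever referenced.
    val existing = spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", kuduMaster)
      .option("kudu.table", tableName)
      .load()

    val toDelete = existing.filter("event_date < '2018-01-01'").select("id")  // "id" = assumed key
    kuduContext.deleteRows(toDelete, tableName)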

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread mattl156
Thanks. Unfortunately I don't have control over how data is inserted, and the table is not partitioned. The reason the subdirectories are being created is that when Tez does an INSERT into a table from a UNION query, it creates subdirectories so that it can write in parallel. I've also

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Aakash Basu
Hi Kazuaki, It would be really difficult to produce small stand-alone code to reproduce this problem because I'm running through a big pipeline of feature engineering where I derive a lot of variables based on the present ones, which explodes the size of the table many-fold. Then, when I do any

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
Spark 2.3 tries to split large generated Java methods into small methods where possible. However, per this report there may remain places that generate a large method. Would it be possible to create a JIRA entry with a small stand-alone program that can reproduce this problem? It would be very helpful

spark kudu issues

2018-06-20 Thread Pietro Gentile
Hi all, I am currently evaluating using Spark with Kudu, and I am facing the following issues: 1) If you try to DELETE a row with a key that is not present in the table, you will get an exception like this: java.lang.RuntimeException: failed to write N rows from DataFrame to Kudu; sample errors:

Re: G1GC vs ParallelGC

2018-06-20 Thread vaquar khan
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html Regards, Vaquar khan On Wed, Jun 20, 2018, 1:18 AM Aakash Basu wrote: > Hi guys, > > I just wanted to know, why my ParallelGC (*--conf > "spark.executor.extraJavaOptions=-XX:+UseParallelGC"*) in a

Re: load hbase data using spark

2018-06-20 Thread vaquar khan
Why do you need a tool? You can directly connect to HBase using Spark. Regards, Vaquar khan On Jun 18, 2018 4:37 PM, "Lian Jiang" wrote: Hi, I am considering tools to load hbase data using spark. One choice is https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to be out-of-date
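[Editor's note: a minimal sketch of the direct route with the stock HBase Hadoop input format, no third-party connector. The table name is an illustrative assumption, and hbase-site.xml is assumed to be on the classpath to supply the ZooKeeper quorum.]

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    // Point the scan at one table; connection details come from hbase-site.xml.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each record is (row key, full Result); pull out whatever columns you need per row.
    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    val rowKeys = hbaseRdd.map { case (key, _) => Bytes.toString(key.get()) }
    println(rowKeys.count())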

Re: Spark application completes its job successfully on YARN cluster but YARN registers it as failed

2018-06-20 Thread Sonal Goyal
Have you checked the logs - they probably should have some more details. On Wed 20 Jun, 2018, 2:51 PM Soheil Pourbafrani, wrote: > Hi, > > I run a Spark application on Yarn cluster and it complete the process > successfully, but at the end Yarn print in the console: > > client token: N/A >

Apache Spark use case: correlate data strings from file

2018-06-20 Thread darkdrake
Hi, I’m new to Spark and I’m trying to understand if it can fit my use case. I have the following scenario. I have a file (it can be a log file, .txt, .csv, .xml or .json; I can produce the data in whatever format I prefer) with some data, e.g.: *Event “X”, City “Y”, Zone “Z”* with different

Spark application completes its job successfully on YARN cluster but YARN registers it as failed

2018-06-20 Thread Soheil Pourbafrani
Hi, I run a Spark application on a YARN cluster and it completes the process successfully, but at the end YARN prints in the console: client token: N/A diagnostics: Application application_1529485137783_0004 failed 4 times due to AM Container for appattempt_1529485137783_0004_04 exited with

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Hi Matt, What I tend to do is partition by date in the following way: s3://data-lake/pipeline1/extract_year=2018/extract_month=06/extract_day=20/file1.json See, the pattern is key=value for physical partitions. When you read that like this: spark.read.json("s3://data-lake/pipeline1/") it will
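[Editor's note: a small sketch of that layout end to end, assuming df is the DataFrame being written; the bucket and column names are the ones from the example path.]

    // Write with physical key=value partitions under the root path.
    df.write
      .partitionBy("extract_year", "extract_month", "extract_day")
      .mode("append")
      .json("s3://data-lake/pipeline1/")

    // Reading the root recovers the partition columns as regular columns,
    // and filtering on them prunes whole directories instead of scanning everything.
    val events = spark.read.json("s3://data-lake/pipeline1/")
    events.filter("extract_year = 2018 AND extract_month = 6 AND extract_day = 20").show()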

[Spark Streaming] Are SparkListener/StreamingListener callbacks called concurrently?

2018-06-20 Thread Majid Azimi
Hi, What is the concurrency model behind SparkListener or StreamingListener callbacks? 1. Multiple threads might access callbacks simultaneously. 2. Callbacks are guaranteed to be executed by a single thread. (Thread IDs might change on consecutive calls, though.) I asked the same question on

Way to avoid CollectAsMap in RandomForest

2018-06-20 Thread Aakash Basu
Hi, I'm running the RandomForest model from the Spark ML API on medium-sized data (2.25 million rows and 60 features). Most of my time goes into the collectAsMap of RandomForest, but I have no option to avoid it as it is in the API. Is there a way to cut short my end-to-end runtime? Thanks, Aakash.

G1GC vs ParallelGC

2018-06-20 Thread Aakash Basu
Hi guys, I just wanted to know why my ParallelGC (*--conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"*) in a very long Spark ML Pipeline works faster than when I set G1GC (*--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"*), even though the Spark community suggests G1GC to be much
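[Editor's note: for comparison runs, a minimal sketch of switching the collector per run. It assumes the option is set before executors are launched, either on spark-submit as quoted above or programmatically as below; the app name is illustrative.]

    import org.apache.spark.sql.SparkSession

    // Swap the value between runs to compare the two collectors on the same pipeline.
    val spark = SparkSession.builder()
      .appName("gc-comparison")
      .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // or "-XX:+UseParallelGC"
      .getOrCreate()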