Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently. The reason
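A minimal Scala sketch of that per-column pipeline approach (the DataFrame df and the column names below are assumptions for illustration, not code from the thread):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.StringIndexer

    // Hypothetical categorical columns; df is assumed to be an existing DataFrame containing them.
    val categoricalCols = Seq("category_a", "category_b")

    // One StringIndexer per categorical column, chained into a single Pipeline.
    val indexers = categoricalCols.map { c =>
      new StringIndexer()
        .setInputCol(c)
        .setOutputCol(s"${c}_indexed")
    }

    val pipeline = new Pipeline().setStages(indexers.toArray)
    val indexed = pipeline.fit(df).transform(df)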

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Weichen Xu
Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP. Thanks! On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath wrote: > For now, you must follow this approach of constructing a pipeline > consisting of a StringIndexer for each categorical column.

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Md. Rezaul Karim
Hi Nick, Both approaches worked and I realized my silly mistake too. Thank you so much. @Xu, thanks for the update. Best Regards, *Md. Rezaul Karim*, BSc, MSc Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway IDA

Programmatically get status of job (WAITING/RUNNING)

2017-10-30 Thread Behroz Sikander
Hi, I have a Spark cluster running in client mode. I programmatically submit jobs to the Spark cluster; under the hood, I am using spark-submit. If my cluster is overloaded and I start a context, the driver JVM keeps waiting for executors. The executors are in a waiting state because the cluster does
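One way to get at this programmatically (a sketch only, with jar path, main class and master URL as placeholders) is to launch through SparkLauncher and poll the returned SparkAppHandle instead of shelling out to spark-submit; note this reports the launcher's view of the state (SUBMITTED, RUNNING, FINISHED, ...), not the standalone master's WAITING flag:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Hypothetical application jar, main class and master URL.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("spark://master:7077")
      .setDeployMode("client")
      .startApplication()

    // Poll until the application reaches a final state.
    while (!handle.getState.isFinal) {
      println(s"Current state: ${handle.getState}")
      Thread.sleep(5000)
    }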

Re: Anyone knows how to build and spark on jdk9?

2017-10-30 Thread Steve Loughran
On 27 Oct 2017, at 19:24, Sean Owen wrote: Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is in place already, and the last issue is dealing with how Scala closures are now implemented quite differently with lambdas /

spark sql truncate function

2017-10-30 Thread David Hodeffi
I saw that it is possible to truncate a date with the MM or YY format, but it is not possible to truncate by WEEK, HOUR, or MINUTE. Am I right? Is there any objection to supporting it, or is it just not implemented yet? Thanks, David
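For reference, trunc in org.apache.spark.sql.functions only accepts year- and month-style formats, so finer granularities need a manual workaround on Spark 2.2 (later releases added date_trunc, which supports WEEK, HOUR and MINUTE). A sketch, assuming a DataFrame df with a timestamp column ts:

    import org.apache.spark.sql.functions._

    // trunc supports only year ("year", "yyyy", "yy") and month ("month", "mon", "mm") formats.
    val byMonth = df.select(trunc(col("ts"), "MM").as("month_start"))

    // Workaround for hour granularity: round the epoch seconds down manually.
    val byHour = df.select(
      (floor(col("ts").cast("long") / 3600) * 3600).cast("timestamp").as("hour_start")
    )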

share datasets across multiple spark-streaming applications for lookup

2017-10-30 Thread roshan joe
Hi, What is the recommended way to share datasets across multiple spark-streaming applications, so that the incoming data can be looked up against this shared dataset? The shared dataset is also incrementally refreshed and stored on S3. Below is the scenario. Streaming App-1 consumes data from
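One common pattern (a sketch only; the S3 path, schema and join key are assumptions) is a stream-static join, where each application reads the shared dataset from S3 as a static DataFrame and joins the incoming stream against it; whether refreshed files are picked up automatically depends on the source, so a scheduled re-read may still be needed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lookup-join").getOrCreate()

    // Hypothetical S3 location of the shared, incrementally refreshed dataset.
    val lookup = spark.read.parquet("s3a://my-bucket/shared/lookup/")

    // Hypothetical Kafka source for the incoming stream.
    val incoming = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Join each micro-batch against the lookup dataset on an assumed key column.
    val enriched = incoming
      .selectExpr("CAST(key AS STRING) AS lookup_key", "CAST(value AS STRING) AS value")
      .join(lookup, Seq("lookup_key"))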

Question regarding cached partitions

2017-10-30 Thread Alex Sulimanov
Hi, I started Spark Streaming job with 96 executors which reads from 96 Kafka partitions and applies mapWithState on the incoming DStream. Why would it cache only 77 partitions? Do I have to allocate more memory? Currently each executor gets 10 GB and it is not clear why it can't cache all 96
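If the storage memory is simply too small for the state, one thing to try (illustrative values only, not a recommendation) is raising the executor heap or the fraction reserved for cached blocks:

    import org.apache.spark.SparkConf

    // Illustrative sizing; the right values depend on the state size per partition.
    val conf = new SparkConf()
      .set("spark.executor.memory", "14g")        // more heap per executor
      .set("spark.memory.fraction", "0.7")        // share of heap for execution plus storage
      .set("spark.memory.storageFraction", "0.5") // portion of that protected for cached blocks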

RE: Split column with dynamic data

2017-10-30 Thread Aakash Basu
Hey buddy, Thanks a TON! Issue resolved. Thanks again, Aakash. On 30-Oct-2017 11:44 PM, "Hondros, Constantine (ELS-AMS)" < c.hond...@elsevier.com> wrote: > You should just use regexp_replace to remove all the leading number > information (assuming it ends with a full-stop, and catering for

Split column with dynamic data

2017-10-30 Thread Aakash Basu
Hi all, I have a requirement to split a column and fetch only the description, where some rows have numbers prepended before the description while other rows contain only the description. E.g. (Description is the column header): *Description* Inventory Tree Products 1. AT Services 2. Accessories 4.

Getting RabbitMQ Message Delivery Tag (Stratio/spark-rabbitmq)

2017-10-30 Thread Daniel de Oliveira Mantovani
Hello, I'm using Stratio/spark-rabbitmq to read messages from RabbitMQ and save them to Kafka, and I just want to "commit" the RabbitMQ message once it is safe on Kafka's broker. For efficiency purposes I'm using the Kafka buffer and a callback object, which should have the RabbitMQ message delivery tag to
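A sketch of the Kafka-producer side of that idea: only acknowledge the RabbitMQ message once the broker confirms the write. The deliveryTag plumbing and the ackFn that would wrap channel.basicAck are assumptions here, since how the Stratio receiver exposes the tag is exactly the open question:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Hypothetical: deliveryTag comes from the RabbitMQ message, ackFn wraps channel.basicAck.
    def sendAndAck(topic: String, payload: String, deliveryTag: Long, ackFn: Long => Unit): Unit = {
      producer.send(new ProducerRecord[String, String](topic, payload), new Callback {
        override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
          if (exception == null) ackFn(deliveryTag) // ack RabbitMQ only after Kafka confirms
          else exception.printStackTrace()          // leave unacked so RabbitMQ redelivers
        }
      })
    }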

executors processing tasks sequentially

2017-10-30 Thread Ramanan, Buvana (Nokia - US/Murray Hill)
Hi All, Using Spark 2.2.0 on a YARN cluster. I am running the Kafka Direct Stream wordcount example code (pasted below my signature). My topic consists of 400 partitions, and the Spark job tracker page shows 26 executors to process the corresponding 400 tasks. When I check the execution
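Concurrency is bounded by executors times cores per executor, so with 26 executors the 400 tasks can only run a slice at a time. An illustrative (not prescriptive) sizing, expressed as configuration:

    import org.apache.spark.SparkConf

    // Illustrative: 50 executors x 4 cores = 200 concurrent tasks for 400 partitions.
    val conf = new SparkConf()
      .set("spark.executor.instances", "50")
      .set("spark.executor.cores", "4")
      .set("spark.dynamicAllocation.enabled", "false") // fixed sizing for predictable parallelism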

The parameter spark.yarn.executor.memoryOverhead

2017-10-30 Thread Ashok Kumar
Hi Gurus, The parameter spark.yarn.executor.memoryOverhead is explained as below: spark.yarn.executor.memoryOverhead executorMemory * 0.10, with minimum of 384 The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads,
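For completeness, a sketch of setting it explicitly alongside the heap it is layered on top of (values are in megabytes for the overhead and purely illustrative):

    import org.apache.spark.SparkConf

    // Raise the off-heap allowance beyond the 10% default (value in MB; 1024 is illustrative).
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")                // the heap the 10% default is based on
      .set("spark.yarn.executor.memoryOverhead", "1024")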

RE: Split column with dynamic data

2017-10-30 Thread Hondros, Constantine (ELS-AMS)
You should just use regexp_replace to remove all the leading number information (assuming it ends with a full-stop, and catering for the possibility of a capital letter). This is untested, but it should do the trick based on your examples so far: df.withColumn("new_column",
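A sketch of that suggestion written out, assuming the column is named Description, the prefix is digits followed by a full stop, and the output column name is arbitrary:

    import org.apache.spark.sql.functions.{col, regexp_replace}

    // Strip an optional leading "<number>. " prefix, e.g. "1. AT Services" -> "AT Services".
    val cleaned = df.withColumn(
      "new_column",
      regexp_replace(col("Description"), "^\\d+\\.\\s*", "")
    )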