Support for group aggregate pandas UDF in streaming aggregation for SPARK 3.0 python

2020-08-11 Thread Aesha Dhar Roy
Hi, is there any plan to remove the limitation mentioned below? *Streaming aggregation doesn't support group aggregate pandas UDF* We want to run our data modelling jobs in real time using Spark 3.0 and Kafka 2.4, and need support for custom aggregate pandas UDFs on stream windows. Is there

External hive metastore (remote) managed tables

2020-05-28 Thread Debajyoti Roy
Hi, does anyone know the behavior of dropping managed tables with an external Hive metastore: is the data (e.g. in an object store) deleted by Spark SQL or by the external Hive metastore? I am confused by the local-mode and remote-mode code paths.

Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
> also can improve the existing CBO and make it more general. The paper of Spark SQL was published 5 years ago. A lot of great contributions were made in the past 5 years. > > Cheers, > > Xiao > > On Wed, Jan 15, 2020 at 9:23 AM, Debajyoti Roy wrote: > >> Thanks all, and Matei

Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks all, and Matei. TL;DR of the conclusion for my particular case: Qualitatively, while Catalyst[1] tries to mitigate the learning curve and maintenance burden, it lacks the dynamic programming approach used by Calcite[2] and risks falling into local minima. Quantitatively, there is no

Spark Dataset transformations for time based events

2018-12-25 Thread Debajyoti Roy
-as-of-join-of-two-datasets-in-apache-spark 2. Snapshot of state with time to state with effective start and end time: https://stackoverflow.com/questions/53928372/given-dataset-of-state-snapshots-at-time-t-how-to-transform-it-into-dataset-with/53928400#53928400 Thanks in advance! Roy
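The second transformation mentioned above (state snapshots at a time t turned into states with an effective start and end time) can be sketched in plain Python. This is only one reading of the problem, not the approach from the linked question: the function name `snapshots_to_intervals`, the `(time, state)` tuple layout, and the use of `None` for a still-open interval are all assumptions made for illustration.

```python
from typing import Hashable, Iterable, List, Optional, Tuple

Interval = Tuple[Hashable, int, Optional[int]]


def snapshots_to_intervals(snapshots: Iterable[Tuple[int, Hashable]]) -> List[Interval]:
    """Collapse (time, state) snapshots into (state, start, end) intervals.

    Hypothetical helper: consecutive snapshots with the same state are merged,
    and the last interval is left open (end is None).
    """
    ordered = sorted(snapshots, key=lambda s: s[0])  # process in time order
    intervals: List[Interval] = []
    for t, state in ordered:
        if intervals and intervals[-1][0] == state:
            continue  # state unchanged, keep extending the current interval
        if intervals:
            # A state change closes the previous interval at this timestamp.
            prev_state, prev_start, _ = intervals[-1]
            intervals[-1] = (prev_state, prev_start, t)
        intervals.append((state, t, None))
    return intervals
```

On a distributed dataset the same logic would need the snapshots of one entity to be ordered by time before the change points are detected; how best to express that in Spark is left open here.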

Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread Debajyoti Roy
The problem statement and an approach to solve it using windows is described here: https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e Looking for more elegant/performant solutions, if they exist. TIA!
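For reference, a common way to count simultaneous events is a sweep over start and end points: add 1 at each start, subtract 1 at each end, and track the running total. The sketch below is plain Python rather than Spark, and it is not taken from the linked question; the function name `max_simultaneous` and the half-open interval convention are assumptions.

```python
from typing import Iterable, List, Tuple


def max_simultaneous(events: Iterable[Tuple[int, int]]) -> int:
    """Return the peak number of events active at the same time.

    Each event is a (start, end) pair treated as the half-open interval
    [start, end), so an event ending at t does not overlap one starting at t.
    """
    deltas: List[Tuple[int, int]] = []
    for start, end in events:
        deltas.append((start, 1))   # event becomes active
        deltas.append((end, -1))    # event stops being active
    # Tuple ordering sorts -1 before +1 at equal timestamps,
    # so ends are applied before starts that share the same time.
    deltas.sort()
    active = peak = 0
    for _, delta in deltas:
        active += delta
        peak = max(peak, active)
    return peak
```

The running total is a cumulative sum over time-ordered rows, which is the kind of computation a time-ordered window can express, so this may be close to the window-based approach the post refers to.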

spark-submit config via file

2017-03-24 Thread , Roy
one know if this is even possible? Thanks... Roy

spark-itemsimilarity No FileSystem for scheme error

2016-01-05 Thread roy
Hi, we are using CDH 5.4.0 with Spark 1.5.2 (which doesn't come with CDH 5.4.0). I am following https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html to try to test/create a new algorithm with Mahout item-similarity. I am running the following command: ./bin/mahout

Error in load hbase on spark

2015-10-08 Thread Roy Wang
I want to load an HBase table into Spark. JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class); *When I call hBaseRDD.count(), I get an error.* Caused by: java.lang.IllegalStateException: The input format

python version in spark-submit

2015-10-01 Thread roy
Hi, we have python2.6 (default) on the cluster and have also installed python2.7. I am looking for a way to set the Python version in spark-submit. Does anyone know how to do this? Thanks

how to control timeout in node failure for spark task ?

2015-09-25 Thread roy
Hi, we are running Spark 1.3 on CDH 5.4.1 on top of YARN. We want to know how to control the task timeout when a node fails, so that tasks running on it are restarted on another node. At present the job waits approximately 10 minutes before restarting the tasks that were running on the failed node.

pyspark driver in cluster rather than gateway/client

2015-09-10 Thread roy
Hi, is there any way to make the Spark driver run inside YARN containers rather than on the gateway/client machine? At present, even with the config parameters --master yarn & --deploy-mode cluster, the driver runs on the gateway/client machine. We are on CDH 5.4.1 with YARN and Spark 1.3. Any help on this?

Re: The auxService:spark_shuffle does not exist

2015-07-07 Thread roy
We tried --master yarn-client, with no different result. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

The auxService:spark_shuffle does not exist

2015-07-06 Thread roy
I am getting the following error for a simple Spark job. I am running the following command: /spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar/ but the job doesn't show any

Yarn application ID for Spark job on Yarn

2015-06-22 Thread roy
Hi, is there a way to get the YARN application ID inside a Spark application when running a Spark job on YARN? Thanks

Spark job fails silently

2015-06-22 Thread roy
Hi, our Spark job on YARN suddenly started failing silently without showing any error. Following is the trace. Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default property:

spark on yarn failing silently

2015-06-22 Thread roy
Hi, our Spark job on YARN suddenly started failing silently without showing any error. Following is the trace in verbose mode. Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default

spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-28 Thread roy
Hi, suddenly Spark jobs started failing with the following error: Exception in thread main java.io.FileNotFoundException: /user/spark/applicationHistory/application_1432824195832_1275.inprogress (No such file or directory) Full trace here: [21:50:04 x...@hadoop-client01.dev:~]$ spark-submit --class

Re: Spark HistoryServer not coming up

2015-05-21 Thread roy
This got resolved after cleaning /user/spark/applicationHistory/*

Spark HistoryServer not coming up

2015-05-21 Thread roy
Hi, after restarting the Spark HistoryServer, it failed to come up. I checked the Spark HistoryServer logs and found the following messages: 2015-05-21 11:38:03,790 WARN org.apache.spark.scheduler.ReplayListenerBus: Log path provided contains no log files. 2015-05-21 11:38:52,319 INFO

How to process data in chronological order

2015-05-20 Thread roy
I have a key-value RDD; the key is a timestamp (femtosecond resolution, so grouping buys me nothing), and I want to reduce it in chronological order. How do I do that in Spark? I am fine with reducing contiguous sections of the set separately and then aggregating the resulting objects locally.
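The strategy described in the question (reduce contiguous sections separately, then combine the partial results in order) can be sketched in plain Python. It is only valid when the combining function is associative; it does not need to be commutative. The function name `chunked_chronological_reduce` and its parameters are hypothetical, and the sketch stands in for, rather than uses, the Spark API.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple, TypeVar

V = TypeVar("V")


def chunked_chronological_reduce(
    records: Sequence[Tuple[int, V]],
    combine: Callable[[V, V], V],
    num_chunks: int,
) -> V:
    """Reduce values in timestamp order, one contiguous chunk at a time.

    Assumes `combine` is associative, so reducing each chunk separately and
    then combining the partial results in chunk order gives the same answer
    as a single left-to-right reduce.
    """
    if not records:
        raise ValueError("records must not be empty")
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    ordered = sorted(records, key=lambda kv: kv[0])   # chronological order
    size = -(-len(ordered) // num_chunks)             # ceiling division
    partials: List[V] = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]                   # contiguous time range
        partials.append(reduce(combine, (v for _, v in chunk)))
    # Final "local" aggregation of the per-chunk results, still in order.
    return reduce(combine, partials)
```

In Spark, `sortByKey` range-partitions the data so that each partition holds a contiguous key range, which would play the role of the chunks here; the per-partition reduction and the final ordered combine would still have to be written explicitly.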

Possible to disable Spark HTTP server ?

2015-05-05 Thread roy
Hi, when we start a Spark job, it starts a new HTTP server for each new job. Is it possible to disable the HTTP server for each job? Thanks

spark.logConf with log4j.rootCategory=WARN

2015-05-01 Thread roy
Hi, I recently enabled log4j.rootCategory=WARN, console in the Spark configuration, but since then spark.logConf=True has become ineffective. I just want to confirm whether this is because of log4j.rootCategory=WARN. Thanks

shuffle.FetchFailedException in spark on YARN job

2015-04-18 Thread roy
Hi, my Spark job is failing with the following error message: org.apache.spark.shuffle.FetchFailedException: /mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index (No such file or directory) at

spark job progress-style report on console ?

2015-04-09 Thread roy
Hi, how do I get a progress-style report for a Spark job on the console? I tried to set --conf spark.ui.showConsoleProgress=true but it thanks

Spark SQL Avro Library for 1.2

2015-04-08 Thread roy
How do I build the Spark SQL Avro library for Spark 1.2? I was following https://github.com/databricks/spark-avro and was able to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from the project root, but we are on Spark 1.2 and need a compatible spark-avro jar. Any idea how

Spark 1.3 on CDH 5.3.1 YARN

2015-04-08 Thread roy
Hi, we have a cluster running CDH 5.3.2 and Spark 1.2 (which is the current version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the existing setup. Is it possible to have Spark 1.3 on the existing setup? Thanks

Re: can't union two rdds

2015-03-31 Thread roy
Use zip

Re: Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
-project.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2261) at org.spark-project.guava.common.cache.LocalCache.get(LocalCache.java:4000) thanks On Thu, Mar 26, 2015 at 7:27 PM, , Roy rp...@njit.edu wrote: We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2 Jobs link

Spark History Server : jobs link doesn't open

2015-03-26 Thread , Roy
We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2. The Jobs link on the Spark History Server doesn't open and shows the following message: HTTP ERROR: 500 Problem accessing /history/application_1425934191900_87572. Reason: Server Error -- *Powered by

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Do a *netstat -pnat | grep 404* and see which processes are running. Thanks Best Regards On Wed, Mar 25, 2015 at 1:13 AM, , Roy rp...@njit.edu wrote: I get the following message each time I run a Spark job 1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector

FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-24 Thread , Roy
thanks roy

Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread , Roy
Hi, I am using a CDH 5.3.2 package installation through Cloudera Manager 5.3.2. I am trying to run a Spark job with the following command: PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G --num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics

What do you think about the level of resource manager and file system?

2015-02-11 Thread Fangqi (Roy)
[Two inline images: image004.jpg, image005.png] Hi guys~ Comparing these two architectures, why does BDAS put YARN and Mesos under HDFS? Is there any special consideration, or is it just an easier way to depict the AMPLab stack? Best regards!

unsubscribe

2014-05-05 Thread Shubhabrata Roy
unsubscribe