How to start two Workers connected to two different masters
I have two Java applications sharing the same Spark cluster; the applications run on different servers. Based on my experience, if the Spark driver (inside the Java application) connects remotely to a Spark master running on a different node, the time to submit a job increases significantly compared with the scenario where the application runs on the same node as the master. So I need to run a Spark master beside each of my applications.

On the other hand, I don't want to waste resources by dividing the Spark cluster into two physically separate groups (two clusters), so I want to start two workers on each node of the cluster, each worker connecting to one of the masters and each able to access all of the cores. The problem is that Spark won't let me. First it said "you already have a worker, stop it first". I set SPARK_WORKER_INSTANCES to 2, but that just starts two workers connected to the same master on the first try!
Fwd: Re: spark-sql force parallel union
Thanks Kathleen,

1. So if I've got 4 DataFrames and I want "dfv1 union dfv2 union dfv3 union dfv4", would it first compute "dfv1 union dfv2" and "dfv3 union dfv4" independently and simultaneously, and then union their results?

2. There are going to be hundreds of partitions to union; wouldn't creating a temp view for each of them be slow?

Forwarded message
From: kathleen li
Date: Wed, 21 Nov 2018 10:16:21 +0330
Subject: Re: spark-sql force parallel union

You might first write the code to construct a query statement with "union all" like below:

    scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
    query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to register each partition as a view like below:

    for (i <- 1 to 3) {
      df.createOrReplaceTempView("dfv" + i)
    }

    scala> spark.sql(query).explain
    == Physical Plan ==
    Union
    :- LocalTableScan [_1#0, _2#1, _3#2]
    :- LocalTableScan [_1#0, _2#1, _3#2]
    +- LocalTableScan [_1#0, _2#1, _3#2]

You can use ROLLUP or GROUPING SETS for multiple dimensions to replace "union" or "union all".

On Tue, Nov 20, 2018 at 8:34 PM onmstester onmstester wrote:

I'm using Spark-SQL to query Cassandra tables. In Cassandra, I've partitioned my data by a time bucket and an id, so depending on the query I need to union multiple partitions with Spark-SQL and do the aggregation/group-by on the union result, something like this:

    for (all cassandra partitions) {
        Dataset<Row> currentPartition = sqlContext.sql(...);
        unionResult = unionResult.union(currentPartition);
    }

Increasing the input (the number of loaded partitions) increases response time more than linearly, because the unions are done sequentially. Since there is no harm in doing the unions in parallel, and I don't know how to force Spark to do them in parallel, right now I'm using a ThreadPool to asynchronously load all partitions in my application (which may cause OOM), and then do the sort or simple group-by in Java (which makes me wonder why I'm using Spark at all).

The short question is: how do I force Spark-SQL to load Cassandra partitions in parallel while doing a union on them? Also, I don't want too many tasks in Spark; with my home-made async solution I use coalesce(1), so each task is fast (only the wait time on Cassandra).
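To make Kathleen's union-all suggestion concrete on the Java side, here is a minimal sketch: register each per-partition DataFrame as a temp view, build one "union all" statement, and submit it as a single query, so Catalyst plans one Union over all the scans instead of a chain of pairwise unions. The loadPartition helper, the "dfv" view prefix, and the "id" column in the final aggregation are assumptions for illustration:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import java.util.ArrayList;
    import java.util.List;

    SparkSession spark = SparkSession.builder().appName("ParallelUnion").getOrCreate();

    int partitionCount = 4; // however many Cassandra partitions are needed
    List<String> selects = new ArrayList<>();
    for (int i = 1; i <= partitionCount; i++) {
        Dataset<Row> part = loadPartition(i); // hypothetical: one partition as a Dataset<Row>
        part.createOrReplaceTempView("dfv" + i);
        selects.add("select * from dfv" + i);
    }

    // One statement, one plan: the Union node covers all scans at the same level
    String query = String.join(" union all ", selects);
    Dataset<Row> unionResult = spark.sql(query);
    unionResult.groupBy("id").count().show();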
spark-sql force parallel union
I'm using Spark-SQL to query Cassandra tables. In Cassandra, I've partitioned my data by a time bucket and an id, so depending on the query I need to union multiple partitions with Spark-SQL and do the aggregation/group-by on the union result, something like this:

    for (all cassandra partitions) {
        Dataset<Row> currentPartition = sqlContext.sql(...);
        unionResult = unionResult.union(currentPartition);
    }

Increasing the input (the number of loaded partitions) increases response time more than linearly, because the unions are done sequentially. Since there is no harm in doing the unions in parallel, and I don't know how to force Spark to do them in parallel, right now I'm using a ThreadPool to asynchronously load all partitions in my application (which may cause OOM), and then do the sort or simple group-by in Java (which makes me wonder why I'm using Spark at all).

The short question is: how do I force Spark-SQL to load Cassandra partitions in parallel while doing a union on them? Also, I don't want too many tasks in Spark; with my home-made async solution I use coalesce(1), so each task is fast (only the wait time on Cassandra).
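One point worth noting about the loop above: Dataset.union is lazy, so the loop itself doesn't run any jobs; the more-than-linear cost usually comes from analyzing an increasingly deep, left-skewed plan. A possible mitigation, sketched here under the assumption that the partitions have already been loaded into a List<Dataset<Row>> (loading code omitted), is to union them pairwise so the plan becomes a balanced tree of depth log(n) instead of a chain of depth n:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import java.util.ArrayList;
    import java.util.List;

    // Union a list of Datasets as a balanced tree instead of a left-deep chain
    static Dataset<Row> balancedUnion(List<Dataset<Row>> parts) {
        List<Dataset<Row>> current = parts;
        while (current.size() > 1) {
            List<Dataset<Row>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += 2) {
                if (i + 1 < current.size()) {
                    next.add(current.get(i).union(current.get(i + 1)));
                } else {
                    next.add(current.get(i)); // odd one out carries over unchanged
                }
            }
            current = next;
        }
        return current.get(0);
    }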
Fwd: How to avoid long-running jobs blocking short-running jobs
You could use two separate pools with different weights for the ETL and REST jobs: with the ETL pool's weight at about 1 and the REST pool's weight at 1000, any time a REST job comes in it gets allocated all the resources. Details: https://spark.apache.org/docs/latest/job-scheduling.html

Forwarded message
From: conner
Date: Sat, 03 Nov 2018 12:34:01 +0330
Subject: How to avoid long-running jobs blocking short-running jobs

Hi,

I use a Spark cluster to run ETL jobs and analysis computations on the data after the ETL stage. The ETL jobs can keep running for several hours, but an analysis computation is a short-running job that finishes in a few seconds. The dilemma I'm trapped in is that my application runs in a single JVM and can't be a cluster application, so there is currently just one Spark context in my application. When the ETL jobs are running, they occupy all the resources, including the worker executors, for so long that they block all my analysis jobs.

My solution is to find a good way to divide the Spark cluster resources in two: one part for analysis jobs, the other for ETL jobs. If the ETL part is free, I can allocate analysis jobs to it. So I want to find a middleware that can support two Spark contexts and can be embedded in my application. I did some research on the third-party project Spark Job Server. It can divide Spark resources by launching another JVM to run a Spark context with specific resources, and these operations are invisible to the upper layer, so it would be a good solution for me. But that project runs in a single JVM and only supports a REST API, and I can't endure the data being transferred by TCP again, which is too slow for me. I want to get a result from the Spark cluster by TCP and give this result to the view layer to show.

Can anybody give me a good suggestion? I shall be so grateful.
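For reference, a minimal sketch of the pool setup within a single SparkContext, assuming FAIR scheduling and pools named "etl" and "rest" defined in a fairscheduler.xml (the path and pool names are placeholders); each request-handling thread tags its jobs with a pool via setLocalProperty:

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("SharedCluster")
        .config("spark.scheduler.mode", "FAIR")
        // pools (e.g. "etl" with weight 1, "rest" with weight 1000) are defined here
        .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
        .getOrCreate();

    // Short analysis jobs: tag the current thread with the high-weight pool
    spark.sparkContext().setLocalProperty("spark.scheduler.pool", "rest");
    spark.sql("select count(*) from some_table").show(); // some_table is a placeholder

    // Long ETL work would run from threads tagged with the low-weight pool:
    // spark.sparkContext().setLocalProperty("spark.scheduler.pool", "etl");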
Fwd: use spark cluster in java web service
Refer to: https://spark.apache.org/docs/latest/quick-start.html

1. Create a singleton SparkContext at the initialization of your cluster; the Spark context or Spark SQL object would then be accessible through a static method anywhere in your application. I recommend using fair scheduling on your context, to share resources among all input requests.

    SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();

2. From then on, with the sc or Spark SQL object, something like sparkSql.sql("select * from test").collectAsList() would run as a Spark job and return the result to your application.

Forwarded message
From: 崔苗 (Data and AI Product Development Department) <0049003...@znv.com>
To: "user"
Date: Thu, 01 Nov 2018 10:52:15 +0330
Subject: use spark cluster in java web service

Hi,

We want to use Spark in our Java web service, computing data in the Spark cluster according to each request. Now we have two problems:

1. How do we get a SparkSession for a remote Spark cluster (Spark on YARN mode)? We want to keep one SparkSession to execute all data computation.
2. How do we submit to a remote Spark cluster in Java code instead of via spark-submit, since we want to execute the Spark code in the response server?

Thanks for any replies.
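A minimal sketch of such a singleton holder, with the app name and master URL as placeholders; every request handler then shares the same session:

    import org.apache.spark.sql.SparkSession;

    // Process-wide holder for a single shared SparkSession.
    public final class SparkHolder {
        private static volatile SparkSession session;

        private SparkHolder() {}

        public static SparkSession get() {
            if (session == null) {
                synchronized (SparkHolder.class) {
                    if (session == null) {
                        session = SparkSession.builder()
                            .appName("WebServiceSpark")          // placeholder
                            .master("spark://master-host:7077")  // placeholder
                            .config("spark.scheduler.mode", "FAIR")
                            .getOrCreate();
                    }
                }
            }
            return session;
        }
    }

Usage from any request handler: SparkHolder.get().sql("select * from test").collectAsList();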
Fwd: Having access to spark results
What about using cache(), or saving as a global temp table for subsequent access?

Forwarded message
From: Affan Syed
To: "spark users"
Date: Thu, 25 Oct 2018 10:58:43 +0330
Subject: Having access to spark results

Spark users,

We would really like some input on how the results of a Spark query can be made accessible to a web application. Given how widely Spark is used in industry, I would have thought this part would have lots of answers/tutorials, but I didn't find anything. Here are a few options that come to mind:

1) Spark results are saved in another DB (perhaps a traditional one), and a query request returns the new table name for access through a paginated query. That seems doable, although a bit convoluted, as we need to handle the completion of the query.

2) Spark results are pumped into a messaging queue from which a socket-server-like connection is made.

What confuses me is that other connectors to Spark, like those for Tableau using something like JDBC, seem to have access to all the data (not just the top 500 rows that we typically get via Livy or other REST interfaces to Spark). How do those connectors get all the data through a single connection?

Can someone with expertise help bring clarity? Thank you.

Affan
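A minimal sketch of the cache()/global-temp-table suggestion (the query, table, and column names are made up for illustration): cache the result Dataset and register it as a global temp view, so any session in the same Spark application, e.g. one serving a later page request, can re-read it without recomputing:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("ResultsAccess").getOrCreate();

    // Run the expensive query once and keep the result in executor memory
    Dataset<Row> result = spark.sql("select user_id, count(*) as cnt from events group by user_id");
    result.cache();
    result.createOrReplaceGlobalTempView("latest_results");

    // Later, possibly from another session in the same application;
    // global temp views live under the reserved global_temp database
    spark.newSession()
         .sql("select * from global_temp.latest_results order by cnt desc limit 50")
         .show();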
Re: Spark In Memory Shuffle
Create the ramdisk:

    mount tmpfs /mnt/spark -t tmpfs -o size=2G

Then point spark.local.dir to the ramdisk. How you do this depends on your deployment strategy; for me it was through the SparkConf object before passing it to the SparkContext:

    conf.set("spark.local.dir", "/mnt/spark")

To validate that Spark is actually using your ramdisk (by default it uses /tmp), ls the ramdisk after running some jobs and you should see Spark directories (with the date in the directory name) on your ramdisk.

On Wed, 17 Oct 2018 18:57:14 +0330, ☼ R Nair wrote:

What are the steps to configure this? Thanks

On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester wrote:

Hi,
I failed to configure Spark for in-memory shuffle, so currently I'm just using a Linux memory-mapped directory (tmpfs) as the working directory of Spark, so everything is fast.
Re: Spark In Memory Shuffle
Hi,
I failed to configure Spark for in-memory shuffle, so currently I'm just using a Linux memory-mapped directory (tmpfs) as the working directory of Spark, so everything is fast.

On Wed, 17 Oct 2018 16:41:32 +0330, thomas lavocat wrote:

Hi everyone,

The possibility of in-memory shuffling is discussed in this issue: https://github.com/apache/spark/pull/5403. That was in 2015. In 2016, the paper "Scaling Spark on HPC Systems" said that Spark still shuffles using disks. I would like to know:

What is the current state of in-memory shuffling? Is it implemented in production? Does the current shuffle still use disks to work? Is it possible to somehow do it in RAM only?

Regards,
Thomas
createOrReplaceTempView causes memory leak
I'm loading some JSON files in a loop, deserializing each into a list of objects, creating a temp table from the list, and running a select on the table (repeated for every file):

    for (jsonFile : allJsonFiles) {
        sqlContext.sql("select * from mainTable").filter(...).createOrReplaceTempView("table1");
        sqlContext.createDataFrame(serializedObjectList, MyObject.class).createOrReplaceTempView("table2");
        sqlContext.sql("select * from table1 join table2").collectAsList();
    }

After processing 30 JSON files, my application crashes with OOM. If I add these two lines at the end of the for loop:

    sqlContext.dropTempTable("table1");
    sqlContext.dropTempTable("table2");

then my app continues to process all 500 JSON files with no problem. I've used createOrReplaceTempView in many places in my program; should I drop temp tables everywhere to free memory?
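For reference, a consolidated sketch of the loop with explicit cleanup, assuming a SparkSession named spark, a hypothetical deserialize helper, and a placeholder filter condition; on the SparkSession API the equivalent of dropTempTable is spark.catalog().dropTempView:

    import org.apache.spark.sql.SparkSession;
    import java.util.List;

    SparkSession spark = SparkSession.builder().appName("JsonLoop").getOrCreate();

    for (String jsonFile : allJsonFiles) {              // allJsonFiles defined elsewhere
        List<MyObject> objects = deserialize(jsonFile); // hypothetical helper
        spark.sql("select * from mainTable").filter("value > 0") // placeholder condition
             .createOrReplaceTempView("table1");
        spark.createDataFrame(objects, MyObject.class).createOrReplaceTempView("table2");
        spark.sql("select * from table1 join table2").collectAsList();

        // Drop the views so their plans/metadata don't accumulate across iterations
        spark.catalog().dropTempView("table1");
        spark.catalog().dropTempView("table2");
    }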
How to set spark.driver.memory?
I have a Spark cluster containing 3 nodes, and my application is a jar file run by java -jar. How can I set driver.memory for my application? spark-defaults.conf is only read by ./spark-submit, and "java --driver-memory -jar" fails with an exception.
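One option, sketched here with placeholder paths and values: when the driver JVM is started with plain java -jar, setting spark.driver.memory afterwards cannot resize a heap that is already running, so either pass -Xmx to that JVM directly or start the application through org.apache.spark.launcher.SparkLauncher, which lets you set driver memory programmatically before the driver JVM exists:

    import org.apache.spark.launcher.SparkLauncher;

    // Launch the app as a child process with the driver memory set up front.
    // Paths, class name, master URL, and sizes are placeholders.
    Process app = new SparkLauncher()
        .setAppResource("/path/to/myapp.jar")
        .setMainClass("com.example.MyApp")
        .setMaster("spark://master-host:7077")
        .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
        .launch();
    app.waitFor();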
enable jmx in standalone mode
How do I enable JMX for the Spark worker/executor/driver in standalone mode? I have added these lines to spark/conf/spark-defaults.conf:

    spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
        -Dcom.sun.management.jmxremote.port=9178 \
        -Dcom.sun.management.jmxremote.authenticate=false \
        -Dcom.sun.management.jmxremote.ssl=false

    spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote \
        -Dcom.sun.management.jmxremote.port=0 \
        -Dcom.sun.management.jmxremote.authenticate=false \
        -Dcom.sun.management.jmxremote.ssl=false

and ran stop-slave.sh and then start-slave.sh. But using netstat -anop | grep <executor-pid>, there is no port other than the Spark API port associated with the process.
spark optimized pagination
Hi,

I'm using Spark on top of Cassandra as the backend CRUD store of a RESTful application. Most of the REST APIs retrieve a huge amount of data from Cassandra and do a lot of aggregation on it in Spark, which takes a few seconds.

Problem: sometimes the output would be a big list that makes the client browser throw a "stop script" error, so we should paginate the result on the server side. But it would be annoying for the user to wait a few seconds on every page for the Cassandra/Spark processing.

Current dummy solution: for now I was thinking about assigning a UUID to each request, which would be sent back and forth between the server side and the client side. The first time a REST API is invoked, the result would be saved in a temp table, and in subsequent similar requests (requests for the next pages) the result would be fetched from the temp table (instead of the common flow of retrieving from Cassandra and aggregating in Spark, which takes some time). When a memory limit is hit, the old results would be deleted.

Is there any built-in, clean caching strategy in Spark to handle such scenarios?
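Spark has no built-in request-scoped result cache, but a minimal sketch of the UUID scheme on top of Dataset.persist might look like the following; runAggregation, the cache map, the sort column "id", and the page arithmetic are assumptions for illustration (bounds checking omitted):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.storage.StorageLevel;
    import java.util.List;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Application-side registry of per-request results
    ConcurrentHashMap<String, Dataset<Row>> resultCache = new ConcurrentHashMap<>();

    // First request: run the expensive aggregation once and pin the result
    Dataset<Row> aggregated = runAggregation(); // hypothetical helper
    aggregated.persist(StorageLevel.MEMORY_AND_DISK());
    String requestId = UUID.randomUUID().toString();
    resultCache.put(requestId, aggregated);

    // Subsequent page requests: slice the cached result instead of recomputing
    int pageSize = 100;
    int page = 2; // requested page number
    List<Row> rows = resultCache.get(requestId)
        .orderBy("id")               // a stable sort key is assumed
        .limit(pageSize * page)
        .collectAsList();
    List<Row> pageRows = rows.subList(pageSize * (page - 1), rows.size());

    // On eviction: resultCache.remove(requestId).unpersist();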
spark sql in-clause problem
I'm reading from this table in Cassandra:

    Table mytable (
        Integer Key,
        Integer X,
        Integer Y
    )

Using:

    sparkSqlContext.sql("select * from mytable where Key = 1 and (X, Y) in ((1,2),(3,4))")

I encountered this error:

    StructType(StructField(X,IntegerType,true), StructField(Y,IntegerType,true))
    != StructType(StructField(X,IntegerType,false), StructField(Y,IntegerType,false))
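The mismatch is between the nullable table columns and the non-nullable literal tuples in the IN list. One possible workaround, sketched below with the pair values from the question, is to expand the pair list into explicit AND/OR predicates, which is logically equivalent to the multi-column IN and sidesteps the struct comparison entirely:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import java.util.ArrayList;
    import java.util.List;

    int[][] pairs = { {1, 2}, {3, 4} };
    List<String> clauses = new ArrayList<>();
    for (int[] p : pairs) {
        clauses.add("(X = " + p[0] + " and Y = " + p[1] + ")");
    }
    String sql = "select * from mytable where Key = 1 and ("
        + String.join(" or ", clauses) + ")";
    Dataset<Row> result = sparkSqlContext.sql(sql); // sparkSqlContext as in the question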
Scala's Seq :_* equivalent in Java
I could not find how to pass a list to the isin() filter in Java. Something like this can be done in Scala:

    val ids = Array(1, 2)
    df.filter(df("id").isin(ids: _*)).show

But in Java, everything I tried for converting a Java list to a Scala Seq fails with an "unsupported literal type" exception:

    JavaConversions.asScalaBuffer(list).toSeq()
    JavaConverters.asScalaIteratorConverter(list.iterator()).asScala().toSeq().seq()
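For what it's worth, a sketch of what should work on the Java side without any Scala Seq conversion: Column.isin is declared with an Object... varargs parameter, so the values can be passed directly, or a list can be spread via toArray() (df and the "id" column are as in the question):

    import java.util.Arrays;
    import java.util.List;

    List<Integer> ids = Arrays.asList(1, 2);

    // Either pass the values directly as varargs...
    df.filter(df.col("id").isin(1, 2)).show();

    // ...or spread the list into an Object[] that fills the varargs parameter
    df.filter(df.col("id").isin(ids.toArray())).show();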
spark sql StackOverflow
Hi,

I need to run some queries on a huge number of input records. The input rate is 100K records per second. A record is like (key1, key2, value), and the application should report occurrences of "key1 = something and key2 = somethingElse". The problem is that there are too many filters in my query: more than three thousand (key1, key2) pairs have to be filtered.

I was simply putting 1 million records at a time in a temp table and running a SQL query on it with Spark-SQL:

    select * from mytemptable where
        (key1 = something and key2 = somethingElse)
        or (key1 = someOtherThing and key2 = someAnotherThing)
        or ... (3 thousand ORs!)

and I hit a StackOverflowError at ATNConfigSet.java line 178.

So I have two options, IMHO (see the sketch of option 1 below):

1. Either put all the (key1, key2) filter pairs in another temp table and do a join between the two temp tables,
2. or use Spark Streaming, which I'm not familiar with, and I don't know whether it could handle 3K filters.

Which way do you suggest? What is the best solution for my problem performance-wise? Thanks in advance.
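A minimal sketch of option 1, assuming the records are already registered as a temp view named mytemptable with columns key1, key2, value, and that the pair list fits in a small DataFrame: register the pairs as their own view and replace the giant OR chain with an inner join, which the optimizer can plan (often as a broadcast join) without the SQL parser ever seeing three thousand predicates:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import java.util.Arrays;
    import java.util.List;

    SparkSession spark = SparkSession.builder().appName("FilterJoin").getOrCreate();

    // The (key1, key2) pairs to match; two placeholders stand in for the 3K real ones
    List<Row> pairs = Arrays.asList(RowFactory.create(1, 10), RowFactory.create(2, 20));
    StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("key1", DataTypes.IntegerType, false),
        DataTypes.createStructField("key2", DataTypes.IntegerType, false)));
    spark.createDataFrame(pairs, schema).createOrReplaceTempView("filterpairs");

    // Inner join keeps only records whose (key1, key2) appears in the pair table
    Dataset<Row> matches = spark.sql(
        "select t.* from mytemptable t join filterpairs f on t.key1 = f.key1 and t.key2 = f.key2");
    matches.groupBy("key1", "key2").count().show();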