Re: Is there any tool that provides per-task monitoring to figure out task skew in Spark streaming?

2015-09-28 Thread Tathagata Das
You can see the task details in the Spark UI to see how many bytes each of the tasks in the skewed stages processes. That would be a good place to start. On Mon, Sep 28, 2015 at 7:59 PM, 이기석 wrote: > Hi this is a graduate student studying Spark streaming for research >

A non-canonical use of the Spark computation model

2015-09-28 Thread Blarvomere
The closest use case to what I will describe, I believe, is the real-time ad serving that Yahoo is doing. I am looking into using Spark as a sub-second latency decision engine service that a user-facing application calls, maybe via the Livy REST server or spark-jobserver. Instead of Terabytes of

Re: SparkContext._active_spark_context returns None

2015-09-28 Thread YiZhi Liu
Hi Ted, Thank you for the reply. The sc works at the driver, but how can I reach the JVM in rdd.map? 2015-09-29 11:26 GMT+08:00 Ted Yu : sc._jvm.java.lang.Integer.valueOf("12") > 12 > > FYI > > On Mon, Sep 28, 2015 at 8:08 PM, YiZhi Liu wrote: >> >> Hi,

Re: Is MLBase dead?

2015-09-28 Thread Justin Pihony
To take a stab at my own answer: MLBase is now fully integrated into MLLib. MLI/MLLib are the mllib algorithms and MLO is the ml pipelines? On Mon, Sep 28, 2015 at 10:19 PM, Justin Pihony wrote: > As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and >

Re: spark-submit classloader issue...

2015-09-28 Thread Aniket Bhatnagar
Hi Rachna, Can you just use the http client provided via Spark's transitive dependencies instead of excluding them? The reason user-classpath-first is failing could be because you have Spark artifacts in your assembly jar that don't match what is deployed (version mismatch or you built the version

Re: flatmap() and spark performance

2015-09-28 Thread Hemant Bhanawat
You can use spark.executor.memory to specify the memory of the executors, which will hold these intermediate results. You may want to look at the section "Understanding Memory Management in Spark" of this link:

RE: nested collection object query

2015-09-28 Thread Tridib Samanta
Thanks for your response, Yong! Array syntax works fine. But I am not sure how to use explode. Should I use it as follows? select id from department where explode(employee).name = 'employee0 This query gives me java.lang.UnsupportedOperationException. I am using HiveContext. From:

Re: SparkContext._active_spark_context returns None

2015-09-28 Thread Ted Yu
>>> sc._jvm.java.lang.Integer.valueOf("12") 12 FYI On Mon, Sep 28, 2015 at 8:08 PM, YiZhi Liu wrote: > Hi, > > I'm doing some data processing on pyspark, but I failed to reach JVM > in workers. Here is what I did: > > $ bin/pyspark > >>> data = sc.parallelize(["123",

Re: Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTttmiYDqGc202 And: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources > On Sep 28, 2015, at 8:22 PM, Jerry Lam wrote: > > Hi spark users and developers, > > I'm trying to learn how to implement a

Re: Setting executors per worker - Standalone

2015-09-28 Thread Jeff Zhang
Use "--executor-cores 1" and you will get 4 executors per worker, since you have 4 cores per worker. On Tue, Sep 29, 2015 at 8:24 AM, James Pirz wrote: > Hi, > > I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes while > each machine has 12GB of RAM and 4 cores.
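A minimal Scala sketch of the configuration Jeff describes, assuming a standalone cluster whose workers each offer 4 cores; the application name, master URL, and memory/core values are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap each executor at 1 core so a 4-core worker can host up to 4 executors
    // for this application (standalone mode).
    val conf = new SparkConf()
      .setAppName("multi-executor-sketch")            // placeholder app name
      .setMaster("spark://master-host:7077")          // placeholder master URL
      .set("spark.executor.cores", "1")               // cores per executor
      .set("spark.cores.max", "40")                   // total cores the app may claim
      .set("spark.executor.memory", "2g")             // memory per executor
    val sc = new SparkContext(conf)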

Re: Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Thanks for your reply. Setting it as --conf spark.executor.cores=1 when I start spark-shell (as an example application) indeed sets the number of cores per executor to 1 (which was 4 before), but I still have 1 executor per worker. What I am really looking for is having 1 worker with 4 executor

Fwd: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-28 Thread Fernando Paladini
Hello guys, I'm very new to Spark and I'm having some trouble reading a JSON into a DataFrame on PySpark. I'm getting a JSON object from an API response and I would like to store it in Spark as a DataFrame (I've read that DataFrame is better than RDD, is that accurate?). From what I've read

Re: GroupBy Java objects in Java Spark

2015-09-28 Thread Peter Bollerman
Hi, You will want to make sure your key JavaObject implements equals() and hashCode() appropriately. Otherwise you may not get the groupings you expect Peter Bollerman Principal Software Engineer The Allant Group, Inc. 630-718-3830 On Thu, Sep 24, 2015 at 5:27 AM, Sabarish Sasidharan <
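A small Scala sketch of the point above (the original thread is Java, so this is only illustrative): keys used in a grouping must define value-based equality, which case classes provide automatically, while a hand-written key class must override equals and hashCode consistently.

    // Case class key: equals() and hashCode() are generated from the fields,
    // so two instances with the same values fall into the same group.
    case class UserKey(userId: String, region: String)

    // Hand-written key class: both methods must be overridden together,
    // otherwise groupByKey/reduceByKey treat equal-looking keys as distinct.
    class ManualKey(val userId: String, val region: String) {
      override def equals(other: Any): Boolean = other match {
        case that: ManualKey => userId == that.userId && region == that.region
        case _               => false
      }
      override def hashCode(): Int = (userId, region).hashCode()
    }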

Spark streaming job filling a lot of data in local spark nodes

2015-09-28 Thread swetha
Hi, I see a lot of data getting filled locally, as shown below, from my streaming job. I have my checkpoint set to hdfs. But, I still see the following data filling my local nodes. Any idea if I can have this stored in hdfs instead of storing the data locally? -rw-r--r-- 1520 Sep 17

RE: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-28 Thread Dominic Ricard
Thanks Cheng. We use the Thrift Server to connect to Spark SQL from a JDBC client. I finally found the solution. My issue was coming from my query, as I reference the column as a STRUCT instead of a MAP. Here's the fix: Original Query with issue: SELECT COUNT(*) FROM table WHERE

Re: CassandraSQLContext throwing NullPointer Exception

2015-09-28 Thread Priya Ch
Ted, I am using spark 1.3.0 and running the code in YARN mode. Here is the code.. object streaming { def main(args:Array[String]) { val conf = new SparkConf() conf.setMaster("yarn-client") conf.setAppName("SimpleApp") conf.set("spark.cassandra.coonection.host","") val sc = new

Re: SQL queries in Spark / YARN

2015-09-28 Thread Mark Hamstra
Yes. On Mon, Sep 28, 2015 at 12:46 PM, Robert Grandl wrote: > Hi guys, > > I was wondering if it's possible to submit SQL queries to Spark SQL, when > Spark is running atop YARN instead of standalone mode. > > Thanks, > Robert >

Re: SQL queries in Spark / YARN

2015-09-28 Thread Kartik Mathur
Hey Robert, you could use Zeppelin instead if you don't want to use beeline. On Monday, September 28, 2015, Robert Grandl wrote: > Thanks Mark. Do you know how ? In Spark standalone mode I use beeline to > submit SQL scripts. > > In Spark/YARN, the only way I can see

Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Krishna Sankar
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works. Cheers On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu wrote: > Krishna: > If you want to query ORC files, see the following JIRA: > > [SPARK-10623] [SQL] Fixes ORC predicate push-down > > which is in the

SQL queries in Spark / YARN

2015-09-28 Thread Robert Grandl
Hi guys, I was wondering if it's possible to submit SQL queries to Spark SQL when Spark is running atop YARN instead of standalone mode. Thanks, Robert

UnknownHostException with Mesos and custom Jar

2015-09-28 Thread Stephen Hankinson
erEndpoint: Registering block manager ip-172-31-4-4.us-west-2.compute.internal:50648 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S2, ip-172-31-4-4.us-west-2.compute.internal, 50648) 15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-37-82

Adding / Removing worker nodes for Spark Streaming

2015-09-28 Thread Augustus Hong
Hey all, I'm evaluating using Spark Streaming with Kafka direct streaming, and I have a couple of questions: 1. Would it be possible to add / remove worker nodes without stopping and restarting the spark streaming driver? 2. I understand that we can enable checkpointing to recover from node

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-28 Thread Kartik Mathur
Hey Rick, not sure about this, but a similar situation happened with me: when starting spark-shell it was starting a new cluster instead of using the existing cluster, and this new cluster was a single-node cluster. That's why jobs were taking forever to complete from spark-shell and were running much

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-28 Thread Kartik Mathur
OK, that might be possible. To confirm, you can explicitly specify the serializer in both cases (by setting spark.serializer, I guess). Then you can be sure that the same serializers are used and maybe then do an analysis. Best, Kartik On Mon, Sep 28, 2015 at 11:38 AM, Rick Moritz

Re: Python script runs fine in local mode, errors in other modes

2015-09-28 Thread Aaron
Hi! Yeah, the problem was inconsistent versions of python across the cluster. I was launching from a node with Python 2.7.9, but the job ran on nodes with 2.6.6. From: "devmacrile [via Apache Spark User List]"

Re: About memory leak in spark 1.4.1

2015-09-28 Thread Jon Chase
I'm seeing a similar (same?) problem on Spark 1.4.1 running on Yarn (Amazon EMR, Java 8). I'm running a Spark Streaming app 24/7 and system memory eventually gets exhausted after about 3 days and the JVM process dies with: # # There is insufficient memory for the Java Runtime Environment to

Re: how to handle OOMError from groupByKey

2015-09-28 Thread Alexis Gillain
"Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]]." Obvioulsy one of your key value pair is two large. You can try to increase spark.shuffle.memoryFraction. Are

Performance when iterating over many parquet files

2015-09-28 Thread jwthomas
We are working with use cases where we need to do batch processing on a large number (hundreds of thousands) of Parquet files. The processing is quite similar per file. There are many aggregates that are very SQL-friendly (computing averages, maxima, minima, aggregations on single columns with

java.lang.ClassCastException (org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task)

2015-09-28 Thread amitra123
Hello All, I am trying to write a very simple Spark Streaming example and I'm getting this exception. I am new to Spark and I am not quite sure why this exception is thrown. Wondering if someone has any clues. Here is the backtrace. I am running this on Spark 1.5.0.

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
Another note: for best performance you are going to want your parquet files to be pretty big (100s of mb). You could coalesce them and write them out for more efficient repeat querying. On Mon, Sep 28, 2015 at 2:00 PM, Michael Armbrust wrote: > sqlContext.read.parquet >
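A rough sketch of the coalesce-and-rewrite step Michael suggests; the paths and the partition count are placeholders:

    // Read the many small Parquet files, shrink the number of partitions,
    // and write them back out as fewer, larger files for cheaper repeat querying.
    val df = sqlContext.read.parquet("/data/small-files")   // placeholder input path
    df.coalesce(16)                                          // partition count chosen for illustration
      .write
      .parquet("/data/compacted")                            // placeholder output path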

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-28 Thread Utkarsh Sengar
Hi Tim, 1. spark.mesos.coarse:false (fine grain mode) This is the data dump for config and executors assigned: https://gist.github.com/utkarsh2012/6401d5526feccab14687 2. spark.mesos.coarse:true (coarse grain mode) Dump for coarse mode: https://gist.github.com/utkarsh2012/918cf6f8ed5945627188

Re: Adding / Removing worker nodes for Spark Streaming

2015-09-28 Thread Sourabh Chandak
I also have the same use case as Augustus, and have some basic questions about recovery from checkpoint. I have a 10 node Kafka cluster and a 30 node Spark cluster running streaming job, how is the (topic, partition) data handled in checkpointing. The scenario I want to understand is, in case of

Reading kafka stream and writing to hdfs

2015-09-28 Thread Chengi Liu
Hi, I am going thru this example here: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py If I want to write this data to hdfs, what's the right way to do this? Thanks

Re: Adding / Removing worker nodes for Spark Streaming

2015-09-28 Thread Cody Koeninger
If a node fails, the partition / offset range that it was working on will be scheduled to run on another node. This is generally true of spark, regardless of checkpointing. The offset ranges for a given batch are stored in the checkpoint for that batch. That's relevant if your entire job fails

Re: java.lang.ClassCastException (org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task)

2015-09-28 Thread Ted Yu
Please see SPARK-8142 On Mon, Sep 28, 2015 at 1:45 PM, amitra123 wrote: > Hello All, > > I am trying to write a very simple Spark Streaming example and I'm > getting this exception. I am new to Spark and I am not quite sure why this > exception is thrown. Wondering

Re: HDFS small file generation problem

2015-09-28 Thread Jörn Franke
Use hadoop archive On Sun, Sep 27, 2015 at 15:36, wrote: > Hello, > I'm still investigating my small file generation problem generated by my > Spark Streaming jobs. > Indeed, my Spark Streaming jobs are receiving a lot of small events (avg > 10kb), and I have to store them

Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Ashish Soni
I am not running it using spark-submit, I am running locally inside the Eclipse IDE. How do I set this using Java code? Ashish On Mon, Sep 28, 2015 at 10:42 AM, Adrian Tanase wrote: > You also need to provide it as parameter to spark submit > >

Spark REST Job server feedback?

2015-09-28 Thread Ramirez Quetzal
Anyone has feedback on using Hue / Spark Job Server REST servers? http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/ https://github.com/spark-jobserver/spark-jobserver Many thanks, Rami

Re: Notification on Spark Streaming job failure

2015-09-28 Thread Chen Song
I am also interested specifically in monitoring and alerting on Spark streaming jobs. It would be helpful to get some general guidelines or advice from people who have implemented anything in this area. On Fri, Sep 18, 2015 at 2:35 AM, Krzysztof Zarzycki wrote: > Hi there

Re: Update cassandra rows problem

2015-09-28 Thread Ted Yu
Please consider posting on DataStax's mailing list for questions w.r.t. the spark cassandra connector On Mon, Sep 28, 2015 at 6:59 AM, amine_901 wrote: > Hello all, > i'm using spark 1.2 with spark cassandra connector 1.2.3, > i'm trying to update some rows of a table:

Interactively search Parquet-stored data using Spark Streaming and DataFrames

2015-09-28 Thread Նարեկ Գալստեան
I have a significant amount of data stored on my Hadoop HDFS as Parquet files. I am using Spark streaming to interactively receive queries from a web server and transform the received queries into SQL to run on my data using SparkSQL. In this process I need to run several SQL queries and then return

Re: Weird worker usage

2015-09-28 Thread Bryan Jeffrey
Nukunj, No, I'm not calling set w/ master at all. This ended up being a foolish configuration problem with my slaves file. Regards, Bryan Jeffrey On Fri, Sep 25, 2015 at 11:20 PM, N B wrote: > Bryan, > > By any chance, are you calling SparkConf.setMaster("local[*]")

Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Adrian Tanase
You also need to provide it as parameter to spark submit http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver From: Ashish Soni Date: Monday, September 28, 2015 at 5:18 PM To: user Subject: Spark Streaming Log4j Inside Eclipse I need to turn off the

Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Shixiong Zhu
You can use JavaSparkContext.setLogLevel to set the log level in your codes. Best Regards, Shixiong Zhu 2015-09-28 22:55 GMT+08:00 Ashish Soni : > I am not running it using spark submit , i am running locally inside > Eclipse IDE , how i set this using JAVA Code > >
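For reference, a minimal sketch of the call Shixiong mentions (available since Spark 1.4), shown in Scala although the original thread uses the Java API:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("log-level-sketch").setMaster("local[*]"))
    sc.setLogLevel("WARN")   // suppress INFO/DEBUG output; other valid levels include ERROR and INFO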

Re: how to handle OOMError from groupByKey

2015-09-28 Thread Fabien Martin
You can try to reduce the number of containers in order to increase their memory. 2015-09-28 9:35 GMT+02:00 Akhil Das : > You can try to increase the number of partitions to get rid of the OOM > errors. Also try to use reduceByKey instead of groupByKey. > > Thanks >

Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Ashish Soni
Hi All, I need to turn off the verbose logging of Spark Streaming code when I am running inside Eclipse. I tried creating a log4j.properties file and placed it inside /src/main/resources, but I do not see it having any effect. Please help, as I am not sure what else needs to be done to change the log at

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-28 Thread Rick Moritz
I've finally been able to pick this up again, after upgrading to Spark 1.4.1, because my code used the HiveContext, which runs fine in the REPL (be it via Zeppelin or the shell) but won't work with spark-submit. With 1.4.1, I have actually managed to get a result with the Spark shell, but after

ML Pipeline

2015-09-28 Thread Yasemin Kaya
Hi, I am using Spark 1.5 and ML Pipeline. I create the model, then give the model unlabeled data to find the probabilities and predictions. When I want to see the results, it returns an error. //creating model final PipelineModel model = pipeline.fit(trainingData); JavaRDD rowRDD1 = unlabeledTest

laziness in textFile reading from HDFS?

2015-09-28 Thread davidkl
Hello, I need to process a significant amount of data every day, about 4TB. This will be processed in batches of about 140GB. The cluster this will be running on doesn't have enough memory to hold the dataset at once, so I am trying to understand how this works internally. When using textFile to

Re: HDFS is undefined

2015-09-28 Thread Akhil Das
For some reason Spark isn't picking up your hadoop confs. Did you download Spark compiled with the hadoop version that you have in the cluster? Thanks Best Regards On Fri, Sep 25, 2015 at 7:43 PM, Angel Angel wrote: > hello, > I am running the spark application. >

Re: how to handle OOMError from groupByKey

2015-09-28 Thread Akhil Das
You can try to increase the number of partitions to get rid of the OOM errors. Also try to use reduceByKey instead of groupByKey. Thanks Best Regards On Sat, Sep 26, 2015 at 1:05 AM, Elango Cheran wrote: > Hi everyone, > I have an RDD of the format (user: String,
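A small sketch of the substitution Akhil suggests, assuming the goal is a per-key aggregate (here a sum); reduceByKey combines values on each partition before the shuffle, so no single key's values have to be held in memory at once:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey shuffles every value and must hold all values for a key in memory:
    val summedViaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey pre-aggregates map-side before shuffling:
    val summedViaReduce = pairs.reduceByKey(_ + _)

    // Increasing the number of partitions is the other knob mentioned above:
    val repartitioned = pairs.repartition(200).reduceByKey(_ + _)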

"recommendProductsForUsers" makes worker node crash

2015-09-28 Thread wanbo
I have two workers to run the recommendation job. After Spark v1.4.0, I want to try the method "recommendProductsForUsers". This method makes my worker nodes crash and time out on connections. If I don't add new worker nodes, what should I do? -- View this message in context:

Re: HDFS is undefined

2015-09-28 Thread Ted Yu
Please post the question on vendor's forum. > On Sep 25, 2015, at 7:13 AM, Angel Angel wrote: > > hello, > I am running the spark application. > > I have installed the cloudera manager. > it includes the spark version 1.2.0 > > > But now i want to use spark version

Master getting down with Memory issue.

2015-09-28 Thread Saurav Sinha
Hi Spark Users, I am running some spark jobs every hour. After running for 12 hours, the master is getting killed with the exception *java.lang.OutOfMemoryError: GC overhead limit exceeded* It looks like there is some memory issue in the spark master. The Spark master is a blocker for us. Any one

Re: Master getting down with Memory issue.

2015-09-28 Thread Akhil Das
Depends on the data volume that you are operating on. Thanks Best Regards On Mon, Sep 28, 2015 at 5:12 PM, Saurav Sinha wrote: > Hi Akhil, > > My job is creating 47 stages in one cycle and it is running every hour. > Can you please suggest me what is optimum numbers of

log4j Spark-worker performance problem

2015-09-28 Thread vaibhavrtk
Hello, We need a lot of logging for our application: about 1000 lines need to be logged per message we process, and we process 1000 msgs/sec, so the total is 1000*1000 lines logged per second. As it is going to be written to a file, will writing so many logs impact the processing power of

Re: Spark 1.5.0 Not able to submit jobs using cluster URL

2015-09-28 Thread Akhil Das
Update the dependency version in your job's build file. Also make sure you have updated the spark version to 1.5.0 everywhere (in the cluster and code). Thanks Best Regards On Mon, Sep 28, 2015 at 11:29 AM, lokeshkumar wrote: > Hi forum > > I have just upgraded spark from 1.4.0

Re: using multiple dstreams together (spark streaming)

2015-09-28 Thread Archit Thakur
@TD: Doesn't transformWith need both of the DStreams to be of same slideDuration. [Spark Version: 1.3.1] -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-multiple-dstreams-together-spark-streaming-tp9947p24839.html Sent from the Apache Spark User List

Re: Master getting down with Memory issue.

2015-09-28 Thread Saurav Sinha
Hi Akhil, Can you please explain to me how increasing the number of partitions (which is a worker-node thing) will help, as the issue is that my master is getting OOM. Thanks, Saurav Sinha On Mon, Sep 28, 2015 at 2:32 PM, Akhil Das wrote: > This behavior totally depends

Re: Master getting down with Memory issue.

2015-09-28 Thread Akhil Das
This behavior totally depends on the job that you are doing. Usually increasing the # of partitions will sort out this issue. It would be good if you can paste the code snippet or explain what type of operations that you are doing. Thanks Best Regards On Mon, Sep 28, 2015 at 11:37 AM, Saurav

Re: Spark 1.5.0 Not able to submit jobs using cluster URL

2015-09-28 Thread Akhil Das
Well, for some reason your test is picking up the older jar then. The best way to sort it out would be to create a build file for your project and add the dependencies in the build file rather than manually putting in the jars. Thanks Best Regards On Mon, Sep 28, 2015 at 2:44 PM, Lokesh Kumar

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
I guess you're probably using Spark 1.5? Spark SQL does support schema merging, but we disabled it by default since 1.5 because it introduces extra performance costs (it's turned on by default in 1.4 and 1.3). You may enable schema merging via either the Parquet data source specific option
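A short sketch of the two ways to re-enable schema merging that Cheng describes, with a placeholder path:

    // Per read, via the Parquet data source option:
    val merged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/data/parquet-with-evolving-schema")   // placeholder path

    // Or globally, via the SQL configuration:
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")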

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Ah, yes, I see that it has been turned off now, that’s why it wasn’t working. Thank you, this is helpful! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? From: Cheng Lian

Get variable into Spark's foreachRDD function

2015-09-28 Thread markluk
I have a streaming Spark process and I need to do some logging in the `foreachRDD` function, but I'm having trouble accessing the logger as a variable in the `foreachRDD` function. I would like to do the following: import logging myLogger = logging.getLogger(LOGGER_NAME) ... ...

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Sure. I would just like to remove ones that fail the basic checks done by the Parquet readFooters function, in that their length is wrong or magic number is incorrect, which throws exceptions in the read method. Errors like: java.io.IOException: Could not read footer:

RE: java.lang.ClassCastException (org.apache.spark.scheduler.ResultTask cannot be cast to org.apache.spark.scheduler.Task)

2015-09-28 Thread Amit Yadav
Thank you Ted. I followed SPARK-8142 and removed the spark-core and hadoop jars from my application uber jar. I am able to get past through this error now. Amit. Date: Mon, 28 Sep 2015 14:11:18 -0700 Subject: Re: java.lang.ClassCastException (org.apache.spark.scheduler.ResultTask cannot be
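A hedged sbt sketch of the fix Amit describes: mark the Spark (and Hadoop) artifacts as provided so they never end up in the application uber jar; the version numbers are placeholders taken from the thread:

    // build.sbt (sketch): Spark and Hadoop are supplied by the cluster at runtime,
    // so keep them out of the assembly with the "provided" scope.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.5.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.5.0" % "provided",
      "org.apache.hadoop" %  "hadoop-client"  % "2.6.0" % "provided"
    )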

Monitoring tools for spark streaming

2015-09-28 Thread Siva
Hi, Could someone recommend monitoring tools for spark streaming? By extending StreamingListener we can dump the delay in processing of batches and some alert messages. But are there any Web UI tools where we can monitor failures, see delays in processing, see error messages and set up alerts

Does YARN start new executor in place of the failed one?

2015-09-28 Thread Alexander Pivovarov
Hello Everyone, I use Spark on YARN on EMR-4. The spark program which I run has several jobs/stages and runs for about 10 hours. During the execution some executors might fail for some reason. BUT I do not see that new executors are started in place of the failed ones. So, what I see in spark UI is

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging - http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration Cheng On 9/28/15 3:54 PM, Cheng Lian wrote: I guess you're probably using

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Could you please elaborate on what kind of errors are those bad Parquet files causing? In what ways are they miswritten? Cheng On 9/28/15 4:03 PM, jordan.tho...@accenture.com wrote: Ah, yes, I see that it has been turned off now, that’s why it wasn’t working. Thank you, this is helpful!

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Probably parquet-tools and the following shell script help: root="/path/to/your/data"; for f in `find $root -type f -name "*.parquet"`; do parquet-schema $f > /dev/null 2>&1; if [ $? -ne 0 ]; then echo $f; fi; done This should print out all non-Parquet files under $root. Please refer to this

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Dear Michael, Thank you very much for your help. I should have mentioned in my original email, I did try the sequence notation. It doesn’t seem to have the desired effect. Maybe I should say that each one of these files has a different schema. When I use that call, I’m not ending up with a

nested collection object query

2015-09-28 Thread tridib
Hi Friends, What is the right syntax to query on a collection of nested objects? I have the following schema and SQL, but it does not return anything. Is the syntax correct? root |-- id: string (nullable = false) |-- employee: array (nullable = false) | |-- element: struct (containsNull = true)

RE: Performance when iterating over many parquet files

2015-09-28 Thread jordan.thomas
Ok thanks. Actually we ran something very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few

Re: SQL queries in Spark / YARN

2015-09-28 Thread Robert Grandl
Thanks Mark. Do you know how? In Spark standalone mode I use beeline to submit SQL scripts. In Spark/YARN, the only way I can see this working is using spark-submit. However, as it looks, I need to encapsulate the SQL queries in a Scala file. Do you have other suggestions? Thanks, Robert

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Oh I see, then probably this one, basically the parallel Spark version of my last script, using ParquetFileReader: import org.apache.parquet.hadoop.ParquetFileReader import org.apache.parquet.format.converter.ParquetMetadataConverter val badFiles = sc.parallelize(paths).mapPartitions {
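Cheng's snippet is cut off above; the following is only a guess at how such a parallel footer check might be completed, not his original code. It flags any path whose footer cannot be read:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.util.Try

    val paths: Seq[String] = Seq("/data/a.parquet", "/data/b.parquet")  // placeholder file list

    val badFiles = sc.parallelize(paths).mapPartitions { iter =>
      val conf = new Configuration()  // built per partition; Configuration is not serializable
      iter.filter { p =>
        // A file is "bad" if reading its footer throws (truncated file, wrong magic number, ...).
        Try(ParquetFileReader.readFooter(conf, new Path(p), ParquetMetadataConverter.NO_FILTER)).isFailure
      }
    }.collect()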

Re: Monitoring tools for spark streaming

2015-09-28 Thread Shixiong Zhu
Which version are you using? Could you take a look at the new Streaming UI in 1.4.0? Best Regards, Shixiong Zhu 2015-09-29 7:52 GMT+08:00 Siva : > Hi, > > Could someone recommend the monitoring tools for spark streaming? > > By extending StreamingListener we can dump the

spark-submit classloader issue...

2015-09-28 Thread Rachana Srivastava
Hello all, Goal: I want to use APIs from the HttpClient library 4.4.1. I am using the maven shade plugin to generate the JAR. Findings: When I run my program as a java application within Eclipse everything works fine. But when I am running the program using spark-submit I am getting the following

Re: Spark streaming job filling a lot of data in local spark nodes

2015-09-28 Thread Shixiong Zhu
These files are created by shuffle and just some temp files. They are not necessary for checkpointing and only stored in your local temp directory. They will be stored in "/tmp" by default. You can use `spark.local.dir` to set the path if you find your "/tmp" doesn't have enough space. Best
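A small sketch of the setting Shixiong mentions, with a placeholder directory:

    import org.apache.spark.SparkConf

    // Point shuffle/spill scratch space at a volume with enough room;
    // several directories can be comma-separated to spread I/O across disks.
    val conf = new SparkConf()
      .setAppName("local-dir-sketch")                  // placeholder app name
      .set("spark.local.dir", "/mnt/spark-scratch")    // placeholder path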

Is MLBase dead?

2015-09-28 Thread Justin Pihony
As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and org.apache.spark.ml? I cannot find anything official, and the last updates seem to be a year or two old. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-MLBase-dead-tp24854.html Sent

Re: Monitoring tools for spark streaming

2015-09-28 Thread Hari Shreedharan
+1. The Streaming UI should give you more than enough information. Thanks, Hari On Mon, Sep 28, 2015 at 9:55 PM, Shixiong Zhu wrote: > Which version are you using? Could you take a look at the new Streaming UI > in 1.4.0? > Best Regards, > Shixiong Zhu > 2015-09-29 7:52

Re: spark.streaming.concurrentJobs

2015-09-28 Thread Shixiong Zhu
"writing to HBase for each partition in the RDDs from a given DStream is an independent output operation" This is not correct. "writing to HBase for each partition in the RDDs from a given DStream" is just a task. And they already run in parallel. The output operation is the DStream action, such

Re: Get variable into Spark's foreachRDD function

2015-09-28 Thread Ankur Srivastava
Hi, You are creating a logger instance on the driver and then trying to use that instance in a transformation function which is executed on the executor. You should create the logger instance in the transformation function itself, but then the logs will go to separate files on each worker node. Hope

Merging two avro RDD/DataFrames

2015-09-28 Thread TEST ONE
I have a daily update of modified users (~100s) output as avro from ETL. I’d need to find and merge them with the existing corresponding members in a master avro file (~100,000s). The merge operation involves merging a ‘profiles’ Map between the matching records. What would be the

Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Hi, I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes while each machine has 12GB of RAM and 4 cores. On each machine I have one worker which is running one executor that grabs all 4 cores. I am interested in checking the performance with "one worker but 4 executors per machine -

RE: nested collection object query

2015-09-28 Thread java8964
Your employee is in fact an array of structs, not just a struct. If you are using HiveContext, then you can refer to it like the following: select id from member where employee[0].name = 'employee0' The employee[0] is pointing to the 1st element of the array. If you want to query all the elements in the
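A hedged sketch of the LATERAL VIEW form for querying every element of the array, wrapped in a HiveContext call; the table and column names come from the thread, but the exact query is an illustration rather than the original poster's code:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val matches = hiveContext.sql(
      """SELECT id
        |FROM department
        |LATERAL VIEW explode(employee) exploded AS emp
        |WHERE emp.name = 'employee0'""".stripMargin)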

CassandraSQLContext throwing NullPointer Exception

2015-09-28 Thread Priya Ch
Hi All, I am trying to use dataframes (which contain data from cassandra) in rdd.foreach. This is throwing the following exception. Is CassandraSQLContext accessible within an executor? 15/09/28 17:22:40 ERROR JobScheduler: Error running job streaming job 144344116 ms.0

Re: CassandraSQLContext throwing NullPointer Exception

2015-09-28 Thread Ted Yu
Which Spark release are you using ? Can you show the snippet of your code around CassandraSQLContext#sql() ? Thanks On Mon, Sep 28, 2015 at 6:21 AM, Priya Ch wrote: > Hi All, > > I am trying to use dataframes (which contain data from cassandra) in > rdd.foreach.

RE: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-28 Thread java8964
Hi, Lian: Thanks for the information. It works as expected in Spark with this setting. Yong Subject: Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive To: java8...@hotmail.com; user@spark.apache.org From: lian.cs@gmail.com

Re: Performance when iterating over many parquet files

2015-09-28 Thread Michael Armbrust
sqlContext.read.parquet takes lists of files. val fileList = sc.textFile("file_list.txt").collect() // this works but using spark is possibly overkill val dataFrame =
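A sketch that completes the idea in the truncated snippet above (the continuation is a guess, not Michael's original message): pass the collected list of paths to read.parquet as varargs.

    // Read the list of file paths (one per line) and hand them all to the Parquet reader.
    val fileList: Array[String] = sc.textFile("file_list.txt").collect()
    val dataFrame = sqlContext.read.parquet(fileList: _*)   // one DataFrame spanning all listed files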

Is there any tool that provides per-task monitoring to figure out task skew in Spark streaming?

2015-09-28 Thread 이기석
Hi, this is a graduate student studying Spark streaming for research purposes. I want to know whether there is task skew in my streaming application. But as far as I have found out, the Spark UI does not provide any useful information to figure this out. I found a related work from Spark Summit 2014:

SparkContext._active_spark_context returns None

2015-09-28 Thread YiZhi Liu
Hi, I'm doing some data processing on pyspark, but I failed to reach JVM in workers. Here is what I did: $ bin/pyspark >>> data = sc.parallelize(["123", "234"]) >>> numbers = data.map(lambda s: >>> SparkContext._active_spark_context._jvm.java.lang.Integer.valueOf(s.strip())) >>>

Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Jerry Lam
Hi spark users and developers, I'm trying to learn how to implement a custom data source for Spark SQL. Is there any documentation that I can use as a reference? I'm not sure exactly what needs to be extended/implemented. A general workflow would be greatly helpful! Best Regards, Jerry

Update cassandra rows problem

2015-09-28 Thread amine_901
Hello all, I'm using spark 1.2 with spark cassandra connector 1.2.3. I'm trying to update some rows of a table, for example: CREATE TABLE myTable ( a text, b text, c text, date timestamp, d text, e text static, f text static, PRIMARY KEY ((a, b, c), date, d) ) WITH