Re: Exception handling in Spark

2020-05-05 Thread Todd Nist
Could you do something like this prior to calling the action? // Create FileSystem object from Hadoop Configuration val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration) // This method returns a Boolean (true if the file exists, false if it doesn't) val fileExists = fs.exists(new
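A minimal sketch of that pattern, assuming a SparkSession named spark is in scope; the input path is hypothetical:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Create a FileSystem handle from the job's Hadoop configuration
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // exists() returns true if the path is present, false otherwise
    val fileExists = fs.exists(new Path("/data/input/events"))

    if (fileExists) {
      spark.read.parquet("/data/input/events").count()  // only run the action when the input is there
    } else {
      println("Input path missing; skipping this run.")
    }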

Re: Using P4J Plugins with Spark

2020-04-21 Thread Todd Nist
You may want to make sure you include the jar of P4J and your plugins as part of the following so that both the driver and executors have access. If HDFS is out then you could make a common mount point on each of the executor nodes so they have access to the classes. - spark-submit --jars

Re: spark.submit.deployMode: cluster

2019-03-29 Thread Todd Nist
A little late, but have you looked at https://livy.incubator.apache.org/, works well for us. -Todd On Thu, Mar 28, 2019 at 9:33 PM Jason Nerothin wrote: > Meant this one: https://docs.databricks.com/api/latest/jobs.html > > On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel wrote: > >> Thanks, are

Re: cache table vs. parquet table performance

2019-01-16 Thread Todd Nist
Hi Tomas, Have you considered using something like https://www.alluxio.org/ for your cache? Seems like a possible solution for what you're trying to do. -Todd On Tue, Jan 15, 2019 at 11:24 PM 大啊 wrote: > Hi ,Tomas. > Thanks for your question give me some prompt.But the best way use cache >

Re: Backpressure initial rate not working

2018-07-26 Thread Todd Nist
figuration=log4j-spark.properties" \ >--files "${JAAS_CONF},${KEYTAB}" \ >--class "${MAIN_CLASS}" \ >"${ARTIFACT_FILE}" > > > The first batch is huge, even if it worked for the first batch I would've > tried researching more. The

Re: Backpressure initial rate not working

2018-07-26 Thread Todd Nist
Hi Biplob, How many partitions are on the topic you are reading from, and have you set maxRatePerPartition? IIRC, Spark back pressure is calculated off of the following: • maxRatePerPartition=200 • batchInterval 30s • 3
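As an illustrative back-of-the-envelope calculation (assuming the figures above and 3 partitions on the topic), the cap on a single batch works out as:

    // Illustrative arithmetic only; the setting names mirror those in the thread
    val maxRatePerPartition = 200   // records/second/partition
    val partitions = 3
    val batchIntervalSec = 30
    val maxRecordsPerBatch = maxRatePerPartition * partitions * batchIntervalSec  // 18,000 records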

Re: Tableau BI on Spark SQL

2017-01-30 Thread Todd Nist
Hi Mich, You could look at http://www.exasol.com/. It works very well with Tableau without the need to extract the data. Also, in V6 it has virtual schemas, which would allow you to access data in Spark, Hive, Oracle, or other sources. It may be outside of what you are looking for, it works

Re: is there any bug for the configuration of spark 2.0 cassandra spark connector 2.0 and cassandra 3.0.8

2016-09-20 Thread Todd Nist
These types of questions would be better asked on the user mailing list for the Spark Cassandra connector: http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user Version compatibility can be found here:

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Todd Nist
Hi Mich, Have you looked at Apache Ignite? https://apacheignite-fs.readme.io/docs. This looks like something that may be what you're looking for: http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin HTH. -Todd On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh

Re: Creating HiveContext withing Spark streaming

2016-09-08 Thread Todd Nist
Hi Mich, Perhaps the issue is having multiple SparkContexts in the same JVM ( https://issues.apache.org/jira/browse/SPARK-2243). While it is possible, I don't think it is encouraged. As you know, the call you're currently invoking to create the StreamingContext also creates a SparkContext. /** *
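A small sketch of the usual way to avoid the second SparkContext: build the StreamingContext from an existing SparkContext rather than from a SparkConf (names here are illustrative, Spark 1.x API):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext(new SparkConf().setAppName("streaming-with-hive"))
    val ssc = new StreamingContext(sc, Seconds(10))   // reuses sc, no second SparkContext

    // A HiveContext can then be created from the same SparkContext:
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)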

Re: Design patterns involving Spark

2016-08-30 Thread Todd Nist
Have not tried this, but looks quite useful if one is using Druid: https://github.com/implydata/pivot - An interactive data exploration UI for Druid On Tue, Aug 30, 2016 at 4:10 AM, Alonso Isidoro Roman wrote: > Thanks Mitch, i will check it. > > Cheers > > > Alonso

Re: Writing to Hbase table from Spark

2016-08-30 Thread Todd Nist
Have you looked at spark-packages.org? There are several different HBase connectors there; not sure if any meet your need or not. https://spark-packages.org/?q=hbase HTH, -Todd On Tue, Aug 30, 2016 at 5:23 AM, ayan guha wrote: > You can use rdd level new hadoop format

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Todd Nist
This is due to a change in 1.6: by default the Thrift server runs in multi-session mode. You would want to set the following to true in your Spark config (spark-defaults.conf): spark.sql.hive.thriftServer.singleSession Good write up here:
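If you prefer to set it programmatically rather than in spark-defaults.conf, a sketch (Spark 1.6 era):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.sql.hive.thriftServer.singleSession", "true")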

Re: Load selected rows with sqlContext in the dataframe

2016-07-21 Thread Todd Nist
You can set the dbtable to this: .option("dbtable", "(select * from master_schema where 'TID' = '100_0')") HTH, Todd On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog wrote: > I have a table of size 5GB, and want to load selective rows into dataframe > instead of loading
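A fuller, hedged version of that read; the URL, credentials, and alias are placeholders (some databases require the subquery to be aliased):

    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // placeholder connection string
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", "(select * from master_schema where 'TID' = '100_0') as t")
      .load()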

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
quorum defined in > config, running in standalone mode > (org.apache.zookeeper.server.quorum.QuorumPeerMain) > > Any indication onto why the channel connection might be closed? Would it > be Kafka or Zookeeper related? > > On 07 Jun 2016, at 14:07, Todd Nist <tsind...@gmail.c

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
What version of Spark are you using? I do not believe that 1.6.x is compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 and 0.9.0.x. See this for more information: https://issues.apache.org/jira/browse/SPARK-12177 -Todd On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread Todd Nist
Perhaps these may be of some use: https://github.com/mkuthan/example-spark http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ https://github.com/holdenk/spark-testing-base On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy wrote: > Hi Lars, > > Do you have

Re: Spark SQL Transaction

2016-04-23 Thread Todd Nist
I believe the class you are looking for is org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala. By default in savePartition(...) , it will do the following: if (supportsTransactions) { conn.setAutoCommit(false) // Everything in the same db transaction. } Then at line 224, it will

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Todd Nist
Have you looked at these: http://allegro.tech/2015/08/spark-kafka-integration.html http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/ Full example here: https://github.com/mkuthan/example-spark-kafka HTH. -Todd On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego

Re: How to change akka.remote.startup-timeout in spark

2016-04-21 Thread Todd Nist
I believe you can adjust it by setting the following: spark.akka.timeout 100s Communication timeout between Spark nodes. HTH. -Todd On Thu, Apr 21, 2016 at 9:49 AM, yuemeng (A) wrote: > When I run a spark application,sometimes I get follow ERROR: > > 16/04/21 09:26:45

Re: Apache Flink

2016-04-17 Thread Todd Nist
So there is an offering from Stratio, https://github.com/Stratio/Decision Decision CEP engine is a Complex Event Processing platform built on Spark > Streaming. > > It is the result of combining the power of Spark Streaming as a continuous > computing framework and Siddhi CEP engine as complex

Re: "bootstrapping" DStream state

2016-03-10 Thread Todd Nist
The updateStateByKey can be supplied an initialRDD to populate it with. Per code ( https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L435-L445 ). Provided here for your convenience. /** * Return a new "state"
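A short sketch of seeding the state, assuming pairDStream is an existing DStream[(String, Int)] of counts; the initial pairs and partitioner choice are illustrative:

    import org.apache.spark.HashPartitioner

    val initialRDD = ssc.sparkContext.parallelize(Seq(("alpha", 5), ("beta", 3)))

    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    val stateDStream = pairDStream.updateStateByKey[Int](
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)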

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Todd Nist
Hi Vinti, All of your tasks are failing based on the screen shots provided. I think a few more details would be helpful. Is this YARN or a Standalone cluster? How much overall memory is on your cluster? On each machine where workers and executors are running? Are you using the Direct

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Todd Nist
Have you looked at Apache Toree, http://toree.apache.org/. This was formerly the Spark-Kernel from IBM but contributed to apache. https://github.com/apache/incubator-toree You can find a good overview on the spark-kernel here:

Re: Spark for client

2016-03-01 Thread Todd Nist
You could also look at Apache Toree, http://toree.apache.org/ , github : https://github.com/apache/incubator-toree. This used to be the Spark Kernel from IBM but has been contributed to Apache. Good overview here on its features,

Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
cluster ? > Am I missing something obvious ? > > > On Sun, Feb 28, 2016 at 19:01, Todd Nist <tsind...@gmail.com> wrote: > >> Define your SparkConfig to set the master: >> >> val conf = new SparkConf().setAppName(AppName) >> .setMaster(SparkMaster)

Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
Define your SparkConfig to set the master: val conf = new SparkConf().setAppName(AppName) .setMaster(SparkMaster) .set() Where SparkMaster = "spark://SparkServerHost:7077". So if your spark server hostname is "RADTech" then it would be "spark://RADTech:7077". Then when you create

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-16 Thread Todd Nist
You could use the "withSessionDo" of the SparkCassandraConnector to perform the simple insert: CassandraConnector(conf).withSessionDo { session => session.execute() } -Todd On Tue, Feb 16, 2016 at 11:01 AM, Cody Koeninger wrote: > You could use sc.parallelize... but the
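A slightly fuller sketch of that pattern; the keyspace, table, and values are hypothetical:

    import com.datastax.spark.connector.cql.CassandraConnector

    CassandraConnector(sc.getConf).withSessionDo { session =>
      // executed on the driver here (or inside foreachPartition on executors)
      session.execute(
        "INSERT INTO my_ks.kafka_offsets (topic, partition_id, last_offset) VALUES ('trades', 0, 42)")
    }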

Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Todd Nist
Hi Satish, You should be able to do something like this: val props = new java.util.Properties() props.put("user", username) props.put("password", pwd) props.put("driver", "org.postgresql.Driver") val deptNo = 10 val where = Some(s"dept_number = $deptNo") val df =
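Completing that with one possible read call, here via the DataFrameReader.jdbc overload that takes predicates; the URL, table name, and credentials are placeholders:

    val username = "dbuser"        // placeholder
    val pwd = "secret"             // placeholder

    val props = new java.util.Properties()
    props.put("user", username)
    props.put("password", pwd)
    props.put("driver", "org.postgresql.Driver")

    val deptNo = 10
    val df = sqlContext.read.jdbc(
      "jdbc:postgresql://dbhost:5432/hr",     // placeholder URL
      "departments",                          // placeholder table
      Array(s"dept_number = $deptNo"),        // each predicate becomes one partition
      props)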

Re: NPE when using Joda DateTime

2016-01-14 Thread Todd Nist
I had a similar problem a while back and leveraged these Kryo serializers, https://github.com/magro/kryo-serializers. I had to fall back to version 0.28, but that was a while back. You can add these to the org.apache.spark.serializer.KryoRegistrator and then set your registrator in the spark
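A sketch of what that wiring can look like with the Joda DateTime serializer from that library; the registrator class name and package are hypothetical:

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.serializer.KryoRegistrator
    import org.joda.time.DateTime

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // register Joda DateTime with the serializer from magro/kryo-serializers
        kryo.register(classOf[DateTime], new JodaDateTimeSerializer())
      }
    }

    // and on the SparkConf:
    // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")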

Re: GroupBy on DataFrame taking too much time

2016-01-11 Thread Todd Nist
Hi Rajeshwar Gaini, dbtable can be any valid sql query, simply define it as a subquery, something like: val query = "(SELECT country, count(*) FROM customer group by country) as X" val df1 = sqlContext.read .format("jdbc") .option("url", url) .option("user", username)

Re: write new data to mysql

2016-01-08 Thread Todd Nist
Sorry, did not see your update until now. On Fri, Jan 8, 2016 at 3:52 PM, Todd Nist <tsind...@gmail.com> wrote: > Hi Yasemin, > > What version of Spark are you using? Here is the reference, it is off of > the DataFrame > https://spark.apache.org/docs/lates

Re: write new data to mysql

2016-01-08 Thread Todd Nist
that Todd mentioned or i cant find it. > The code and error are in gist > <https://gist.github.com/yaseminn/f5a2b78b126df71dfd0b>. Could you check > it out please? > > Best, > yasemin > > 2016-01-08 18:23 GMT+02:00 Todd Nist <tsind...@gmail.com>: > >> It

Re: write new data to mysql

2016-01-08 Thread Todd Nist
It is not clear from the information provided why the insertIntoJDBC failed in #2. I would note that method on the DataFrame has been deprecated since 1.4; not sure what version you're on. You should be able to do something like this:

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
That should read "I think your missing the --name option". Sorry about that. On Wed, Jan 6, 2016 at 3:03 PM, Todd Nist <tsind...@gmail.com> wrote: > Hi Jade, > > I think you "--name" option. The makedistribution should look like this: > > ./make-distr

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Hi Jade, I think you "--name" option. The makedistribution should look like this: ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests. As for why it failed to build with scala 2.11, did you run the

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
i.apache.org/confluence/display/MAVEN/PluginExecutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :spark-launcher_2.10 > > Do you think it’s java problem? I’m using oracle JDK 1.7. Should I update > it to

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Todd Nist
Another possible alternative is to register a StreamingListener and then reference the BatchInfo.numRecords; good example here, https://gist.github.com/akhld/b10dc491aad1a2007183. After registering the listener, Simply implement the appropriate "onEvent" method where onEvent is onBatchStarted,
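A compact sketch of that approach, assuming an existing StreamingContext named ssc:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        // numRecords is the total number of records in the completed batch
        println(s"Batch records: ${batchCompleted.batchInfo.numRecords}")
      }
    })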

Re: Securing objects on the thrift server

2015-12-15 Thread Todd Nist
see https://issues.apache.org/jira/browse/SPARK-11043, it is resolved in 1.6. On Tue, Dec 15, 2015 at 2:28 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > The one coming with spark 1.5.2. > > > > y > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* December-15-15 1:59 PM

Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Todd Nist
Perhaps the new trackStateByKey targeted for version 1.6 may help you here. I'm not sure whether it is part of 1.6 or not, as the JIRA does not specify a fix version. The JIRA describing it is here: https://issues.apache.org/jira/browse/SPARK-2629, and the design doc that discusses the API

Re: Spark Driver Port Details

2015-11-25 Thread Todd Nist
The default is to start applications with port 4040 and then increment them by 1 as you are seeing; see docs here: http://spark.apache.org/docs/latest/monitoring.html#web-interfaces You can override this behavior by passing --conf spark.ui.port=4080 or in your code; something like
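And the in-code variant, hedged (the port value is just an example):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-app")
      .set("spark.ui.port", "4080")   // the first app takes 4080; subsequent ones increment from there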

Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
Hi Abhi, You should be able to register a org.apache.spark.streaming.scheduler.StreamListener. There is an example here that may help: https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs here,

Re: Getting the batch time of the active batches in spark streaming

2015-11-24 Thread Todd Nist
(StreamingListenerBatchSubmitted batchSubmitted) { system.out.println("Start time: " + batchSubmitted.batchInfo.processingStartTime) } Sorry for the confusion. -Todd On Tue, Nov 24, 2015 at 7:51 PM, Todd Nist <tsind...@gmail.com> wrote: > Hi Abhi, > > You s

Re: Maven build failed (Spark master)

2015-10-27 Thread Todd Nist
I issued the same basic command and it worked fine. RADTech-MBP:spark $ ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests Which created: spark-1.6.0-SNAPSHOT-bin-hadoop-2.6.tgz in the root directory of the project.

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
2.11 artifacts are in fact published: > http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22 > > On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist <tsind...@gmail.com> wrote: > > Sorry Sean you are absolutely right it supports 2.11 all o meant is > there is > >

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Hi Bilnmek, Spark 1.5.x does not support Scala 2.11.7, so the easiest thing to do is build it like you're trying. Here are the steps I followed to build it on a Mac OS X 10.10.5 environment; it should be very similar on ubuntu. 1. set the JAVA_HOME environment variable in my bash session via export

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
t support 2.11? It does. > > It is not even this difficult; you just need a source distribution, > and then run "./dev/change-scala-version.sh 2.11" as you say. Then > build as normal > > On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist <tsind...@gmail.com > <javascrip

Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Todd Nist
Hi Yifan, You could also try increasing spark.kryoserializer.buffer.max.mb (64 MB by default): useful if your default buffer size goes further than 64 MB. Per doc: Maximum allowable size of Kryo serialization buffer. This must be larger than any object

Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Todd Nist
From Tableau, you should be able to use the Initial SQL option to support this: So in Tableau add the following to the “Initial SQL” create function myfunc AS 'myclass' using jar 'hdfs:///path/to/jar'; HTH, Todd On Mon, Oct 19, 2015 at 11:22 AM, Deenar Toraskar

Re: KafkaProducer using Cassandra as source

2015-09-23 Thread Todd Nist
Hi Kali, If you do not mind sending JSON, you could do something like this, using json4s: val rows = p.collect() map ( row => TestTable(row.getString(0), row.getString(1)) ) val json = parse(write(rows)) producer.send(new KeyedMessage[String, String]("trade", writePretty(json))) // or for

Re: Replacing Esper with Spark Streaming?

2015-09-14 Thread Todd Nist
Stratio offers a CEP implementation based on Spark Streaming and the Siddhi CEP engine. I have not used the below, but they may be of some value to you: http://stratio.github.io/streaming-cep-engine/ https://github.com/Stratio/streaming-cep-engine HTH. -Todd On Sun, Sep 13, 2015 at 7:49 PM,

Re: Tungsten and Spark Streaming

2015-09-10 Thread Todd Nist
https://issues.apache.org/jira/browse/SPARK-8360?jql=project%20%3D%20SPARK%20AND%20text%20~%20Streaming -Todd On Thu, Sep 10, 2015 at 10:22 AM, Gurvinder Singh < gurvinder.si...@uninett.no> wrote: > On 09/10/2015 07:42 AM, Tathagata Das wrote: > > Rewriting is necessary. You will have to

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Todd Nist
on a streaming app ? Thanks again. Daniel On Thu, Aug 6, 2015 at 1:53 AM, Todd Nist tsind...@gmail.com wrote: Hi Danniel, It is possible to create an instance of the SparkSQL Thrift server, however seems like this project is what you may be looking for: https://github.com/Intel-bigdata/spark

Re: How can I know currently supported functions in Spark SQL

2015-08-06 Thread Todd Nist
They are covered here in the docs: http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wanglong_...@163.com wrote: Hi All, I am using Spark 1.4.1, and I want to know how can I find the complete function

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Todd Nist
Hi Danniel, It is possible to create an instance of the SparkSQL Thrift server, however it seems like this project is what you may be looking for: https://github.com/Intel-bigdata/spark-streamingsql Not 100% sure what your use case is, but you can always convert the data into a DF and then issue a query

Re: Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Todd Nist
There is one package available on the spark-packages site, http://spark-packages.org/package/Stratio/RabbitMQ-Receiver The source is here: https://github.com/Stratio/RabbitMQ-Receiver Not sure that meets your needs or not. -Todd On Mon, Jul 20, 2015 at 8:52 AM, Jeetendra Gangele

Re: Use rank with distribute by in HiveContext

2015-07-16 Thread Todd Nist
Did you take a look at the excellent write up by Yin Huai and Michael Armbrust? It appears that rank is supported in the 1.4.x release. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html Snippet from above article for your convenience: To answer the first

Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are three connector packages listed on the spark-packages web site: http://spark-packages.org/?q=hbase HTH. -Todd On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi I have a requirement of writing in hbase table from Spark streaming app after some

Re: Saving RDD into cassandra keyspace.

2015-07-10 Thread Todd Nist
I would strongly encourage you to read the docs at, they are very useful in getting up and running: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md For your use case shown above, you will need to ensure that you include the appropriate version of the

Re: [X-post] Saving SparkSQL result RDD to Cassandra

2015-07-09 Thread Todd Nist
foreachRDD returns a unit: def foreachRDD(foreachFunc: RDD[T] => Unit): Unit (RDD: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html) Apply a function to each RDD in this DStream. This is an output operator, so 'this' DStream will be registered as an output stream and

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
to be a limitation at this time. -Todd On Thu, Jul 2, 2015 at 4:13 PM, Mulugeta Mammo mulugeta.abe...@gmail.com wrote: thanks but my use case requires I specify different start and max heap sizes. Looks like spark sets start and max sizes same value. On Thu, Jul 2, 2015 at 1:08 PM, Todd Nist tsind

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
You should use spark.executor.memory; from the docs https://spark.apache.org/docs/latest/configuration.html: spark.executor.memory (default 512m) - Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). -Todd On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta Mammo

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Todd Nist
You can get HDP with at least 1.3.1 from Horton: http://hortonworks.com/hadoop-tutorial/using-apache-spark-technical-preview-with-hdp-2-2/ for your convenience, from the docs: wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.2.4.4/hdp.repo -O /etc/yum.repos.d/HDP-TP.repo

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Todd Nist
Hi Proust, Is it possible to see the query you are running and can you run EXPLAIN EXTENDED to show the physical plan for the query. To generate the plan you can do something like this from $SPARK_HOME/bin/beeline: 0: jdbc:hive2://localhost:10001 explain extended select * from YourTableHere;

Re: Spark 1.4 release date

2015-06-12 Thread Todd Nist
It was released yesterday. On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote: Hi When is official spark 1.4 release date? Best Ayan

Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread Todd Nist
Hi Gaurav, Seems like you could use a broadcast variable for this if I understand your use case. Create it in the driver based on the CommandLineArguments and then use it in the workers. https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables So something like:
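A minimal sketch of that pattern; the argument parsing and threshold are illustrative:

    // Parse once on the driver, broadcast to all executors
    case class CliArgs(threshold: Int)
    val argsBc = sc.broadcast(CliArgs(args(0).toInt))

    val nums = sc.parallelize(1 to 100)
    val filtered = nums.filter(_ > argsBc.value.threshold)   // executors read the broadcast value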

Re: Spark SQL and Streaming Results

2015-06-05 Thread Todd Nist
There used to be a project, StreamSQL ( https://github.com/thunderain-project/StreamSQL), but it appears a bit dated and I do not see it in the Spark repo, though I may have missed it. @TD Is this project still active? I'm not sure what the status is but it may provide some insights on how to achieve

Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-23 Thread Todd Nist
://datastax-oss.atlassian.net/browse/SPARKC-98 is still open... On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote: I'm using the spark-cassandra-connector from DataStax in a spark streaming job launched from my own driver. It is connecting a a standalone cluster on my local box which

spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Todd Nist
I'm using the spark-cassandra-connector from DataStax in a spark streaming job launched from my own driver. It is connecting to a standalone cluster on my local box which has two workers running. This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT. I have added the following entry to

Re: Question about Serialization in Storage Level

2015-05-21 Thread Todd Nist
From the docs, https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence: Storage Level MEMORY_ONLY - Meaning: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're
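A tiny sketch contrasting the level described above with its serialized counterpart:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_ONLY)         // deserialized Java objects in the JVM
    // rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized: more CPU to read, less memory used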

Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Todd Nist
I believe you're looking for df.na.fill in Scala; in the pySpark module it is fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html) from the docs: df4.fillna({'age': 50, 'name': 'unknown'}).show() gives rows (age, height, name): (10, 80, Alice), (5, null, Bob), (50, null, Tom), (50, null, unknown) On
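For reference, a Scala equivalent of that call, assuming a DataFrame named df4 with age and name columns:

    val filled = df4.na.fill(Map("age" -> 50, "name" -> "unknown"))
    filled.show()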

Re: group by and distinct performance issue

2015-05-19 Thread Todd Nist
You may want to look at this tooling for helping identify performance issues and bottlenecks: https://github.com/kayousterhout/trace-analysis I believe this is slated to become part of the web ui in the 1.4 release, in fact based on the status of the JIRA,

Re: value toDF is not a member of RDD object

2015-05-13 Thread Todd Nist
I believe what Dean Wampler was suggesting is to use the sqlContext not the sparkContext (sc), which is where the createDataFrame function resides: https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.SQLContext HTH. -Todd On Wed, May 13, 2015 at 6:00 AM, SLiZn Liu
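A small sketch showing both routes from an RDD of a case class to a DataFrame; Person is illustrative:

    case class Person(name: String, age: Int)
    val rdd = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

    // via the SQLContext
    val df1 = sqlContext.createDataFrame(rdd)

    // or via the implicit conversion
    import sqlContext.implicits._
    val df2 = rdd.toDF()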

Re: Spark does not delete temporary directories

2015-05-07 Thread Todd Nist
Have you tried to set the following? spark.worker.cleanup.enabled=true spark.worker.cleanup.appDataTtl=<seconds> On Thu, May 7, 2015 at 2:39 AM, Taeyun Kim taeyun@innowireless.com wrote: Hi, After a spark program completes, there are 3 temporary directories remain in the temp

Re: AvroFiles

2015-05-05 Thread Todd Nist
Are you using Kryo or Java serialization? I found this post useful: http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist If using kryo, you need to register the classes with kryo, something like this: sc.registerKryoClasses(Array(

Parquet Partition Strategy - how to partition data correctly

2015-05-05 Thread Todd Nist
Hi, I have a DataFrame that represents my data and looks like this: +----------+-----------+ | col_name | data_type | +----------+-----------+ | obj_id | string | | type | string | | name

Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-01 Thread Todd Nist
*Resending as I do not see that this made it to the mailing list; sorry if in fact it did and is just not reflected online yet.* I’m very perplexed with the following. I have a set of AVRO generated objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming job follows the

Spark Streaming Kafka Avro NPE on deserialization of payload

2015-04-30 Thread Todd Nist
I’m very perplexed with the following. I have a set of AVRO generated objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming job follows the receiver-based approach. I am encountering the below error when I attempt to de serialize the payload: 15/04/30 17:49:25 INFO

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Todd Nist
Can you simply apply the https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.util.StatCounter to this? You should be able to do something like this: val stats = rdd.map(x => x._2).stats() -Todd On Tue, Apr 28, 2015 at 10:00 AM, subscripti...@prismalytics.io
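Expanding that suggestion slightly; the pair RDD below is illustrative, and the per-key variant is a hedged alternative since stats() itself is global:

    val pairs = sc.parallelize(Seq(("a", 2.0), ("a", 4.0), ("b", 10.0)))

    // Global stats over all values, as suggested above (count, mean, stdev, min, max)
    val stats = pairs.map(x => x._2).stats()

    // Per-key averages via a sum/count aggregation
    val avgByKey = pairs
      .mapValues(v => (v, 1L))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }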

Re: Cannot saveAsParquetFile from a RDD of case class

2015-04-14 Thread Todd Nist
I think the docs are correct. If you follow the example from the docs and add the import shown below, I believe you will get what you're looking for: // This is used to implicitly convert an RDD to a DataFrame. import sqlContext.implicits._ You could also simply take your rdd and do the following:

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-09 Thread Todd Nist
down where the dependency was coming from. Based on Patrick's comments it sounds like this is now resolved. Sorry for the confusion. -Todd On Wed, Apr 8, 2015 at 4:38 PM, Todd Nist tsind...@gmail.com wrote: Hi Mohammed, I think you just need to add -DskipTests to your build. Here is how I built

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
To use the HiveThriftServer2.startWithContext, I thought one would use the following artifact in the build: "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0" But I am unable to resolve the artifact. I do not see it in maven central or any other repo. Do I need to build Spark and

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-08 Thread Todd Nist
org.apache.spark#spark-network-shuffle_2.10;1.3.0 test [error] Total time: 106 s, completed Apr 8, 2015 12:33:45 PM Mohammed *From:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Wednesday, April 8, 2015 11:54 AM *To:* Mohammed Guller *Cc:* Todd Nist; James Aley; user; Patrick

Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Todd Nist
In 1.2.1 I was persisting a set of parquet files as a table for use by the spark-sql cli later on. There was a post here http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-tt16297.html#a16311 by Michael Armbrust that provides a nice little helper method for dealing

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
is download location ? On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote: Started the spark shell with the one jar from hive suggested: ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
definition (code) of UDF json_tuple. That should solve your problem. On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote: I placed it there. It was downloaded from MySql site. On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Akhil you mentioned /usr/local

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Todd Nist
Thanks Best Regards On Fri, Apr 3, 2015 at 2:55 PM, Todd Nist tsind...@gmail.com wrote: Hi Akhil, This is for version 1.2.1. Well the other thread that you reference was me attempting it in 1.3.0 to see if the issue was related to 1.2.1. I did not build Spark but used the version from

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
What version of Cassandra are you using? Are you using DSE or the stock Apache Cassandra version? I have connected it with DSE, but have not attempted it with the standard Apache Cassandra version. FWIW,

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
in Tableau using the ODBC driver that comes with DSE. Once you connect, Tableau allows to use C* keyspace as schema and column families as tables. Mohammed *From:* pawan kumar [mailto:pkv...@gmail.com] *Sent:* Friday, April 3, 2015 7:41 AM *To:* Todd Nist *Cc:* user@spark.apache.org; Mohammed

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
@Pawan Not sure if you have seen this or not, but here is a good example by Jonathan Lacefield of Datastax on hooking up SparkSQL with DSE; adding Tableau is as simple as Mohammed stated with DSE. https://github.com/jlacefie/sparksqltest. HTH, Todd On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
are in the remote node. I am not sure if i need to install spark and its dependencies in the webui (zepplene) node. I am not sure talking about zepplelin in this thread is right. Thanks once again for all the help. Thanks, Pawan Venugopal On Fri, Apr 3, 2015 at 11:48 AM, Todd Nist tsind

Re: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-03 Thread Todd Nist
CalliopeServer2, which works like a charm with BI tools that use JDBC, but unfortunately Tableau throws an error when it connects to it. Mohammed *From:* Todd Nist [mailto:tsind...@gmail.com] *Sent:* Friday, April 3, 2015 11:39 AM *To:* pawan kumar *Cc:* Mohammed Guller; user@spark.apache.org

Re: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
Hi Young, Sorry for the duplicate post, want to reply to all. I just downloaded the bits prebuilt form apache spark download site. Started the spark shell and got the same error. I then started the shell as follows: ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Todd Nist
. If you want the specific jar, you could look fr jackson or json serde in it. Thanks Best Regards On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote: I have a feeling I’m missing a Jar that provides the support or could this may be related to https://issues.apache.org/jira

Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error: scala> import org.apache.spark.sql.hive._ import org.apache.spark.sql.hive._ scala> val sqlContext = new HiveContext(sc) sqlContext:

SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
I am accessing ElasticSearch via elasticsearch-hadoop and attempting to expose it via SparkSQL. I am using spark 1.2.1, the latest supported by elasticsearch-hadoop, and "org.elasticsearch" % "elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT" of elasticsearch-hadoop. I’m encountering an issue when I

Re: Query REST web service with Spark?

2015-03-31 Thread Todd Nist
Here are a few ways to achieve what you're looking to do: https://github.com/cjnolet/spark-jetty-server Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue -

Re: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
at 3:26 PM, Todd Nist tsind...@gmail.com wrote: I am accessing ElasticSearch via the elasticsearch-hadoop and attempting to expose it via SparkSQL. I am using spark 1.2.1, latest supported by elasticsearch-hadoop, and org.elasticsearch % elasticsearch-hadoop % 2.1.0.BUILD-SNAPSHOT of elasticsearch

Re: Spark as a service

2015-03-24 Thread Todd Nist
Perhaps this project, https://github.com/calrissian/spark-jetty-server, could help with your requirements. On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele jeffrey.jed...@gmail.com wrote: I don't think there's are general approach to that - the usecases are just to different. If you really need

Re: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-19 Thread Todd Nist
: Seems the elasticsearch-hadoop project was built with an old version of Spark, and then you upgraded the Spark version in execution env, as I know the StructField changed the definition in Spark 1.2, can you confirm the version problem first? *From:* Todd Nist [mailto:tsind...@gmail.com] *Sent
