Re: SparkContext & Threading

2015-06-06 Thread Lee McFadden
Jun 6, 2015, 12:21 AM Will Briggs wrote: > Hi Lee, it's actually not related to threading at all - you would still > have the same problem even if you were using a single thread. See this > section ( > https://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to

Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrote: > Your lambda expressions on the RDDs in the SecondRollup class are closing > around the context, and Spark has special logic to ensure that all > variables in a closure used on an RDD are Serializable - I hate linking to > Quora, but there's a go
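
A minimal sketch of the pattern this thread describes (the SecondRollup name comes from the thread; everything else is illustrative): copy the values a lambda needs into local variables so the closure no longer captures the enclosing class that holds the SparkContext.

    // Illustrative reconstruction, not the original code
    class SecondRollup(sc: org.apache.spark.SparkContext, multiplier: Int) {
      def run(): Unit = {
        // Referencing `multiplier` directly inside the lambda would capture `this`
        // (and its SparkContext) and fail the Serializable check.
        val localMultiplier = multiplier            // copy to a local value first
        val result = sc.parallelize(1 to 10)
          .map(_ * localMultiplier)                 // closure now only captures an Int
          .collect()
        result.foreach(println)
      }
    }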

Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 1:00 PM Igor Berman wrote: > Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos? > Spark standalone, v1.2.1.

Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin wrote: > You didn't show the error so the only thing we can do is speculate. You're > probably sending the object that's holding the SparkContext reference over > the network at some point (e.g. it's used by a task run in an executor), > and that's w

Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin wrote: > Ignoring the serialization thing (seems like a red herring): > People seem surprised that I'm getting the Serialization exception at all - I'm not convinced it's a red herring per se, but on to the blocking issue... > > You might be using

Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
ool to complete, although it's not really required at the moment as I am only submitting one job until I get this issue straightened out :) Thanks, Lee On Fri, Jun 5, 2015 at 11:50 AM Marcelo Vanzin wrote: > On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden wrote: > >> In

SparkContext & Threading

2015-06-05 Thread Lee McFadden
y and haven't found any docs to point me in the right direction. Does anyone have any advice on how to get jobs submitted by multiple threads? The jobs are fairly simple and work when I run them serially, so I'm not exactly sure what I'm doing wrong. Thanks, Lee

Hive Skew flag?

2015-05-15 Thread Denny Lee
Just wondering if we have any timeline on when the hive skew flag will be included within SparkSQL? Thanks! Denny

Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: Apache Hive > Language Manual DML - Delete) while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive 0.14 or generate a new

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
nagement is (thankfully) > different from Python dependency management. > > As far as I can tell, there is no core issue, upstream or otherwise. > > > > > > > On Tue, May 12, 2015 at 11:39 AM, Lee McFadden wrote: > >> Thanks again for all the help folks. >&

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
Thanks again for all the help folks. I can confirm that simply switching to `--packages org.apache.spark:spark-streaming-kafka-assembly_2.10:1.3.1` makes everything work as intended. I'm not sure what the difference is between the two packages honestly, or why one should be used over the other, b

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
s basically working-as-intended. > > On Tue, May 12, 2015 at 3:19 AM, Lee McFadden wrote: > > I opened a ticket on this (without posting here first - bad etiquette, > > apologies) which was closed as 'fixed'. > > > > https://issues.apache.org/jira/browse/SPARK-

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
istribution and my application itself can't be introducing java dependency clashes? On Mon, May 11, 2015, 4:34 PM Lee McFadden wrote: > Ted, many thanks. I'm not used to Java dependencies so this was a real > head-scratcher for me. > > Downloading the two metrics package

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
;t in the assembly, is it? you'd > have to provide it and all its dependencies with your app. You could > also build this into your own app jar. Tools like Maven will add in > the transitive dependencies. > > On Mon, May 11, 2015 at 10:04 PM, Lee McFadden wrote: > > Tha

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
e.Gauge is in metrics-core jar > e.g., in master branch: > [INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile > [INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile > > Please make sure metrics-core jar is on the classpath. > > On Mon, May 11, 2015 at 1:3

Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit command

Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we build Puppet manifests so we could do the automation - it's a bit of work to set up, but well worth the effort. On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler wrote: > It's mostly manual. You could try automating with something like Chef, of > course, but there's

Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself. For example, my own Thrift start command is in the form: ./sbin/start-thriftserver.sh --master spark://$myserver:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host $myserver --hiveconf hive.server2.thrift.port 1 HTH! O

RE: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
.com > CC: user@spark.apache.org > > I think you want to take a look at: > https://issues.apache.org/jira/browse/SPARK-6207 > > On Mon, Apr 20, 2015 at 1:58 PM, Andrew Lee wrote: > > Hi All, > > > > Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1 > > >

GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
Hi All, Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1 Posting this problem to user group first to see if someone is encountering the same problem. When submitting spark jobs that invoke HiveContext APIs on a Kerberos Hadoop + YARN (2.4.1) cluster, I'm getting this error. javax.security.sasl.

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :) On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra wrote: > Almost. Jobs don't get skipped. Stages and Tasks do if the needed > results are already available. > > On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee wrote: > >> The job is skipp

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
The job is skipped because the results are available in memory from a prior run. More info at: http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E. HTH! On Sun, Apr 19, 2015 at 1:43 PM James King wrote: > In th

Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please refer to SPARK-4226. BTW, Spark 1.3 by default binds to Hive 0.13.1. On Fri, Apr 17, 2015 at 09:18 ARose wrote: > So I'm trying to store the results of a query into a DataFrame, but I get > the > following exception thrown

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use the classpath.first or perhaps copy the jar to the slaves could that actually do the trick? The latter isn't really all that efficient but just curious if that could do the trick. On Thu, Apr 16, 2015 at 7:14 AM ARose wrote: > I take it back. My so

Re: Converting Date pattern in scala code

2015-04-14 Thread Denny Lee
If you're doing it in Scala per se - then you can probably just reference JodaTime or the Java Date / Time classes. If you are using SparkSQL, then you can use the various Hive date functions for conversion. On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA wrote: > I need some help to convert the date patte
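
A minimal sketch of the plain-Scala route mentioned above, using java.text.SimpleDateFormat; the source and target patterns here are assumptions, not from the original question.

    import java.text.SimpleDateFormat
    val inputFormat  = new SimpleDateFormat("yyyy-MM-dd")    // assumed source pattern
    val outputFormat = new SimpleDateFormat("MM/dd/yyyy")    // assumed target pattern
    val converted = outputFormat.format(inputFormat.parse("2015-04-14"))
    // converted == "04/14/2015"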

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to Hive 0.12 if you specify it in the profile when building Spark as per https://spark.apache.org/docs/1.3.0/building-spark.html. If you are downloading a pre built version of Spark 1.3 - then by default, it is set to Hive 0.1

Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive? If you're getting the same error within Hive, it sounds like a permissions issue as per Bojan. More info can be found at: http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error On Thu, Apr 9, 201

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct - at this time MS SQL Server is not supported through the JDBC data source. In my environment, we've been using Hadoop streaming to extract data from multiple SQL Servers, pushing the data into HDFS, creating the Hive tables and/or converting them into Parquet, and t

Re: Microsoft SQL jdbc support from spark sql

2015-04-06 Thread Denny Lee
At this time, the JDBC Data source is not extensible so it cannot support SQL Server. There were some thoughts - credit to Cheng Lian for this - about making the JDBC data source extensible for third-party support, possibly via slick. On Mon, Apr 6, 2015 at 10:41 PM bipin wrote: > Hi, I am try

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
I think something like this would work. You might need to play with the > type. > > df.explode("arrayBufferColumn") { x => x } > > > > On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee wrote: > >> Thanks Dean - fun hack :) >> >> On Fri, Apr 3, 2015

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
eilly.com/product/0636920033073.do> (O'Reilly) > Typesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee wrote: > >> Thanks Michael - that was it! I was

Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Apr 2, 2015 at 7:10 PM, Denny Lee wrote: > >> Quick question - the output of a dataframe is in the format of: >> >> [2015-04, ArrayBuffer(A, B, C, D)] >> >> and I'd like to return it as: >> >> 2015-04, A >> 2015-04, B >> 2015-04, C >> 2015-04, D >> >> What's the best way to do this? >> >> Thanks in advance! >> >> >> >

ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
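
One way to get the flattened output asked for here (an alternative to the DataFrame explode call suggested elsewhere in the thread) is a plain flatMap; a hedged sketch with made-up data mirroring the question:

    // (month, items) pairs mirroring [2015-04, ArrayBuffer(A, B, C, D)]
    val rows = sc.parallelize(Seq(("2015-04", Seq("A", "B", "C", "D"))))
    val flattened = rows.flatMap { case (month, items) =>
      items.map(item => (month, item))             // emit one (month, item) row per element
    }
    flattened.collect().foreach(println)           // (2015-04,A) ... (2015-04,D)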

Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :) On Wed, Apr 1, 2015 at 00:08 Felix Cheung wrote: > This is tracked by these JIRAs.. > > https://issues.apache.org/jira/browse/SPARK-5947 > https://issues.apache.org/jira/browse/SPARK-5948 > > -- > From: denny.g@gmail.com > Date: Wed, 1 Apr 2015 04:

Creating Partitioned Parquet Tables via SparkSQL

2015-03-31 Thread Denny Lee
Creating Parquet tables via .saveAsTable is great but was wondering if there was an equivalent way to create partitioned parquet tables. Thanks!

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent, This may be a case that you're missing a semi-colon after your CREATE TEMPORARY TABLE statement. I ran your original statement (missing the semi-colon) and got the same error as you did. As soon as I added it in, I was good to go again: CREATE TEMPORARY TABLE jsonTable USING org.apa

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
ld you > mind to open a JIRA for this? > > Cheng > > On 3/27/15 2:40 PM, Pei-Lun Lee wrote: > > I'm using 1.0.4 > > Thanks, > -- > Pei-Lun > > On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian wrote: > >> Hm, which version of Hadoop are you using? Actu

Re: Hive Table not found from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that your Hive metastore that you can connect to via Hive is a MySQL database? And to also confirm, when you're running spark-shell and doing a "show tables" statement, you're getting the same error? On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
ersion matters here, but I did observe > cases where Spark behaves differently because of semantic differences of > the same API in different Hadoop versions. > > Cheng > > On 3/27/15 11:33 AM, Pei-Lun Lee wrote: > > Hi Cheng, > > on my computer, execute res0.save(

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
n_metadata file is typically much smaller than _metadata, > because it doesn’t contain row group information, and thus can be faster to > read than _metadata. > > Cheng > > On 3/26/15 12:48 PM, Pei-Lun Lee wrote: > > Hi, > > When I save parquet file with SaveMode.Overwrite,

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what are you using? The error you are seeing is common when the correct driver to allow Spark to connect to the Hive metastore isn't there. As well, I noticed that you're using SPARK_CLAS

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the preaggregation of data using hyperloglog in combination with Spark is atscale (http://atscale.com/). It builds the aggregations and makes use of the speed of SparkSQL - all within the context of a model that is accessible by Tableau or Qlik. On Thu

Re: Which OutputCommitter to use for S3?

2015-03-25 Thread Pei-Lun Lee
I updated the PR for SPARK-6352 to be more like SPARK-3595. I added a new setting "spark.sql.parquet.output.committer.class" in hadoop configuration to allow custom implementation of ParquetOutputCommitter. Can someone take a look at the PR? On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun
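
If the PR were merged as described, usage would presumably look like the sketch below; the setting name comes from the message, while the committer class value is a hypothetical placeholder.

    // Hedged sketch: point the proposed setting at a custom ParquetOutputCommitter
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "com.example.DirectParquetOutputCommitter")  // hypothetical implementation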

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit of _common_metadata? Will reading perform better when it is present? Thanks, -- Pei-Lun

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame perspective: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang wrote: > Hi, > > > > I ha
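
For context, one common workaround at the time (Spark 1.3 had no built-in stddev aggregate) was to drop down to the column's RDD; a hedged sketch, assuming df is an existing DataFrame and "value" is a numeric column:

    val values = df.select("value").rdd.map(_.getDouble(0))
    val stddev = values.stdev()    // via DoubleRDDFunctions on RDD[Double]
    println(s"stddev = $stddev")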

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section noting that you can modify these properties via the spark-defaults.conf or via SparkConf(). HTH!
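
A minimal sketch of setting it programmatically via SparkConf, as the reply suggests; the 2g value is just an example.

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.driver.maxResultSize", "2g")   // default is 1g
    val sc = new SparkContext(conf)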

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
t; instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient* > > Cheers, > Sandeep.v > > On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura > wrote: > >> No I am just running ./spark-shell command in terminal I will try with >> above command >> >> On Wed,

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar

Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Denny Lee
Hadoop 2.5 would be referenced via -Dhadoop.version=2.5 using the profile -Phadoop-2.4. Please note the section earlier in the link: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the -Ph

Re: Standalone Scheduler VS YARN Performance

2015-03-24 Thread Denny Lee
By any chance does this thread address look similar: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html ? On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan < harut.martiros...@gmail.com> wrote: > What is performance overhead caused by YARN

Re: Should I do spark-sql query on HDFS or hive?

2015-03-23 Thread Denny Lee
From the standpoint of Spark SQL accessing the files - when it is hitting Hive, it is in effect hitting HDFS as well. Hive provides a great framework where the table structure is already well defined. But underneath it, Hive is just accessing files from HDFS so you are hitting HDFS either way.

Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2 cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in my cluster. On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin wrote: > Since you're using YARN, you should be able to download a Spark 1.3.0 > tarball

Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do this: https://github.com/sigmoidanalytics/spork On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin wrote: > Hi, all > > > > Can spark use pig’s load function to load data? > > > > Best Regards, > > Kevin. >

Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
How are you running your spark instance out of curiosity? Via YARN or standalone mode? When connecting Spark thriftserver to the Spark service, have you allocated enough memory and CPU when executing with spark? On Sun, Mar 22, 2015 at 3:39 AM fanooos wrote: > We have cloudera CDH 5.3 installe

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee wrote: > Hi, > > I am trying jdbc data source in spark sql 1.3.0 and found some issues. > > First, the syntax

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying jdbc data source in spark sql 1.3.0 and found some issues. First, the syntax "where str_col='value'" will give error for both postgresql and mysql: psql> create table foo(id int primary key,name text,age int); bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar spark/bin/spark-s
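
For reference, a hedged sketch of the Spark 1.3.0 JDBC data source usage being tested here; the connection string and table are illustrative, and the driver jar is assumed to be on the classpath as in the SPARK_CLASSPATH example above.

    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://localhost/test",
      "dbtable" -> "foo"))
    df.filter("age > 20").show()   // numeric predicates worked; string ones hit the reported error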

Re: Which OutputCommitter to use for S3?

2015-03-16 Thread Pei-Lun Lee
ect dependency makes this injection much more > difficult for saveAsParquetFile. > > On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee wrote: > >> Thanks for the DirectOutputCommitter example. >> However I found it only works for saveAsHadoopFile. What about >> saveAsParquetFile? &

Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares, If you dig into the descriptions for the two jobs, it will probably return something like: Job ID: 1 org.apache.spark.rdd.RDD.takeSample(RDD.scala:447) $line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:22) ... Job ID: 0 org.apache.spark.rdd.RDD.takeSample(RDD.scala:428) $line41.$
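
In other words, a single takeSample call can show up as two jobs in the UI because it first counts the RDD and then runs the sampling pass; a tiny sketch:

    val rdd = sc.parallelize(1 to 1000000)
    // one job to count the RDD, a second to collect the sample
    // (matching the two RDD.takeSample stack traces listed above)
    val sample = rdd.takeSample(withReplacement = false, num = 10)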

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Pei-Lun Lee
Thanks for the DirectOutputCommitter example. However I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass of FileOutputCommitter. On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor wrote: > FYI. We're cur

Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup but one of the locations is /var/log/mesos. On Wed, Mar 4, 2015 at 19:11 lisendong wrote: > I'm sorry, but how do I look at the mesos logs? > Where are they? > > > > On Mar 4, 2015, at 6:06 PM, Akhil Das wrote: > > > You can check in the mesos logs and see what's really happening. > >

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
e > location of default database for the > warehouse > > > Do I need to do anything explicitly other than placing hive-site.xml in > the spark.conf directory ? > > Thanks !! > > > > On Wed, Feb 25, 2015 at 11:42 AM, Denny Lee wrote: > >

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) Could you verify that you (the user you are running under) has the rights to create th

Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ? On Tue, Feb 24, 2015 at 16:40 Xi Shen wrote: > Hi Sean, > > I launched the spark-shell on the same machine as I started YARN service. > I don't think port

Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel, My team is currently working with a lot of SQL Server databases as one of our many data sources and ultimately we pull the data into HDFS from SQL Server. As we had a lot of SQL databases to hit, we used the jTDS driver and SQOOP to extract the data out of SQL Server and into HDFS (smal

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
015 at 12:29 AM, Davies Liu wrote: > How many executors you have per machine? It will be helpful if you > could list all the configs. > > Could you also try to run it without persist? Caching do hurt than > help, if you don't have enough memory. > > On Fri, Feb 20, 2015 a

Re: Spark SQL odbc on Windows

2015-02-23 Thread Denny Lee
imited. And thanks for writing the klout paper!! We were already > using it as a guideline for our tests. > > Best regards, > Francisco > -- > From: Denny Lee > Sent: ‎22/‎02/‎2015 17:56 > To: Ashic Mahtab ; Francisco Orchard ; > Apache Spark

Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Back to thrift, there was an earlier thread on this topic at http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E that may be useful as well. On Sun Feb 22 2015 at 8:42:29 AM Denny Lee wrote: > Hi Franci

Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Hi Francisco, Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my interest. The one thing that you may run into is that the SQL generated by SSAS can be quite convoluted. When we were doing the same thing t

Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions. I'm experimenting with different values for spark memoryOverhead and explictly giving the executors more memory, but still have not found the golden medium to get it to finish in a proper time frame. Is my cluster massively undersized at 5 boxes, 8gb 2cpu ? Trying to fi

Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
t; > On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee wrote: > >> Quickly reviewing the latest SQL Programming Guide >> <https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md> >> (in github) I had a couple of quick questions: >> >> 1) Do we need t

Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Quickly reviewing the latest SQL Programming Guide (in github) I had a couple of quick questions: 1) Do we need to instantiate the SparkContext as per // sc is an existing SparkContext. val sqlContext = new org.apache.spar
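
For completeness, the instantiation the guide describes looks like this sketch, assuming sc is an existing SparkContext:

    // sc is an existing SparkContext
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // in Spark 1.3, toDF() and related conversions come from the context's implicits
    import sqlContext.implicits._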

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-17 Thread Andrew Lee
HI All, Just want to give everyone an update of what worked for me. Thanks for Cheng's comment and other ppl's help. So what I misunderstood was the --driver-class-path and how that was related to --files. I put both /etc/hive/hive-site.xml in both --files and --driver-class-path when I started

RE: SparkSQL + Tableau Connector

2015-02-17 Thread Andrew Lee
: Running query ' cache table test ' 15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with curMem=101514, maxMem=278019440 15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 206.4 KB, free 264.8 MB) I see no way in

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the logs since there were other activities going on on the cluster. From: alee...@hotmail.com To: ar...@sigmoidanalytics.com; tsind...@gmail.com CC: user@spark.apache.org Subject: RE: SparkSQL + Tableau Connector Date: Wed,

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
heck your hive-site.xml. Are you directing to the hive server 2 port instead of spark thrift port? Their default ports are both 1. From: Andrew Lee [mailto:alee...@hotmail.com] Sent: Wednesday, February 11, 2015 12:00 PM To: sjbrunst; user@spark.apache.org Subject: RE: Is the Th

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
I have ThriftServer2 up and running, however, I notice that it relays the query to HiveServer2 when I pass the hive-site.xml to it. I'm not sure if this is the expected behavior, but based on what I have up and running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez query

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
I'm using mysql as the metastore DB with Spark 1.2. I simply copied the hive-site.xml to /etc/spark/ and added the mysql JDBC JAR to spark-env.sh in /etc/spark/, and everything works fine now. My setup looks like this: Tableau => Spark ThriftServer2 => HiveServer2. It's talking to Tableau Desktop 8.3. In

RE: hadoopConfiguration for StreamingContext

2015-02-10 Thread Andrew Lee
It looks like this is related to the underlying Hadoop configuration. Try to deploy the Hadoop configuration with your job with --files and --driver-class-path, or to the default /etc/hadoop/conf core-site.xml. If that is not an option (depending on how your Hadoop cluster is setup), then hard co

Re: Tableau beta connector

2015-02-05 Thread Denny Lee
t(sc) > > > Do some processing on RDD and persist it on hive using registerTempTable > > and tableau can extract that RDD persisted on hive. > > > Regards, > > Ashutosh > > > -- > *From:* Denny Lee > > *Sent:* Thursday, Fe

Re: Tableau beta connector

2015-02-04 Thread Denny Lee
rrect me if I am wrong. > > > I guess I have to look at how thrift server works. > -- > *From:* Denny Lee > *Sent:* Thursday, February 5, 2015 12:20 PM > *To:* İsmail Keskin; Ashutosh Trivedi (MT2013030) > *Cc:* user@spark.apache.org > *Subjec

Re: Tableau beta connector

2015-02-04 Thread Denny Lee
Some quick context behind how Tableau interacts with Spark / Hive can also be found at https://www.concur.com/blog/en-us/connect-tableau-to-sparksql - it's about how to connect from Tableau to the thrift server before the official Tableau beta connector, but should provide some of the additional conte

Re: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun, I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely for development purposes). I had most recently installed them utilizing Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy thread concerning the null\bin\winutils issue is addressed in

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted is at: OLAP with Cassandra and Spark http://www.slideshare.net/EvanChan2/2014-07olapcassspark. On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad wrote: > Write out the rdd to a cassandra table. The datastax driver provid

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-03 Thread Andrew Lee
Hi All, In Spark 1.2.0-rc1, I have tried to set the hive.metastore.warehouse.dir to share with the Hive warehouse location on HDFS, however, it does NOT work on yarn-cluster mode. On the Namenode audit log, I see that spark is trying to access the default hive warehouse location which is /user/

Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
' suffix is legitimate. > > On Sun, Feb 1, 2015 at 9:14 AM, Denny Lee wrote: > >> I may be missing something here but typically when the hive-site.xml >> configurations do not require you to place "s" within the configuration >> itself. Both the retry.dela

Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml configurations do not require you to place "s" within the configuration value itself. Both the retry.delay and socket.timeout values are in seconds, so you should only need to place the integer value. On Sun Feb

Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?

2014-12-30 Thread Denny Lee
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the spark.executor.uri within spark-env.sh (and directly within bash as well), the Mesos slaves do not seem to be able to access the spark tgz file via HTTP or HDFS as per the message below. 14/12/30 15:57:35 INFO SparkILoop:

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
A follow up on the hive-site.xml, if you 1. Specify it in spark/conf, then you can NOT apply it via the --driver-class-path option, otherwise, you will get the following exceptions when initializing SparkContext. org.apache.spark.SparkException: Found both spark.driver.extraClassPath and

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
Hi All, I have tried to pass the properties via SparkContext.setLocalProperty and HiveContext.setConf; both failed. Based on the results (I haven't had a chance to look into the code yet), HiveContext will try to initiate the JDBC connection right away, so I couldn't set other properties dynamica

Re: S3 files , Spark job hungsup

2014-12-23 Thread Denny Lee
You should be able to kill the job using the webUI or via spark-class. More info can be found in the thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html. HTH! On Tue, Dec 23, 2014 at 4:47 PM, durga wrote: > Hi All , > > It se

Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
Sorry Ted! I saw profile (-P) but missed the -D. My bad! On Fri, Dec 19, 2014 at 16:46 Ted Yu wrote: > Here is the command I used: > > mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 > -Dhadoop.version=2.6.0 -Phive -DskipTests > > FYI > > On Fri, Dec 19, 2014 at 4

Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
To clarify, there isn't a Hadoop 2.6 profile per se, but you can build using the -Phadoop-2.4 profile with -Dhadoop.version=2.6.0, which works with Hadoop 2.6. On Fri, Dec 19, 2014 at 12:55 Ted Yu wrote: > You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0 > > Cheers > > On Fri, Dec 19, 2014 at 12:51 PM, sa wrote: >

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
u suggest I run to test this? But more importantly, what > information would this give me? > > On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee wrote: >> >> Oh, it makes sense of gsutil scans through this quickly, but I was >> wondering if running a Hadoop job / bdutil would res

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
. See the > following. > > alex@hadoop-m:~/split$ time bash -c "gsutil ls > gs://my-bucket/20141205/csv/*/*/* | wc -l" > > 6860 > > real0m6.971s > user0m1.052s > sys 0m0.096s > > Alex > > > On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee wro

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS. On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta wrote: > All, > > I'm using the Spark shell to interact w

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
or and not a runtime > error -- I believe c is an array of values so I think you want > tabs.map(c => (c(167), c(110), c(200)) instead of tabs.map(c => (c._(167), > c._(110), c._(200)) > > > > On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee wrote: >> >> Yes - that work

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Yes - that works great! Sorry for implying I couldn't. Was just more flummoxed that I couldn't make the Scala call work on its own. Will continue to debug ;-) On Sun, Dec 14, 2014 at 11:39 Michael Armbrust wrote: > BTW, I cannot use SparkSQL / case right now because my table has 200 >> columns (a

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
ns looks like > the way to go given the context. What's not working? > > Kr, Gerard > On Dec 14, 2014 5:17 PM, "Denny Lee" wrote: > >> I have a large of files within HDFS that I would like to do a group by >> statement ala >> >> val table = sc

Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by statement ala val table = sc.textFile("hdfs://") val tabs = table.map(_.split("\t")) I'm trying to do something similar to tabs.map(c => (c._(167), c._(110), c._(200)) where I create a new RDD that only has but that isn't
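
Putting the fix from the reply earlier in this thread together with the original snippet: the split produces an Array, whose elements are accessed with c(i), not the tuple-style c._(i).

    val table = sc.textFile("hdfs://...")        // path elided as in the original
    val tabs  = table.map(_.split("\t"))
    // select just the three columns of interest by index
    val subset = tabs.map(c => (c(167), c(110), c(200)))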

Re: Spark SQL Roadmap?

2014-12-13 Thread Denny Lee
Hi Xiaoyong, SparkSQL has already been released and has been part of the Spark code-base since Spark 1.0. The latest stable release is Spark 1.1 (here's the Spark SQL Programming Guide ) and we're currently voting on Spark 1.2. Hive
