Executor errors out connecting to external shuffle service when using dynamic allocation

2016-10-07 Thread Manoj Samel
Resending with a clearer subject. Any feedback ? On Tue, Oct 4, 2016 at 4:43 PM, Manoj Samel wrote: > Hi, > > On a secure hadoop cluster, spark shuffle is enabled (spark 1.6.0, shuffle > jar is spark-1.6.0-yarn-shuffle.jar). A client connecting using > spark-assembly_2.11

Any issues if spark 1.6.1 client connects to spark 1.6.0 external shuffle services

2016-10-04 Thread Manoj Samel
Hi, On a secure hadoop cluster, spark shuffle is enabled (spark 1.6.0, shuffle jar is spark-1.6.0-yarn-shuffle.jar). A client connecting using spark-assembly_2.11-1.6.1.jar gets errors starting executors, with the following trace. Could this be due to a Spark version mismatch ? Any thoughts ? Thanks i
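For reference, a minimal sketch of the configuration such a setup typically needs (property and class names as documented for Spark on YARN; the shuffle jar must already be on every NodeManager classpath):

  # spark-defaults.conf
  spark.shuffle.service.enabled        true
  spark.dynamicAllocation.enabled      true
  # yarn-site.xml on each NodeManager
  #   yarn.nodemanager.aux-services                       mapreduce_shuffle,spark_shuffle
  #   yarn.nodemanager.aux-services.spark_shuffle.class   org.apache.spark.network.yarn.YarnShuffleService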

Spark 1.4 - memory bloat in group by/aggregate???

2015-06-26 Thread Manoj Samel
Hi, - Spark 1.4 on a single-node machine. Run spark-shell - Reading from a Parquet file with a bunch of text columns and a couple of amounts in decimal(14,4). On-disk size of the file is 376M. It has ~100 million rows - rdd1 = sqlcontext.read.parquet - rdd1.cache - group_by_df =
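A rough reconstruction of the steps described, for anyone trying to reproduce (paths and column names are placeholders):

  val rdd1 = sqlContext.read.parquet("/path/to/file.parquet")  // ~100M rows, decimal(14,4) amounts
  rdd1.cache()
  rdd1.registerTempTable("t")
  val group_by_df = sqlContext.sql("select key, sum(amt1), sum(amt2) from t group by key")
  group_by_df.count()  // materializes the aggregation; executor memory grows here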

Spark 1.3 saveAsTextFile with codec gives error - works with Spark 1.2

2015-04-15 Thread Manoj Samel
Env - Spark 1.3, Hadoop 2.3, Kerberos. xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2 in the same environment. val codec = classOf[] val a = sc.textFile("/some_hdfs_file") a.saveAsTextFile("/some_other_hdfs_file", codec) fails with following trace in Spark 1.3, works i
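The codec class name is cut off above; purely as an illustration of the call shape (GzipCodec is an assumption, not necessarily the codec used), the Spark 1.2 form that worked looks like:

  import org.apache.hadoop.io.compress.GzipCodec
  val codec = classOf[GzipCodec]
  val a = sc.textFile("/some_hdfs_file")
  a.saveAsTextFile("/some_other_hdfs_file", codec)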

spark-assembly-1.3.0-hadoop2.3.0.jar has unsigned entries - org/apache/spark/SparkHadoopWriter$.class

2015-04-14 Thread Manoj Samel
With Spark 1.3, xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2. Config is CDH 5.3.0 (Hadoop 2.3) with Kerberos. 15/04/14 18:06:15 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 2.0 (TID 17) on executor node1078.svc.devpg.pdx.wd: java.lang.SecurityException (JC

Re: How to specify the port for AM Actor ...

2015-04-01 Thread Manoj Samel
Filed https://issues.apache.org/jira/browse/SPARK-6653 On Sun, Mar 29, 2015 at 8:18 PM, Shixiong Zhu wrote: > LGTM. Could you open a JIRA and send a PR? Thanks. > > Best Regards, > Shixiong Zhu > > 2015-03-28 7:14 GMT+08:00 Manoj Samel : > >> I looked @ the 1.3.0 code

Re: How to specify the port for AM Actor ...

2015-03-27 Thread Manoj Samel
hts? Any other place where any change is needed? On Wed, Mar 25, 2015 at 4:44 PM, Shixiong Zhu wrote: > There is no configuration for it now. > > Best Regards, > Shixiong Zhu > > 2015-03-26 7:13 GMT+08:00 Manoj Samel : > >> There may be firewall rules limiting the p

Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Manoj Samel
While looking into an issue, I noticed that the source displayed on the GitHub site does not match the downloaded tar for 1.3. Thoughts ?

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
s, since multiple AMs can run in > the same machine. Why do you need a fixed port? > > Best Regards, > Shixiong Zhu > > 2015-03-26 6:49 GMT+08:00 Manoj Samel : > >> Spark 1.3, Hadoop 2.5, Kerberos >> >> When running spark-shell in yarn client mode, it shows fol

How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
Spark 1.3, Hadoop 2.5, Kerberos. When running spark-shell in yarn client mode, it shows the following message with a random port every time (44071 in the example below). Is there a way to pin it to a specific port ? It does not seem to be among the ports specified in http://spark.apache.org/docs/l
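For later readers: the JIRA filed in the follow-up above (SPARK-6653) led to a spark.yarn.am.port property in subsequent releases; on a version that has it, the port can be pinned in spark-defaults.conf (the value below is only an example):

  spark.yarn.am.port 44000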

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: "e04"

2015-03-24 Thread Manoj Samel
Thanks Marcelo - I was using the SBT-built Spark per the earlier thread. I have now switched to the distro (with the conf changes for the CDH path in front) and the guava issue is gone. Thanks, On Tue, Mar 24, 2015 at 1:50 PM, Marcelo Vanzin wrote: > Hi there, > > On Tue, Mar 24, 2015 at 1:40 PM, Ma

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: "e04"

2015-03-24 Thread Manoj Samel
op 2.5.0, one of which addresses this parsing trouble). > > You do not require to recompile Spark, just alter its hadoop libraries in > its classpath to be that of CDH server version (overwrite from parcels, > etc.). > > On Wed, Mar 25, 2015 at 1:06 AM, Manoj Samel > wrote: > >&

Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Manoj Samel
http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list Hadoop 2.5 in the Hadoop version table. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (CDH 5.3.2). Thanks,

Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: "e04"

2015-03-23 Thread Manoj Samel
Spark 1.3, CDH 5.3.2, Kerberos. Setup works fine with the base configuration; spark-shell can be used in yarn client mode etc. When the work-preserving recovery feature is enabled via http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_ha_yarn_work_preserving_recovery.html, the spark-s

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) On Mon, Mar 23, 2015 at 2:25 PM, Marcelo Vanzin wrote: > On Mon, Mar 23, 2015 at 2:15 PM, Manoj Samel > wrote: > > Found the issue above error - the setting for spark_shuffle was > incomplete. > > > > Now it is abl

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
at 6:51 AM, Ted Yu wrote: > bq. Requesting 1 new executor(s) because tasks are backlogged > > 1 executor was requested. > > Which hadoop release are you using ? > > Can you check resource manager log to see if there is some clue ? > > Thanks > > On Fri, Mar 20,

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-20 Thread Manoj Samel
Forgot to add - the cluster is idle otherwise so there should be no resource issues. Also the configuration works when not using Dynamic allocation. On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel wrote: > Hi, > > Running Spark 1.3 with secured Hadoop. > > Spark-shell with Yarn c

Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-20 Thread Manoj Samel
Hi, Running Spark 1.3 with secured Hadoop. Spark-shell with Yarn client mode runs without issue when not using Dynamic Allocation. When Dynamic Allocation is turned on, the shell comes up but the same SQL etc. causes it to loop. spark.dynamicAllocation.enabled=true spark.dynamicAllocation.initialEx
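The property list is truncated above; a minimal sketch of the settings this mode needs (values are examples, and the spark_shuffle aux-service must also be registered on the NodeManagers):

  spark.dynamicAllocation.enabled            true
  spark.dynamicAllocation.initialExecutors   2
  spark.dynamicAllocation.minExecutors       1
  spark.dynamicAllocation.maxExecutors       10
  spark.shuffle.service.enabled              true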

Dataframe v/s SparkSQL

2015-03-02 Thread Manoj Samel
Is it correct to say that the Spark DataFrame APIs are implemented using the same execution engine as SparkSQL ? In other words, while the DataFrame API is different from SparkSQL, the runtime performance of equivalent constructs in DataFrame and SparkSQL should be the same. So one should be able to choose whichever
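One way to check this on a given build is to compare the plans Catalyst produces for the two forms; table and column names below are placeholders:

  import org.apache.spark.sql.functions.sum
  val df = sqlContext.table("t")
  df.groupBy("a").agg(sum("b")).explain(true)                          // DataFrame API
  sqlContext.sql("select a, sum(b) from t group by a").explain(true)   // SQL; should yield an equivalent physical plan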

New ColumnType For Decimal Caching

2015-02-13 Thread Manoj Samel
la/org/apache/spark/sql/columnar/ColumnType.scala> > . > > PRs welcome :) > > On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel > wrote: > >> Hi Michael, >> >> As a test, I have same data loaded as another parquet - except with the 2 >> decimal(14,4) replaced

Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Manoj Samel
d...@spark.apache.org mentioned on http://spark.apache.org/community.html seems to be bouncing. Is there another one ?

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
store in-memory decimal in some form of long with decoration ? For the immediate future, is there any hook that we can use to provide custom caching / processing for the decimal type in the RDD so other semantics do not change ? Thanks, On Mon, Feb 9, 2015 at 2:41 PM, Manoj Samel wrote: > Co

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
in the in-memory columnar > storage, so you are paying expensive serialization there likely. > > On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel > wrote: > >> Flat data of types String, Int and couple of decimal(14,4) >> >> On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Flat data of types String, Int and couple of decimal(14,4) On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust wrote: > Is this nested data or flat data? > > On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel > wrote: > >> Hi Michael, >> >> The storage tab shows th

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
uffers in addition to reading the data off of > the disk. > > On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel > wrote: > >> Spark 1.2 >> >> Data stored in parquet table (large number of rows) >> >> Test 1 >> >> select a, sum(b), sum(c) from ta

SQL group by on Parquet table slower when table cached

2015-02-06 Thread Manoj Samel
Spark 1.2 Data stored in parquet table (large number of rows) Test 1 select a, sum(b), sum(c) from table Test sqlContext.cacheTable() select a, sum(b), sum(c) from table - "seed cache" First time slow since loading cache ? select a, sum(b), sum(c) from table - Second time it should be faster
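For context, the pattern being timed looks roughly like this (table name is a placeholder); the cache is built lazily, so the first query after cacheTable pays the load cost:

  sqlContext.cacheTable("t")
  sqlContext.sql("select a, sum(b), sum(c) from t group by a").collect()  // builds the in-memory columnar cache
  sqlContext.sql("select a, sum(b), sum(c) from t group by a").collect()  // should now hit the cache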

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
ility Thanks On Wed, Feb 4, 2015 at 4:09 PM, Manoj Samel wrote: > Awesome ! By setting this, I could minimize the collect overhead, e.g by > setting it to # of partitions of the RDD. > > Two questions > > 1. I had looked for such option in > http://spark.apache.org/docs/latest/c
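The option is not named in the preview, but it is most likely spark.sql.shuffle.partitions (default 200), which sets the number of post-shuffle tasks for SQL aggregations; a sketch of matching it to the 2 cached partitions:

  sqlContext.setConf("spark.sql.shuffle.partitions", "2")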

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
, 2015 at 12:38 PM, Manoj Samel wrote: > Spark 1.2 > Data is read from parquet with 2 partitions and is cached as table with 2 > partitions. Verified in UI that it shows RDD with 2 partitions & it is > fully cached in memory > > Cached data contains column a, b, c. Column a ha

Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Spark 1.2 Data is read from parquet with 2 partitions and is cached as table with 2 partitions. Verified in UI that it shows RDD with 2 partitions & it is fully cached in memory Cached data contains column a, b, c. Column a has ~150 distinct values. Next run SQL on this table as "select a, sum(b)

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-03 Thread Manoj Samel
Hi, Any thoughts ? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel wrote: > Spark 1.2 > > SchemaRDD has schema with decimal columns created like > > x1 = new StructField("a", DecimalType(14,4), true) > > x2 = new StructField("b", DecimalType(14,4)

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
decimal So it seems schemaRDD.coalesce returns an RDD whose schema does not match the source RDD's, in that the decimal type seems to get changed. Any thoughts ? Is this a bug ??? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel wrote: > Spark 1.2 > > SchemaRDD has schema with decimal column

Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
Spark 1.2 SchemaRDD has schema with decimal columns created like x1 = new StructField("a", DecimalType(14,4), true) x2 = new StructField("b", DecimalType(14,4), true) Registering as a SQL temp table and doing SQL queries on these columns, including SUM etc., works fine, so the schema Decimal does

Why is DecimalType separate from DataType ?

2015-01-30 Thread Manoj Samel
Spark 1.2 While building schemaRDD using StructType xxx = new StructField("credit_amount", DecimalType, true) gives error "type mismatch; found : org.apache.spark.sql.catalyst.types.DecimalType.type required: org.apache.spark.sql.catalyst.types.DataType" From https://spark.apache.org/docs/1.2.0/
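The mismatch is because the bare DecimalType companion object is not itself a DataType in 1.2; constructing it with a precision and scale, as in the thread above, compiles:

  val xxx = new StructField("credit_amount", DecimalType(14, 4), true)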

schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Manoj Samel
Spark 1.2 on Hadoop 2.3. Read one big CSV file, create a schemaRDD on it and saveAsParquetFile. It creates a large number of small (~1MB) parquet part-x- files. Any way to control this so that a smaller number of larger files is created ? Thanks,
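One way to get fewer, larger files is to reduce the number of partitions before writing; a minimal sketch (the target count is an example, and note the adjacent thread reporting a decimal schema issue with coalesce on 1.2):

  val fewer = schemaRDD.coalesce(16)
  fewer.saveAsParquetFile("/some_hdfs_path")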

SparkSQL Performance Tuning Options

2015-01-27 Thread Manoj Samel
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db. The use case is a Spark YARN app that starts and serves as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some result sets, hash maps etc. that would be lik

Re: spark 1.2 - Writing parque fails for timestamp with "Unsupported datatype TimestampType"

2015-01-26 Thread Manoj Samel
Awesome ! That would be great !! On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust wrote: > I'm aiming for 1.3. > > On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel > wrote: > >> Thanks Michael. I am sure there have been many requests for this support. >>

Re: spark 1.2 - Writing parque fails for timestamp with "Unsupported datatype TimestampType"

2015-01-26 Thread Manoj Samel
> > However, there is a PR to add support using parquets INT96 type: > https://github.com/apache/spark/pull/3820 > > On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel > wrote: > >> Looking further at the trace and ParquetTypes.scala, it seems there is no >> support for

Re: spark 1.2 - Writing parque fails for timestamp with "Unsupported datatype TimestampType"

2015-01-23 Thread Manoj Samel
/LogicalTypes.md), any reason why Date / Timestamp are not supported right now ? Thanks, Manoj On Fri, Jan 23, 2015 at 11:40 AM, Manoj Samel wrote: > Using Spark 1.2 > > Read a CSV file, apply schema to convert to SchemaRDD and then > schemaRdd.saveAsParquetFile > > If t

spark 1.2 - Writing parque fails for timestamp with "Unsupported datatype TimestampType"

2015-01-23 Thread Manoj Samel
Using Spark 1.2. Read a CSV file, apply a schema to convert to a SchemaRDD and then schemaRdd.saveAsParquetFile. If the schema includes TimestampType, it gives the following trace when doing the save: Exception in thread "main" java.lang.RuntimeException: Unsupported datatype TimestampType at scala.sys.pa
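Until the INT96 support mentioned in the replies landed, a common workaround was to store the timestamp as an epoch value and convert on read; a rough sketch under that assumption (the row shape here is hypothetical):

  import java.sql.Timestamp
  case class EventAsLong(id: String, tsMillis: Long)   // epoch millis instead of Timestamp
  // events: RDD[(String, Timestamp)] is assumed; adapt to the real schema
  val converted = events.map { case (id, ts) => EventAsLong(id, ts.getTime) }
  sqlContext.createSchemaRDD(converted).saveAsParquetFile("/some_hdfs_path")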

Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Manoj Samel
Hi, Setup is as follows: Hadoop cluster 2.3.0 (CDH 5.0) - Namenode HA - Resource Manager HA - secured with Kerberos; Spark 1.2. Run SparkPi as follows - conf/spark-defaults.conf has the following entries: spark.yarn.queue myqueue spark.yarn.access.namenodes hdfs://namespace (remember this is namenode HA
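For completeness, a sketch of the submission being described, assuming a kinit has already been done against the cluster KDC (jar name and argument are examples):

  spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi \
    lib/spark-examples-1.2.0-hadoop2.3.0.jar 10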

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
're logged in (i.e. you've run kinit), everything should > just work. You can run "klist" to make sure you're logged in. > > On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel > wrote: > > Hi, > > > > For running spark 1.2 on Hadoop cluster with Kerberos, what

Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Hi, For running spark 1.2 on Hadoop cluster with Kerberos, what spark configurations are required? Using existing keytab, can any examples be submitted to the secured cluster ? How? Thanks,

Cannot see RDDs in Spark UI

2015-01-06 Thread Manoj Samel
Hi, I create a bunch of RDDs, including schema RDDs. When I run the program and go to the UI on xxx:4040, the storage tab does not show any RDDs. Spark version is 1.1.1 (Hadoop 2.3). Any thoughts? Thanks,

Sharing sqlContext between Akka router and "routee" actors ...

2014-12-18 Thread Manoj Samel
Hi, An Akka router creates a sqlContext and creates a bunch of "routee" actors with the sqlContext as a parameter. The actors then execute queries on that sqlContext. Would this pattern be an issue ? Any other way sparkContext etc. should be shared cleanly in Akka routers/routees ? Thanks,

Re: Spark Server - How to implement

2014-12-12 Thread Manoj Samel
your needs. > > > > We've been playing with something like that inside Hive, though: > > > > On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel > wrote: > >> Hi, > >> > >> If spark based services are to be exposed as a continuously availab

Spark Server - How to implement

2014-12-11 Thread Manoj Samel
Hi, If Spark-based services are to be exposed as a continuously available server, what are the options? * The API exposed to the client will be proprietary and fine-grained (RPC style ..), not a job-level API * The client API need not be SQL, so the Thrift JDBC server does not seem to be an option .. but

Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Manoj Samel
I am using SQLContext.jsonFile. If a valid JSON contains newlines, Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it works fine. Is this known? 14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 28) com.fasterxml.jackson.core.JsonParseException: Unexpected
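jsonFile splits the input by line, so each line must be a complete JSON document; a hedged workaround for multi-line JSON is to read whole files and hand the strings to jsonRDD (reasonable only when individual files are small, since each file becomes a single record):

  val whole = sc.wholeTextFiles("/path/to/json/dir").map(_._2)  // one string per file
  val schemaRdd = sqlContext.jsonRDD(whole)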

Can HiveContext be used without using Hive?

2014-12-09 Thread Manoj Samel
From the 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without having a Hive installation. The benefit is a richer SQL dialect. Is my understanding correct ? Thanks

Spark SQL - Any time line to move beyond Alpha version ?

2014-11-24 Thread Manoj Samel
Is there any timeline where Spark SQL goes beyond alpha version? Thanks,

Re: Spark resilience

2014-04-15 Thread Manoj Samel
a some sort of state checkpointing into a > globally visible storage system (e.g., HDFS), which, for example, Spark > Streaming already does. > > Currently, this feature is not supported in YARN or Mesos fine-grained > mode. > > > On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel

Re: Spark resilience

2014-04-14 Thread Manoj Samel
Could you please elaborate how drivers can be restarted automatically ? Thanks, On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson wrote: > Master and slave are somewhat overloaded terms in the Spark ecosystem (see > the glossary: > http://spark.apache.org/docs/latest/cluster-overview.html#gloss

Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Manoj Samel
ill need to include 'a in the > second parameter list (which is similar to the SELECT clause) as well if > you want it included in the output. > > > On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel wrote: > >> Hi, >> >> If I create a groupBy('a)(Sum('

Re: Error in SparkSQL Example

2014-03-31 Thread Manoj Samel
the expected type of the variable 'people'. Perhaps there is > a clearer way to indicate this. > > As you have realized, using the full line from the first example will > allow you to run the rest of them. > > > > On Sun, Mar 30, 2014 at 7:31 AM, Manoj Samel wrote:

Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
wncasting. > > > On Sun, Mar 30, 2014 at 7:56 AM, Manoj Samel wrote: > >> Hi, >> >> I am trying SparkSQL based on the example on doc ... >> >> >> >> val people = >> sc.textFile("/data/spark/examples/src/main/resources/pe

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Manoj Samel
> > > > On Sun, Mar 30, 2014 at 2:46 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> This is a great question. We are in the same position, having not >> invested in Hive yet and looking at various options for SQL-on-Hadoop. >> >> >

groupBy RDD does not have grouping column ?

2014-03-30 Thread Manoj Samel
Hi, If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the resulting RDD should have 'a, 'foo and 'bar. The result RDD just shows 'foo and 'bar and is missing 'a. Thoughts? Thanks, Manoj
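As the reply in the thread above notes, the grouping expression also has to appear in the second parameter list, which plays the role of the SELECT clause; in the same DSL:

  val grouped = people.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)  // 'a now appears in the output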

Re: SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
4 at 10:43 AM, smallmonkey...@hotmail.com < > smallmonkey...@hotmail.com> wrote: > >> can I get the whole operation? then i can try to locate the error >> >> -- >> smallmonkey...@hotmail.com >> >> *From:* Manoj Samel >>

SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Double works ... scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String) defined class JournalLine
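The workaround mentioned, switching the amount columns to Double, looks like the following; note that Double gives up exact decimal semantics:

  case class JournalLine(account: String, credit: Double, debit: Double, date: String,
    company: String, currency: String, costcenter: String, region: String)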

Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi, I am trying SparkSQL based on the example on doc ... val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) val olderThanTeans = people.where('age > 19) val youngerThanTeans = people.where('age < 13) val
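The reply in the thread above hints at the answer: RDD.union returns a plain RDD of rows, so either downcast the result or use SchemaRDD's own unionAll, which preserves the schema (method name per the 1.0-era API, so treat it as an assumption for this snapshot build):

  val nonTeens = olderThanTeans.unionAll(youngerThanTeans)  // stays a SchemaRDD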

Error in SparkSQL Example

2014-03-30 Thread Manoj Samel
Hi, On http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html, I am trying to run the code under "Writing Language-Integrated Relational Queries" (I have the 1.0.0 snapshot). I am running into an error on val people: RDD[Person] // An RDD of case class objects, from the first example.

SQL on Spark - Shark or SparkSQL

2014-03-29 Thread Manoj Samel
Hi, In the context of the recent Spark SQL announcement ( http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html ): if there is no existing investment in Hive/Shark, would it be worth starting new SQL work using SparkSQL rather than Shark ? * It seems Shark S