Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Pala M Muthaia
<yh...@databricks.com> wrote: > Hi Pala, > > Can you add the full stacktrace of the exception? For now, can you use > create temporary function to work around the issue? > > Thanks, > > Yin > > On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia < > mchett...@rocke
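A minimal sketch of the suggested workaround, assuming the UDF implementation class is already on the driver and executor classpath; the class name com.example.MyUdf, the function name my_udf, and the table/column names below are placeholders, not the internal UDF from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("udf-workaround"))
val hiveContext = new HiveContext(sc)

// Register the UDF for this session only, instead of relying on the
// permanent function registered in the Hive metastore.
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")

// The temporary function is now usable in HiveQL queries.
hiveContext.sql("SELECT my_udf(col1) FROM my_table LIMIT 10").collect().foreach(println)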

Re: Hive permanent functions are not available in Spark SQL

2015-09-30 Thread Pala M Muthaia
+user list On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote: > Hi, > > I am trying to use internal UDFs that we have added as permanent functions > to Hive, from within a Spark SQL query (using HiveContext), but I encounter > NoSuc

LogisticRegressionWithLBFGS with large feature set

2015-05-14 Thread Pala M Muthaia
Hi, I am trying to validate our modeling data pipeline by running LogisticRegressionWithLBFGS on a dataset with ~3.7 million features, basically to compute AUC. This is on Spark 1.3.0. I am using 128 executors with 4 GB each + driver with 8 GB. The number of data partitions is 3072. The
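For reference, a minimal sketch of this kind of AUC check using the Spark 1.3 MLlib API; the input path is a placeholder and the data is assumed to be in LIBSVM format with sparse feature vectors:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("lr-auc"))

// Load sparse labeled points; with ~3.7M features the vectors must stay sparse.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training/data")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// Train a binary logistic regression model with L-BFGS.
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train.cache())

// Clear the 0.5 threshold so predict() returns raw scores, which AUC needs.
model.clearThreshold()
val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
println(s"AUC = $auc")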

Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
[INFO] | +- org.mindrot:jbcrypt:jar:0.3m:compile [INFO] | +- org.apache.thrift:libthrift:jar:0.7.0:compile [INFO] | | \- javax.servlet:servlet-api:jar:2.5:compile FYI On Thu, Mar 26, 2015 at 3:36 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, We are trying to build Spark

Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Pala M Muthaia
Hi, We are trying to build Spark 1.2 from source (tip of branch-1.2 at the moment). I tried to build Spark using the following command: mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package. I encountered various missing class definition

Re: Issues with constants in Spark HiveQL queries

2015-01-28 Thread Pala M Muthaia
stacktrace do you get? Sent on the new Sprint Network from my Samsung Galaxy S®4. Original message From: Pala M Muthaia Date: 01/19/2015 8:26 PM (GMT-05:00) To: Yana Kadiyska Cc: Cheng, Hao, user@spark.apache.org Subject: Re: Issues with constants in Spark HiveQL queries Yes we

Re: Issues with constants in Spark HiveQL queries

2015-01-19 Thread Pala M Muthaia
, January 14, 2015 11:12 PM To: Pala M Muthaia Cc: user@spark.apache.org Subject: Re: Issues with constants in Spark HiveQL queries Just a guess, but what is the type of conversion_aciton_id? I do queries over an epoch all the time with no issues (where epoch's type is bigint). You can

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Pala M Muthaia
a string it should be conversion_aciton_id='20141210' (single quotes around the string) On Tue, Jan 13, 2015 at 5:25 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, We are testing Spark SQL-Hive QL, on Spark 1.2.0. We have run some simple queries successfully, but we hit
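To make the quoting rule concrete, a small sketch (the table name events is a placeholder); a bare numeric constant is only valid when the column is numeric (e.g. bigint), while a string column needs single quotes:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hiveql-constants"))
val hiveContext = new HiveContext(sc)

// If conversion_aciton_id is a bigint, a bare numeric constant is valid:
hiveContext.sql(
  "SELECT count(*) FROM events WHERE conversion_aciton_id = 20141210"
).collect().foreach(println)

// If the column is a string, the constant must be single-quoted:
hiveContext.sql(
  "SELECT count(*) FROM events WHERE conversion_aciton_id = '20141210'"
).collect().foreach(println)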

Re: OOM exception during row deserialization

2015-01-12 Thread Pala M Muthaia
Does anybody have insight on this? Thanks. On Fri, Jan 9, 2015 at 6:30 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during a join step. Basically, I have an RDD of rows that I am joining with another RDD

Broadcast joins on RDD

2015-01-12 Thread Pala M Muthaia
Hi, How do I do a broadcast/map join on RDDs? I have a large RDD that I want to inner join with a small RDD. Instead of having the large RDD repartitioned and shuffled for the join, I would rather send a copy of the small RDD to each task and then perform the join locally. How would I specify this in
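One common way to do this (a sketch, not necessarily the approach settled on in this thread): collect the small RDD to the driver, broadcast it as a map, and perform the inner join with mapPartitions on the large RDD so no shuffle is needed; the data below is purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions on Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("broadcast-join"))

// Illustrative data: a large keyed RDD and a small keyed RDD.
val largeRdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val smallRdd = sc.parallelize(Seq((1, "x"), (3, "y")))

// Ship the small side to every executor once, as a lookup map.
val smallMap = sc.broadcast(smallRdd.collectAsMap())

// Map-side inner join: keep only the keys present in the broadcast map;
// the large RDD is never repartitioned or shuffled.
val joined = largeRdd.mapPartitions { iter =>
  val lookup = smallMap.value
  iter.flatMap { case (k, v) => lookup.get(k).map(w => (k, (v, w))) }
}

joined.collect().foreach(println)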

OOM exception during row deserialization

2015-01-09 Thread Pala M Muthaia
Hi, I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during a join step. Basically, I have an RDD of rows that I am joining with another RDD of tuples. Some of the tasks succeed, but a fair number fail with an OOM exception with the stack below. The stack belongs to the 'reducer'

Re: Executor memory

2014-12-16 Thread Pala M Muthaia
://spark.apache.org/docs/latest/configuration.html 0.6 of 4GB is about 2.3GB. The note there is important: you probably don't want to exceed the JVM old generation size with this parameter. On Tue, Dec 16, 2014 at 12:53 AM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, Running Spark
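For reference, a sketch of how the setting discussed here was typically wired up on Spark 1.x; the values are examples only, and this legacy knob was later superseded by unified memory management in newer Spark releases:

import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.x-era settings from this thread; the values are examples only.
val conf = new SparkConf()
  .setAppName("executor-memory-example")
  .set("spark.executor.memory", "4g")          // heap per executor
  .set("spark.storage.memoryFraction", "0.6")  // fraction of heap reserved for cached RDD blocks

val sc = new SparkContext(conf)
// Roughly 0.6 * 4 GB (minus a safety margin) is available for RDD storage;
// the rest is left for task execution and JVM overhead.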

Re: Lost executors

2014-11-20 Thread Pala M Muthaia
. On Tue, Nov 18, 2014 at 5:01 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Sandy, Good point - I forgot about NM logs. When I looked up the NM logs, I only saw the following statements that align with the driver-side log about the lost executor. Many executors show the same log

Lost executors

2014-11-18 Thread Pala M Muthaia
Hi, I am using Spark 1.0.1 on YARN 2.5, and doing everything through the Spark shell. I am running a job that essentially reads a bunch of HBase keys, looks up HBase data, and performs some filtering and aggregation. The job works fine on smaller datasets, but when I try to execute on the full

Re: Lost executors

2014-11-18 Thread Pala M Muthaia
...@cloudera.com wrote: Hi Pala, Do you have access to your YARN NodeManager logs? Are you able to check whether they report killing any containers for exceeding memory limits? -Sandy On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I am using Spark
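If the NodeManager logs do show containers being killed for exceeding memory limits, one mitigation commonly used on Spark-on-YARN in that era (not necessarily the resolution of this thread) is to raise the per-executor memory overhead; a sketch with example values only:

import org.apache.spark.{SparkConf, SparkContext}

// Example values only; the right overhead depends on off-heap usage.
val conf = new SparkConf()
  .setAppName("lost-executors-example")
  .set("spark.executor.memory", "4g")
  // Extra memory (in MB) that YARN reserves per executor container beyond
  // the JVM heap, for thread stacks, off-heap buffers, and other overhead.
  .set("spark.yarn.executor.memoryOverhead", "768")

val sc = new SparkContext(conf)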

Re: Assigning input files to spark partitions

2014-11-17 Thread Pala M Muthaia
block size parameter? On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: No, I don't want separate RDDs because each of these partitions is being processed the same way (in my

Re: Assigning input files to spark partitions

2014-11-13 Thread Pala M Muthaia
them as separate RDDs. On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I have a set of input files for a Spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files

Assigning input files to spark partitions

2014-11-12 Thread Pala M Muthaia
Hi, I have a set of input files for a Spark program, with each file corresponding to a logical data partition. What is the API/mechanism to assign each input file (or a set of files) to a Spark partition when initializing RDDs? When I create a Spark RDD pointing to the directory of files, my
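Two approaches that are commonly suggested for this (a sketch, not necessarily what this thread converged on); the paths and the per-file processing below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("file-per-partition"))

// Option 1: each file becomes a single (path, content) record, so a file is
// never split across Spark partitions (fine when each file fits in memory).
val byFile = sc.wholeTextFiles("hdfs:///path/to/input/dir")
val perFileCounts = byFile.map { case (path, content) =>
  (path, content.split("\n").length)  // placeholder per-file processing
}

// Option 2: one single-partition RDD per file, unioned, so each logical data
// partition maps to exactly one Spark partition.
val paths = Seq("hdfs:///path/f1", "hdfs:///path/f2", "hdfs:///path/f3")
val perFile: RDD[String] = sc.union(paths.map(p => sc.textFile(p, 1)))

perFileCounts.collect().foreach(println)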

Re: Cannot instantiate hive context

2014-11-03 Thread Pala M Muthaia
/org.apache.thrift/libthrift/0.9.0 in the classpath would resolve this issue. Thanks Best Regards On Sat, Nov 1, 2014 at 12:34 AM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, I am trying to load Hive datasets using HiveContext, in the Spark shell. Spark ver 1.0.1 and Hive ver 0.12