Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Yana Kadiyska
One thing to note, if you are using Mesos, is that the minimum supported Mesos
version changed from 0.21 to 1.0.0. So taking a newer Spark might push you into
larger infrastructure upgrades.

On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D 
wrote:

> Hello All,
>
> Currently our batch ETL jobs are on Spark 1.6.0, and we are planning to
> upgrade to Spark 2.1.0.
>
> With minor code changes (like configuration and SparkSession.sc) we were able
> to execute the existing jobs on Spark 2.1.0.
>
> But we noticed that job completion times are much better in Spark 1.6.0 than
> in Spark 2.1.0.
>
> For instance, Job A completed in 50s in Spark 1.6.0.
>
> With the same input, Job A completed in 1.5 mins in Spark 2.1.0.
>
> Are there any specific factors that need to be considered when switching to
> Spark 2.1.0 from Spark 1.6.0?
>
>
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
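
For reference, the "minor code changes" mentioned above typically amount to replacing the 1.6-era SQLContext/HiveContext with a SparkSession. A minimal sketch, assuming Spark 2.1 APIs (the app name and input path are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.x entry point that replaces SQLContext/HiveContext from 1.6
val spark = SparkSession.builder()
  .appName("batch-etl")            // hypothetical name
  .enableHiveSupport()             // only needed if Hive tables are used
  .getOrCreate()

val sc = spark.sparkContext              // the old SparkContext is still reachable
val df = spark.read.parquet("/path/in")  // hypothetical input path
```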


HiveThriftserver does not seem to respect partitions

2017-09-13 Thread Yana Kadiyska
Hi folks, I have created a table in the following manner:

CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
  list of columns
)
COMMENT 'User Information'
PARTITIONED BY (account_id String,
                product String,
                group_id String,
                year String,
                month String,
                day String)
STORED AS PARQUET
LOCATION '/stream2store/nt_tp_collation'


I then ran MSCK REPAIR TABLE to generate the partition information.


I think partitions got generated correctly -- here is a query and its
output:


"show table extended like 'rum_beacon_partition'
partition(account_id='',product='rum',group_id='',year='2017',month='09',day='12')


 location:ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation
/account_id=/product=rum/group_id= /year=2017/month=09/day=12

However, it does appear that when I issue a SQL query, the predicates do
not correctly limit the files touched:

explain extended select uri from rum_beacon_partition where
account_id='' and product='rum' and group_id='' and year='2017' and
month='09' limit 2

Produces output that seems to indicate that every file is being touched
(unless I'm misreading the output). It also crashes my filesystem so I
suspect there is some truth to it.

Optimized logical plan looks fine I think:

== Optimized Logical Plan ==
Limit 2
 Project [uri#16519]
  Filter (((((account_id#16511 = ) && (product#16512 = rum)) && (group_id#16513 = )) && (year#16514 = 2017)) && (month#16515 = 09))


But in the physical plan it seems that a ton of files are touched (both in
account and date partitions)

Scan
ParquetRelation[ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=16,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=17,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=18,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=19,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=20,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=21,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=22,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=23,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=24,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=25,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=26,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=27,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=28,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=29,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=30,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=05/day=31,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=06/day=01,ceph://ceph-mon.service.consul:6789/stream2store/nt_tp_collation/account_id=/product=rum/group_id=/year=2017/month=06/day=02


I am hoping someone can offer debugging tips / advice on what to look for
in the logs. I'm on a pretty old version of Spark (1.5.2) but this seems
like something that I'm doing wrong.
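
One hedged thing to try (a sketch, not a confirmed fix for this report): Spark 1.5 introduced a setting that pushes partition predicates down to the Hive metastore instead of listing every partition up front, so it may be worth toggling it from the same context that serves the query and re-checking the plan:

```scala
// Assumes spark.sql.hive.metastorePartitionPruning exists in this build
// (introduced around Spark 1.5, default false).
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.sql(
  """select uri from rum_beacon_partition
    |where account_id='' and product='rum' and group_id=''
    |  and year='2017' and month='09' limit 2""".stripMargin).explain(true)
```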


Trouble with Thriftserver with hsqldb (Spark 2.1.0)

2017-03-06 Thread Yana Kadiyska
Hi folks, trying to run Spark 2.1.0 thrift server against an hsqldb file
and it seems to...hang.

I am starting thrift server with:

sbin/start-thriftserver.sh --driver-class-path ./conf/hsqldb-2.3.4.jar (a
completely local setup).


hive-site.xml is like this:

<?xml version="1.0"?>
<configuration>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:hsqldb:file:/tmp/hive-metastore</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.hsqldb.jdbc.JDBCDriver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>SA</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>

  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>

</configuration>


I have turned on logging with

log4j.category.DataNucleus=ALL

log4j.logger.org.apache.spark.sql.hive.thriftserver=ALL

and my spark log seems stuck at:

17/03/06 16:42:46 DEBUG Schema: Schema Transaction started with connection
"com.jolbox.bonecp.ConnectionHandle@44fdce3c" with isolation "serializable"

17/03/06 16:42:46 DEBUG Schema: Check of existence of CDS returned no table

17/03/06 16:42:46 DEBUG Schema: Creating table CDS

17/03/06 16:42:46 DEBUG Schema: CREATE TABLE CDS

(

CD_ID BIGINT NOT NULL,

CONSTRAINT CDS_PK PRIMARY KEY (CD_ID)

)


Hoping someone can suggest what other logging I can turn on or what a
possible issue might be (nothing is listening on port 1 at this point;
the java process did open 4040).


[Thriftserver2] Controlling number of tasks

2016-08-03 Thread Yana Kadiyska
Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When
spark reads these files, I end up with 315K tasks for a dataframe reading a
few days worth of data.

I know that with a regular Spark job I can use coalesce to get to a lower
number of tasks (a sketch follows below). Is there a way to tell
HiveThriftServer2 to coalesce? I have a line in hive-conf that says to use
CombinedInputFormat, but I'm not sure it's working.
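
For reference, a minimal sketch of the coalesce approach mentioned above as it would look in a regular job (the path and target partition count are hypothetical):

```scala
// Read the many small half-hourly files, then merge the resulting splits into
// far fewer partitions before exposing the data for querying.
val df = sqlContext.read.parquet("/data/halfhourly/*")  // hypothetical location
val compacted = df.coalesce(200)                        // hypothetical target count
compacted.registerTempTable("events")
```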

(Obviously having fewer, larger files is better, but I don't control the file
generation side of this.)

Tips much appreciated


Re: 101 question on external metastore

2016-01-14 Thread Yana Kadiyska
If you have a second, could you post the version of Derby that you installed,
the contents of hive-site.xml, and the command you use to run (along with the
Spark version)? I'd like to retry the installation.

On Thu, Jan 7, 2016 at 7:35 AM, Deenar Toraskar <deenar.toras...@gmail.com>
wrote:

> I sorted this out. There were 2 different version of derby and ensuring
> the metastore and spark used the same version of Derby made the problem go
> away.
>
> Deenar
>
> On 6 January 2016 at 02:55, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>
>> Deenar, I have not resolved this issue. Why do you think it's from
>> different versions of Derby? I was playing with this as a fun experiment
>> and my setup was on a clean machine -- no other versions of
>> hive/hadoop/etc...
>>
>> On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar <
>> deenar.toras...@gmail.com> wrote:
>>
>>> apparently it is down to different versions of derby in the classpath,
>>> but i am unsure where the other version is coming from. The setup worked
>>> perfectly with spark 1.3.1.
>>>
>>> Deenar
>>>
>>> On 20 December 2015 at 04:41, Deenar Toraskar <deenar.toras...@gmail.com
>>> > wrote:
>>>
>>>> Hi Yana/All
>>>>
>>>> I am getting the same exception. Did you make any progress?
>>>>
>>>> Deenar
>>>>
>>>> On 5 November 2015 at 17:32, Yana Kadiyska <yana.kadiy...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi folks, trying experiment with a minimal external metastore.
>>>>>
>>>>> I am following the instructions here:
>>>>> https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
>>>>>
>>>>> I grabbed Derby 10.12.1.1 and started an instance, verified I can
>>>>> connect via ij tool and that process is listening on 1527
>>>>>
>>>>> put the following hive-site.xml under conf
>>>>> ```
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>   javax.jdo.option.ConnectionURL
>>>>>   jdbc:derby://localhost:1527/metastore_db;create=true
>>>>>   JDBC connect string for a JDBC metastore
>>>>> 
>>>>> 
>>>>>   javax.jdo.option.ConnectionDriverName
>>>>>   org.apache.derby.jdbc.ClientDriver
>>>>>   Driver class name for a JDBC metastore
>>>>> 
>>>>> 
>>>>> ```
>>>>>
>>>>> I then try to run spark-shell thusly:
>>>>> bin/spark-shell --driver-class-path
>>>>> /home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar
>>>>>
>>>>> and I get an ugly stack trace like so...
>>>>>
>>>>> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
>>>>> org.apache.derby.jdbc.EmbeddedDriver
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>>> Method)
>>>>> at
>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>> at
>>>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>> at java.lang.Class.newInstance(Class.java:379)
>>>>> at
>>>>> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>>>>> at
>>>>> org.datanucleus.store.rdbms.connectionpool.DBCPConnectionPoolFactory.createConnectionPool(DBCPConnectionPoolFactory.java:50)
>>>>> at
>>>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>>>>> at
>>>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>>>>> at
>>>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>>>>> ... 114 more
>>>>>
>>>>> :10: error: not found: value sqlContext
>>>>>import sqlContext.implicits._
>>>>>
>>>>>
>>>>> What am I doing wrong -- not sure why it's looking for Embedded
>>>>> anything, I'm specifically trying to not use the embedded server...but I
>>>>> know my hive-site is being read as starting witout --driver-class-path 
>>>>> does
>>>>> say it can't load org.apache.derby.jdbc.ClientDriver
>>>>>
>>>>
>>>>
>>>
>>
>


Re: 101 question on external metastore

2016-01-05 Thread Yana Kadiyska
Deenar, I have not resolved this issue. Why do you think it's from
different versions of Derby? I was playing with this as a fun experiment
and my setup was on a clean machine -- no other versions of
hive/hadoop/etc...

On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar <deenar.toras...@gmail.com
> wrote:

> apparently it is down to different versions of derby in the classpath, but
> i am unsure where the other version is coming from. The setup worked
> perfectly with spark 1.3.1.
>
> Deenar
>
> On 20 December 2015 at 04:41, Deenar Toraskar <deenar.toras...@gmail.com>
> wrote:
>
>> Hi Yana/All
>>
>> I am getting the same exception. Did you make any progress?
>>
>> Deenar
>>
>> On 5 November 2015 at 17:32, Yana Kadiyska <yana.kadiy...@gmail.com>
>> wrote:
>>
>>> Hi folks, trying experiment with a minimal external metastore.
>>>
>>> I am following the instructions here:
>>> https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode
>>>
>>> I grabbed Derby 10.12.1.1 and started an instance, verified I can
>>> connect via ij tool and that process is listening on 1527
>>>
>>> put the following hive-site.xml under conf
>>> ```
>>> 
>>> 
>>> 
>>> 
>>>   javax.jdo.option.ConnectionURL
>>>   jdbc:derby://localhost:1527/metastore_db;create=true
>>>   JDBC connect string for a JDBC metastore
>>> 
>>> 
>>>   javax.jdo.option.ConnectionDriverName
>>>   org.apache.derby.jdbc.ClientDriver
>>>   Driver class name for a JDBC metastore
>>> 
>>> 
>>> ```
>>>
>>> I then try to run spark-shell thusly:
>>> bin/spark-shell --driver-class-path
>>> /home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar
>>>
>>> and I get an ugly stack trace like so...
>>>
>>> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
>>> org.apache.derby.jdbc.EmbeddedDriver
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at java.lang.Class.newInstance(Class.java:379)
>>> at
>>> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>>> at
>>> org.datanucleus.store.rdbms.connectionpool.DBCPConnectionPoolFactory.createConnectionPool(DBCPConnectionPoolFactory.java:50)
>>> at
>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>>> at
>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>>> at
>>> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>>> ... 114 more
>>>
>>> :10: error: not found: value sqlContext
>>>import sqlContext.implicits._
>>>
>>>
>>> What am I doing wrong -- not sure why it's looking for Embedded
>>> anything, I'm specifically trying to not use the embedded server...but I
>>> know my hive-site is being read as starting witout --driver-class-path does
>>> say it can't load org.apache.derby.jdbc.ClientDriver
>>>
>>
>>
>


Re: HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
Nope, in my case I see it pretty much as soon as the beeline client issues
the connect statement. Queries I've run so far are of the "show table"
variety and also "select count(*) from mytable" -- i.e. nothing that
serializes large amounts of data


On Thu, Nov 12, 2015 at 7:44 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> OOM can occur in any place; if most of the memory is used by some `defect`,
> the exception stack probably doesn’t show the real problem.
>
>
>
> Most of the time this occurs in the ThriftServer, as far as I know, because
> people are trying to collect a huge result set -- can you confirm that? If it
> falls into this category, you can probably set
> “spark.sql.thriftServer.incrementalCollect” to false;
>
>
>
> Hao
>
>
>
> *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
> *Sent:* Friday, November 13, 2015 8:30 AM
> *To:* user@spark.apache.org
> *Subject:* HiveServer2 Thrift OOM
>
>
>
> Hi folks, I'm starting a HiveServer2 from a HiveContext
> (HiveThriftServer2.startWithContext(hiveContext)) and then connecting to
> it via beeline. On the server side, I see the below error, which I think
> is related to https://issues.apache.org/jira/browse/HIVE-6468
>
>
>
> But I'd like to know:
>
>
>
> 1. why I see it (I'm using bin/beeline from spark to connect, not http)
>
> 2. should I be dropping any hive-site or hive-default files in conf/ --
> Hive-6468 talks about *hive.server2.sasl.message.limit *but I can't see
> any documentation on where this setting would go or what's a reasonable
> value (im trying to do a light-weight deployment and have not needed
> hive-site.xml so far...)
>
>
>
> Advice on how to get rid of the below exception much appreciated
>
>
>
>
>
> Exception in thread "pool-17-thread-2" java.lang.OutOfMemoryError: Java heap 
> space
>
> at 
> org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:181)
>
> at 
> org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
>
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
>
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>
> at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
> E
>
> ​
>


HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
Hi folks, I'm starting a HiveServer2 from a HiveContext
(HiveThriftServer2.startWithContext(hiveContext)) and then connecting to
it via beeline. On the server side, I see the below error, which I think
is related to https://issues.apache.org/jira/browse/HIVE-6468
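
For reference, a minimal sketch of the setup described above, assuming Spark 1.x APIs (the app name is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("thrift-from-context"))
val hiveContext = new HiveContext(sc)
// register temp tables on hiveContext here, then expose them over JDBC:
HiveThriftServer2.startWithContext(hiveContext)
```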

But I'd like to know:

1. why I see it (I'm using bin/beeline from spark to connect, not http)
2. should I be dropping any hive-site or hive-default files in conf/ --
HIVE-6468 talks about hive.server2.sasl.message.limit, but I can't see any
documentation on where this setting would go or what a reasonable value is
(I'm trying to do a light-weight deployment and have not needed
hive-site.xml so far...)

Advice on how to get rid of the below exception much appreciated


Exception in thread "pool-17-thread-2" java.lang.OutOfMemoryError:
Java heap space
at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:181)
at 
org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
E

​


101 question on external metastore

2015-11-05 Thread Yana Kadiyska
Hi folks, trying to experiment with a minimal external metastore.

I am following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode

I grabbed Derby 10.12.1.1 and started an instance, verified I can connect
via the ij tool, and confirmed that the process is listening on 1527.

put the following hive-site.xml under conf
```
<?xml version="1.0"?>
<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
</configuration>
```

I then try to run spark-shell thusly:
bin/spark-shell --driver-class-path
/home/yana/db-derby-10.12.1.1-bin/lib/derbyclient.jar

and I get an ugly stack trace like so...

Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.derby.jdbc.EmbeddedDriver
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:379)
at
org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
at
org.datanucleus.store.rdbms.connectionpool.DBCPConnectionPoolFactory.createConnectionPool(DBCPConnectionPoolFactory.java:50)
at
org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
at
org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
at
org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
... 114 more

<console>:10: error: not found: value sqlContext
   import sqlContext.implicits._


What am I doing wrong? Not sure why it's looking for Embedded anything; I'm
specifically trying not to use the embedded server. But I know my hive-site is
being read, because starting without --driver-class-path does say it can't
load org.apache.derby.jdbc.ClientDriver.
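
One quick check worth doing (a sketch, run from the same spark-shell): verify that the client driver class is actually loadable on the driver.

```scala
// Throws ClassNotFoundException if derbyclient.jar is not on the driver
// classpath; if it succeeds, the problem is elsewhere (e.g. which driver
// DataNucleus decides to load).
Class.forName("org.apache.derby.jdbc.ClientDriver")
```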


Re: Subtract on rdd2 is throwing below exception

2015-11-05 Thread Yana Kadiyska
subtract is not the issue. Spark is lazy, so a lot of the time you have many,
many lines of code which do not in fact run until you perform some action (in
your case, subtract). As you can see from the stacktrace, the NPE is from
joda, which is used in the partitioner (I'm suspecting in Cassandra). But the
short story is that the exception is in one of the many RDDs you're chaining.

What I would suggest is that you force evaluation on your RDDs for debugging
purposes -- e.g. RDD2, mappedRDD, CassandraJoinRDD, subtractedRDD... I'd try
to take a count on all of these and see if you can flush out the issue.
Another place to look is your Cassandra table -- my guess would be that you
use time as part of a partition key somewhere and the data in the field
you're using is no good...
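
A minimal sketch of that debugging approach, using the RDD names from the code quoted below (assumed to be in scope):

```scala
// Force evaluation step by step so the failing RDD in the chain surfaces:
println("RDD1: " + RDD1.count())                // base RDD
println("join: " + CassandraJoinRDD.count())    // Cassandra join result
println("mapped: " + mappedCRdd.count())        // tuples only
println("subtracted: " + subtractedRdd.count()) // the step that reported the NPE
```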

On Thu, Nov 5, 2015 at 8:32 AM, Priya Ch 
wrote:

> Hi All,
>
>
>  I am seeing an exception when trying to subtract 2 RDDs.
>
>  Lets say rdd1 has messages like -
>
> *  pnr,  bookingId,  BookingObject*
>  101,   1,   BookingObject1 // - event number is 0
>  102,   1,   BookingObject2// - event number is 0
>  103,   2,   BookingObject3//-event number is  1
>
> rdd1 looks like RDD1[(String,Int,Booking)].
>
> Booking table in Cassandra has primary key as pnr and bookingId.
> Lets say Booking table has following rows-
>
> *pnr,  bookingId, eventNumber*
> Row1 -  101,   1,  1
> Row2 -  103,   2,  0
>
> RDD1.joinWithCassandraTable on columns pnr and bookingId with Booking
> table is giving me the following CassandraJoinRDD -
>
> (101, 1, BookingObject1), Row1
> (103, 2, BookingObject3), Row2
>
> Now on this rdd, I am comparing the event number of BookingObject against the
> eventNumber in the row and filtering the messages whose eventNumber is greater
> than that in the row - which gives the following Rdd
>
> val RDD2:RDD[(String,Int,BookingObject), CassandraRow] contains the below
> record
>
> (102, 2, BookingObject3), Row2.
>
> But I also need pnr 102 from the original rdd, as it does not exist in the DB.
> Hence, to get such messages, I am subtracting the CassandraJoinRDD from the
> original RDD, i.e. RDD1, as
>
> val mappedCRdd= CassandraJoinRDD.map{case(tuple, row) => tuple}
> subtractedRdd= RDD1.subtract(mappedCRdd)
>
>
>
> val mappedRdd2 = RDD2.map{case(tuple, row) => tuple}
>
> Now I am doing union this subtractedRdd with mappedRdd2 as
> subtractedRdd.union(mappedRdd2 )
>
> But subtract on Rdd is throwing below exception -
>
>
> java.lang.NullPointerException
>   at org.joda.time.LocalDateTime.getValue(LocalDateTime.java:566)
>   at org.joda.time.base.AbstractPartial.hashCode(AbstractPartial.java:282)
>   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
>   at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
>   at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
>   at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
>   at com.amadeus.ti.models.tof.TOFModel$GCAS.hashCode(TOFModel.scala:14)
>   at scala.runtime.ScalaRunTime$.hash(ScalaRunTime.scala:210)
>   at scala.util.hashing.MurmurHash3.productHash(MurmurHash3.scala:63)
>   at scala.util.hashing.MurmurHash3$.productHash(MurmurHash3.scala:210)
>   at scala.runtime.ScalaRunTime$._hashCode(ScalaRunTime.scala:172)
>   at com.amadeus.ti.models.tof.TOFModel$TAOFRS.hashCode(TOFModel.scala:7)
>   at java.util.HashMap.hash(HashMap.java:362)
>   at java.util.HashMap.put(HashMap.java:492)
>   at org.apache.spark.rdd.SubtractedRDD.org 
> $apache$spark$rdd$SubtractedRDD$$getSeq$1(SubtractedRDD.scala:104)
>   at 
> org.apache.spark.rdd.SubtractedRDD$$anonfun$compute$1.apply(SubtractedRDD.scala:119)
>   at 
> org.apache.spark.rdd.SubtractedRDD$$anonfun$compute$1.apply(SubtractedRDD.scala:119)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:116)
>   at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:119)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at 

how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks,

I have a need to "append" two dataframes -- I was hoping to use unionAll,
but it seems that this operation treats the underlying dataframes as a
sequence of columns rather than as a map.

In particular, my problem is that the columns in the two DFs are not in the
same order -- notice that my customer_id somehow comes out as a string:

This is Spark 1.4.1

case class Test(epoch: Long,browser:String,customer_id:Int,uri:String)
val test = Test(1234l,"firefox",999,"http://foobar")

case class Test1( customer_id :Int,uri:String,browser:String,
 epoch :Long)
val test1 = Test1(888,"http://foobar","ie",12343)
val df=sc.parallelize(Seq(test)).toDF
val df1=sc.parallelize(Seq(test1)).toDF
df.unionAll(df1)

//res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser:
string, customer_id: string, uri: string]

​

Is unionAll the wrong operation? Any special incantations? Or advice on how
to otherwise get this to succeed?
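
A sketch of one common workaround (not from the thread; assumes the Spark 1.x DataFrame API): reorder the second frame's columns to match the first before unioning, since unionAll is positional.

```scala
import org.apache.spark.sql.functions.col

// Align df1's column order with df's, then union:
val df1Aligned = df1.select(df.columns.map(col): _*)
val merged = df.unionAll(df1Aligned)
merged.printSchema()   // customer_id should now stay an int
```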


Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Not a bad idea I suspect but doesn't help me. I dumbed down the repro to
ask for help. In reality one of my dataframes is a cassandra DF.
So cassDF.registerTempTable("df1") registers the temp table in a different
SQL Context (new CassandraSQLContext(sc)).


scala> sql("select customer_id, uri, browser, epoch from df union all
select customer_id, uri, browser, epoch from df1").show()
org.apache.spark.sql.AnalysisException: no such table df1; line 1 pos 103
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:233)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)


On Fri, Oct 30, 2015 at 3:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> How about the following ?
>
> scala> df.registerTempTable("df")
> scala> df1.registerTempTable("df1")
> scala> sql("select customer_id, uri, browser, epoch from df union select
> customer_id, uri, browser, epoch from df1").show()
> +---+-+---+-+
> |customer_id|  uri|browser|epoch|
> +---+-+---+-+
> |999|http://foobar|firefox| 1234|
> |888|http://foobar| ie|12343|
> +---+-+---+-+
>
> Cheers
>
> On Fri, Oct 30, 2015 at 12:11 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Hi folks,
>>
>> I have a need to "append" two dataframes -- I was hoping to use UnionAll
>> but it seems that this operation treats the underlying dataframes as
>> sequence of columns, rather than a map.
>>
>> In particular, my problem is that the columns in the two DFs are not in
>> the same order --notice that my customer_id somehow comes out a string:
>>
>> This is Spark 1.4.1
>>
>> case class Test(epoch: Long,browser:String,customer_id:Int,uri:String)
>> val test = Test(1234l,"firefox",999,"http://foobar")
>>
>> case class Test1( customer_id :Int,uri:String,browser:String,   
>> epoch :Long)
>> val test1 = Test1(888,"http://foobar","ie",12343)
>> val df=sc.parallelize(Seq(test)).toDF
>> val df1=sc.parallelize(Seq(test1)).toDF
>> df.unionAll(df1)
>>
>> //res2: org.apache.spark.sql.DataFrame = [epoch: bigint, browser: string, 
>> customer_id: string, uri: string]
>>
>> ​
>>
>> Is unionAll the wrong operation? Any special incantations? Or advice on
>> how to otherwise get this to succeed?
>>
>
>


Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Yana Kadiyska
For this issue in particular (ERROR XSDB6: Another instance of Derby may
have already booted the database /spark/spark-1.4.1/metastore_db) -- I
think it depends on where you start your application and HiveThriftServer
from. I've run into a similar issue running a driver app first, which would
create a directory called metastore_db. If I then try to start spark-shell
from the same directory, I will see this exception. So it is like
SPARK-9776. It's not so much that the two are in the same process (as the
bug resolution states); I think you can't run two drivers that start a
HiveContext from the same directory.


On Wed, Oct 28, 2015 at 4:10 PM, Bryan Jeffrey 
wrote:

> All,
>
> One issue I'm seeing is that I start the thrift server (for jdbc access)
> via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master
> spark://master:7077 --hiveconf "spark.cores.max=2"
>
> After about 40 seconds the Thrift server is started and available on
> default port 1.
>
> I then submit my application - and the application throws the following
> error:
>
> Caused by: java.sql.SQLException: Failed to start database 'metastore_db'
> with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6a552721,
> see the next exception for details.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
> Source)
> ... 86 more
> Caused by: java.sql.SQLException: Another instance of Derby may have
> already booted the database /spark/spark-1.4.1/metastore_db.
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
> Source)
> at
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown
> Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown
> Source)
> ... 83 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
> the database /spark/spark-1.4.1/metastore_db.
>
> This also happens if I do the opposite (submit the application first, and
> then start the thrift server).
>
> It looks similar to the following issue -- but not quite the same:
> https://issues.apache.org/jira/browse/SPARK-9776
>
> It seems like this set of steps works fine if the metadata database is not
> yet created - but once it's created this happens every time.  Is this a
> known issue? Is there a workaround?
>
> Regards,
>
> Bryan Jeffrey
>
> On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey 
> wrote:
>
>> Susan,
>>
>> I did give that a shot -- I'm seeing a number of oddities:
>>
>> (1) 'Partition By' appears to only accept alphanumeric lower-case fields.
>> It will work for 'machinename', but not 'machineName' or 'machine_name'.
>> (2) When partitioning with maps included in the data I get odd string
>> conversion issues
>> (3) When partitioning without maps I see frequent out of memory issues
>>
>> I'll update this email when I've got a more concrete example of problems.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>>
>> On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang 
>> wrote:
>>
>>> Have you tried partitionBy?
>>>
>>> Something like
>>>
>>> hiveWindowsEvents.foreachRDD( rdd => {
>>>   val eventsDataFrame = rdd.toDF()
>>>   eventsDataFrame.write.mode(SaveMode.Append).partitionBy("
>>> windows_event_time_bin").saveAsTable("windows_event")
>>> })
>>>
>>>
>>>
>>> On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey 
>>> wrote:
>>>
 Hello.

 I am working to get a simple solution working using Spark SQL.  I am
 writing streaming data to persistent tables using a HiveContext.  Writing
 to a persistent non-partitioned table works well - I update the table using
 Spark streaming, and the output is available via Hive Thrift/JDBC.

 I create a table that looks like the following:

 0: jdbc:hive2://localhost:1> describe windows_event;
 describe windows_event;
 +--+-+--+
 | col_name |  data_type  | comment  |
 +--+-+--+
 | target_entity| string  | NULL |
 | target_entity_type   | string  | NULL |
 | date_time_utc| timestamp   | NULL |
 | machine_ip   | string  | NULL |
 | event_id | string  | NULL |
 | event_data   | map  | NULL |
 | description  | string  | NULL |
 | event_record_id  | string  | NULL |
 | level| string   

Re: Maven build failed (Spark master)

2015-10-26 Thread Yana Kadiyska
In 1.4, ./make-distribution.sh produces a .tgz file in the root directory (the
same directory that make-distribution.sh is in).



On Mon, Oct 26, 2015 at 8:46 AM, Kayode Odeyemi  wrote:

> Hi,
>
> The ./make_distribution task completed. However, I can't seem to locate the
> .tar.gz file.
>
> Where does Spark save this? or should I just work with the dist directory?
>
> On Fri, Oct 23, 2015 at 4:23 PM, Kayode Odeyemi  wrote:
>
>> I saw this when I tested manually (without ./make-distribution)
>>
>> Detected Maven Version: 3.2.2 is not in the allowed range 3.3.3.
>>
>> So I simply upgraded maven to 3.3.3.
>>
>> Resolved. Thanks
>>
>> On Fri, Oct 23, 2015 at 3:17 PM, Sean Owen  wrote:
>>
>>> This doesn't show the actual error output from Maven. I have a strong
>>> guess that you haven't set MAVEN_OPTS to increase the memory Maven can
>>> use.
>>>
>>> On Fri, Oct 23, 2015 at 6:14 AM, Kayode Odeyemi 
>>> wrote:
>>> > Hi,
>>> >
>>> > I can't seem to get a successful maven build. Please see command output
>>> > below:
>>> >
>>> > bash-3.2$ ./make-distribution.sh --name spark-latest --tgz --mvn mvn
>>> > -Dhadoop.version=2.7.0 -Phadoop-2.7 -Phive -Phive-thriftserver
>>> -DskipTests
>>> > clean package
>>> > +++ dirname ./make-distribution.sh
>>> > ++ cd .
>>> > ++ pwd
>>> > + SPARK_HOME=/usr/local/spark-latest
>>> > + DISTDIR=/usr/local/spark-latest/dist
>>> > + SPARK_TACHYON=false
>>> > + TACHYON_VERSION=0.7.1
>>> > + TACHYON_TGZ=tachyon-0.7.1-bin.tar.gz
>>> > +
>>> > TACHYON_URL=
>>> https://github.com/amplab/tachyon/releases/download/v0.7.1/tachyon-0.7.1-bin.tar.gz
>>> > + MAKE_TGZ=false
>>> > + NAME=none
>>> > + MVN=/usr/local/spark-latest/build/mvn
>>> > + ((  12  ))
>>> > + case $1 in
>>> > + NAME=spark-latest
>>> > + shift
>>> > + shift
>>> > + ((  10  ))
>>> > + case $1 in
>>> > + MAKE_TGZ=true
>>> > + shift
>>> > + ((  9  ))
>>> > + case $1 in
>>> > + MVN=mvn
>>> > + shift
>>> > + shift
>>> > + ((  7  ))
>>> > + case $1 in
>>> > + break
>>> > + '[' -z
>>> /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home ']'
>>> > + '[' -z
>>> /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home ']'
>>> > ++ command -v git
>>> > + '[' /usr/bin/git ']'
>>> > ++ git rev-parse --short HEAD
>>> > + GITREV=487d409
>>> > + '[' '!' -z 487d409 ']'
>>> > + GITREVSTRING=' (git revision 487d409)'
>>> > + unset GITREV
>>> > ++ command -v mvn
>>> > + '[' '!' /usr/bin/mvn ']'
>>> > ++ mvn help:evaluate -Dexpression=project.version
>>> -Dhadoop.version=2.7.0
>>> > -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
>>> > ++ grep -v INFO
>>> > ++ tail -n 1
>>> > + VERSION='[ERROR] [Help 1]
>>> >
>>> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException'
>>> >
>>> > Same output error with JDK 7
>>> >
>>> > Appreciate your help.
>>> >
>>> >
>>>
>>
>>
>>
>


Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
thank you so much! You are correct. This is the second time I've made this
mistake :(

On Mon, Oct 26, 2015 at 11:36 AM, java8964  wrote:

> Maybe you need the Hive part?
>
> Yong
>
> --
> Date: Mon, 26 Oct 2015 11:34:30 -0400
> Subject: Problem with make-distribution.sh
> From: yana.kadiy...@gmail.com
> To: user@spark.apache.org
>
>
> Hi folks,
>
> building spark instructions (
> http://spark.apache.org/docs/latest/building-spark.html) suggest that
>
>
> ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
>
>
>
> should produce a distribution similar to the ones found on the "Downloads"
> page.
>
> I noticed that the tgz I built using the above command does not produce
> the datanucleus jars which are included in the "boxed" spark distributions.
> What is the best-practice advice here?
>
> I would like my distribution to match the official one as closely as
> possible.
>
> Thanks
>


Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
Hi folks,

The "Building Spark" instructions (
http://spark.apache.org/docs/latest/building-spark.html) suggest that


./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn



should produce a distribution similar to the ones found on the "Downloads"
page.

I noticed that the tgz I built using the above command does not contain the
datanucleus jars which are included in the "boxed" Spark distributions.
What is the best-practice advice here?

I would like my distribution to match the official one as closely as
possible.

Thanks


Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
@Sean, here is where I think it's a little misleading (underlining is mine):

Building a Runnable Distribution

To create a *Spark distribution like those distributed by the Spark
Downloads <http://spark.apache.org/downloads.html> page*, and that is laid
out so as to be runnable, use make-distribution.sh in the project root
directory. It can be configured with Maven profile settings and so on like
the direct Maven build. Example:


Agreed that "like" doesn't necessarily imply "exactly the same". On the
other hand, if I go to the download page all I select is a hadoop version
and distribution, so it's not super-intuitive that -Phive was used to
produce these. I don't have a strong opinion on whether this should be a
fix to the script or the docs but now that it's bitten me twice I'm very
appreciative of either :)

Thanks

On Mon, Oct 26, 2015 at 1:29 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think the page suggests that gives you any of the tarballs on the
> downloads page, and -Phive does not by itself do so either.
>
> On Mon, Oct 26, 2015 at 4:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I logged SPARK-11318 with a PR.
>>
>> I verified that by adding -Phive the datanucleus jars are included:
>>
>> tar tzvf spark-1.6.0-SNAPSHOT-bin-custom-spark.tgz | grep datanucleus
>> -rw-r--r-- hbase/hadoop 1890075 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-core-3.2.10.jar
>> -rw-r--r-- hbase/hadoop339666 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-api-jdo-3.2.6.jar
>> -rw-r--r-- hbase/hadoop   1809447 2015-10-26 09:52
>> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-rdbms-3.2.9.jar
>>
>> Cheers
>>
>> On Mon, Oct 26, 2015 at 8:52 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
>> wrote:
>>
>>> thank you so much! You are correct. This is the second time I've made
>>> this mistake :(
>>>
>>> On Mon, Oct 26, 2015 at 11:36 AM, java8964 <java8...@hotmail.com> wrote:
>>>
>>>> Maybe you need the Hive part?
>>>>
>>>> Yong
>>>>
>>>> --
>>>> Date: Mon, 26 Oct 2015 11:34:30 -0400
>>>> Subject: Problem with make-distribution.sh
>>>> From: yana.kadiy...@gmail.com
>>>> To: user@spark.apache.org
>>>>
>>>>
>>>> Hi folks,
>>>>
>>>> building spark instructions (
>>>> http://spark.apache.org/docs/latest/building-spark.html) suggest that
>>>>
>>>>
>>>> ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
>>>>
>>>>
>>>>
>>>> should produce a distribution similar to the ones found on the
>>>> "Downloads" page.
>>>>
>>>> I noticed that the tgz I built using the above command does not produce
>>>> the datanucleus jars which are included in the "boxed" spark distributions.
>>>> What is the best-practice advice here?
>>>>
>>>> I would like my distribution to match the official one as closely as
>>>> possible.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>


Re: SQLcontext changing String field to Long

2015-10-10 Thread Yana Kadiyska
Can you show the output of df.printSchema? Just a guess, but I think I ran
into something similar with a column that was part of a path in parquet.
E.g. we had an account_id in the parquet file data itself which was of type
string, but we also named the files in the following manner:
/somepath/account_id=.../file.parquet. Since Spark uses the paths for
partition discovery, it was actually inferring that account_id is a numeric
type, and upon reading the data we ran into the exception you're describing
(this was in Spark 1.4).
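
If the column really is coming from a path segment like .../batch_id=12345/..., here is a hedged sketch of a Spark 1.5+ setting that keeps discovered partition columns as strings (not a confirmed fix for this report; the path is hypothetical):

```scala
// Disable partition-column type inference so partition columns stay StringType:
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val df = sqlContext.read.parquet("/somepath")  // hypothetical path
df.printSchema()                               // check the type of batch_id here
```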

On Fri, Oct 9, 2015 at 7:55 PM, Abhisheks  wrote:

> Hi there,
>
> I have saved my records in to parquet format and am using Spark1.5. But
> when
> I try to fetch the columns it throws exception*
> java.lang.ClassCastException: java.lang.Long cannot be cast to
> org.apache.spark.unsafe.types.UTF8String*.
>
> This field is saved as a String while writing parquet. So here is the sample
> code and output for the same:
>
> logger.info("troubling thing is ::" +
> sqlContext.sql(fileSelectQuery).schema().toString());
> DataFrame df= sqlContext.sql(fileSelectQuery);
> JavaRDD rdd2 = df.toJavaRDD();
>
> First Line in the code (Logger) prints this:
> troubling thing is ::StructType(StructField(batch_id,StringType,true))
>
> But the moment after that, the exception comes up.
>
> Any idea why it is treating the field as Long? (One unique thing about the
> column is that it is always a number, e.g. a timestamp.)
>
> Any help is appreciated.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SQLcontext-changing-String-field-to-Long-tp25005.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark-submit hive connection through spark Initial job has not accepted any resources

2015-10-10 Thread Yana Kadiyska
"Job has not accepted resources" is a well-known error message -- you can
search the Internet. 2 common causes come to mind:
1) you already have an application connected to the master -- by default a
driver will grab all resources so unless that application disconnects,
nothing else is allowed to connect
2) All your workers are dead/disconnected and there are no resources for
your master to allocate

As the error suggests "check your cluster UI to ensure that workers are
registered and have sufficient resources". If you can't see what's wrong,
maybe send a screenshot of your UI screen. But the error has nothing to do
with Hive -- this is a spark-driver connecting to master issue

The NativeCodeLoader warning is ignorable.

On Fri, Oct 9, 2015 at 6:52 AM, vinayak  wrote:

> Java code which I am trying to invoke.
>
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.hive.HiveContext;
>
> public class SparkHiveInsertor {
>
> public static void main(String[] args) {
>
> SparkContext sctx=new SparkContext();
> System.out.println(">> starting
> "+sctx.isLocal());
> JavaSparkContext ctx=new JavaSparkContext(sctx);
> HiveContext hiveCtx=new HiveContext(ctx.sc());
> DataFrame df= hiveCtx.sql("show tables");
> System.out.println(">> count is
> "+df.count());
> }
> }
>
> command to submit job ./spark-submit --master spark://masterIp:7077
> --deploy-mode client --class com.ceg.spark.hive.sparkhive.SparkHiveInsertor
> --executor-cores 2 --executor-memory 1gb
> /home/someuser/Desktop/30sep2015/hivespark.jar
> --
> View this message in context: Re: spark-submit hive connection through
> spark Initial job has not accepted any resources
> 
>
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
Thanks a lot Cody! I was punting on the decoders by calling count (or
trying to, since my types require a custom decoder) but your sample code is
exactly what I was trying to achieve. The error message threw me off, will
work on the decoders now
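
For reference, a minimal sketch of the typed call Cody describes below, assuming plain String keys and values (the custom types mentioned above would need their own kafka.serializer.Decoder implementations; brokers is the value from the thread):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val offsetRanges = (0 to 2).map(p => OffsetRange.create("ingress", p, 0, 100)).toArray

// Type parameters: key type, value type, key decoder, value decoder
val messages = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)
messages.count()
```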

On Tue, Sep 22, 2015 at 10:50 AM, Cody Koeninger <c...@koeninger.org> wrote:

> You need type parameters for the call to createRDD indicating the type of
> the key / value and the decoder to use for each.
>
> See
>
>
> https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/BasicRDD.scala
>
> Also, you need to check to see if offsets 0 through 100 are still actually
> present in the kafka logs.
>
> On Tue, Sep 22, 2015 at 9:38 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka
>> queue into HDFS. Being very new to Kafka, not sure if I'm messing something
>> up on that side...My hope is to read the messages presently in the queue
>> (or at least the first 100 for now)
>>
>> Here is what I have:
>> Kafka side:
>>
>>  ./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic ingress 
>> --broker-list IP1:9092,IP2:9092,IP3:9092 --time -1
>> ingress:0:34386
>> ingress:1:34148
>> ingress:2:34300
>>
>> ​
>>
>> On Spark side I'm trying this(1.4.1):
>>
>> bin/spark-shell --jars
>> kafka-clients-0.8.2.0.jar,spark-streaming-kafka_2.10-1.4.1.jar,kafka_2.10-0.8.2.0.jar,metrics-core-2.2.0.ja
>>
>>
>>
>> val brokers="IP1:9092,IP2:9092,IP3:9092" //same as IPs above
>> val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
>>
>> val offsetRange= (0 to 2).map(part=>OffsetRange.create("ingress",part,0,100))
>> val messages= KafkaUtils.createRDD(sc,kafkaParams,offsetRange.toArray)
>> messages: org.apache.spark.rdd.RDD[(Nothing, Nothing)] = KafkaRDD[1] at RDD 
>> at KafkaRDD.scala:45
>>
>> ​
>>
>> when I try messages.count I get:
>>
>> 15/09/22 14:01:17 ERROR TaskContextImpl: Error in TaskCompletionListener
>> java.lang.NullPointerException
>>  at 
>> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(KafkaRDD.scala:157)
>>  at 
>> org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>>  at 
>> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:100)
>>  at 
>> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:100)
>>  at 
>> org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:56)
>>  at 
>> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:75)
>>  at 
>> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:73)
>>  at 
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>  at 
>> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:73)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:72)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>


Re: Sending yarn application logs to web socket

2015-09-07 Thread Yana Kadiyska
Hopefully someone will give you a more direct answer, but whenever I'm
having issues with log4j I always try -Dlog4j.debug=true. This will tell you
which log4j settings are getting picked up from where. I've spent countless
hours due to typos in the file, for example.

On Mon, Sep 7, 2015 at 11:47 AM, Jeetendra Gangele 
wrote:

> I also tried placing my customized log4j.properties file under
> src/main/resources; still no luck.
>
> Won't the above step modify the default YARN and Spark log4j.properties?
>
> Anyhow, it's still taking log4j.properties from YARN.
>
>
>
> On 7 September 2015 at 19:25, Jeetendra Gangele 
> wrote:
>
>> anybody here to help?
>>
>>
>>
>> On 7 September 2015 at 17:53, Jeetendra Gangele 
>> wrote:
>>
>>> Hi all, I have been trying to send my application-related logs to a socket
>>> so that we can feed Logstash and check the application logs.
>>>
>>> here is my log4j.property file
>>>
>>> main.logger=RFA,SA
>>>
>>> log4j.appender.SA=org.apache.log4j.net.SocketAppender
>>> log4j.appender.SA.Port=4560
>>> log4j.appender.SA.RemoteHost=hadoop07.housing.com
>>> log4j.appender.SA.ReconnectionDelay=1
>>> log4j.appender.SA.Application=NM-${user.dir}
>>> # Ignore messages below warning level from Jetty, because it's a bit
>>> verbose
>>> log4j.logger.org.spark-project.jetty=WARN
>>> log4j.logger.org.apache.hadoop=WARN
>>>
>>>
>>> I am launching my spark job using below common on YARN-cluster mode
>>>
>>> *spark-submit --name data-ingestion --master yarn-cluster --conf
>>> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
>>>  --files
>>> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
>>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>>> --conf
>>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>>> --class com.housing.spark.streaming.Binning
>>> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>>>
>>>
>>> *Can anybody please guide me on why I am not getting the logs at the socket?*
>>>
>>>
>>> *I followed many pages listed below without success:*
>>>
>>> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>>>
>>> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>>>
>>> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>>>
>>>
>>
>
>
>
>


Re: Problem with repartition/OOM

2015-09-06 Thread Yana Kadiyska
Thanks Yanbo,

I was running with 1G per executor; my file is 7.5G, running with the
standard block size of 128M, resulting in 7500M/128M ≈ 59 partitions
naturally. My boxes have 8 CPUs, so I figured they could be processing 8
tasks/partitions at a time, needing

8 * (partition size) of memory per executor, so 8 * 128M = 1G

Is this the right way to do this math?

I'm confused about _decreasing_ the number of partitions -- I thought from
a spark perspective, 7.5G / 10 partitions would result in 750M per
partition. So a Spark executor with 8 cores would potentially need
750*8=6000M of memory

Maybe my confusion comes from terminology -- I thought that in Spark the
"default" number of partitions is always the number of input splits. From
your example, (number of partitions) * (Parquet block size) = minimum
required memory; yet since, as I understand it, (number of partitions) =
(file size) / (Parquet block size), that product would also equal the
overall Parquet file size.

It cannot be that minimum required memory = Parquet file size.


On Sat, Sep 5, 2015 at 11:00 PM, Yanbo Liang <yblia...@gmail.com> wrote:

> The Parquet output writer allocates one block for each table partition it
> is processing and writes partitions in parallel. It will run out of
> memory if (number of partitions) times (Parquet block size) is greater than
> the available memory. You can try to decrease the number of partitions. And
> could you share the value of "parquet.block.size" and your available memory?
>
> 2015-09-05 18:59 GMT+08:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>
>> Hi folks, I have a strange issue. Trying to read a 7G file and do fairly
>> simple stuff with it:
>>
>> I can read the file/do simple operations on it. However, I'd prefer to
>> increase the number of partitions in preparation for more memory-intensive
>> operations (I'm happy to wait, I just need the job to complete).
>> Repartition seems to cause an OOM for me?
>> Could someone shed light/or speculate/ why this would happen -- I thought
>> we repartition higher to relieve memory pressure?
>>
>> Im using Spark1.4.1 CDH4 if that makes a difference
>>
>> This works
>>
>> val res2 = sqlContext.parquetFile(lst:_*).where($"customer_id"===lit(254))
>> res2.count
>> res1: Long = 77885925
>>
>> scala> res2.explain
>> == Physical Plan ==
>> Filter (customer_id#314 = 254)
>>  PhysicalRDD [4], MapPartitionsRDD[11] at
>>
>> scala> res2.rdd.partitions.size
>> res3: Int = 59
>>
>> ​
>>
>>
>> This doesnt:
>>
>> scala> res2.repartition(60).count
>> [Stage 2:>(1 + 45) / 
>> 59]15/09/05 10:17:21 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 
>> 62, fqdn): java.lang.OutOfMemoryError: Java heap space
>> at 
>> parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:729)
>> at 
>> parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:490)
>> at 
>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:116)
>> at 
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>> at 
>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>> at 
>> org.apache.spark.sql.sources.SqlNewHadoopRDD$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>> at 
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)
>> at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
>> at 
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:207)
>> at 
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> ​
>>
>
>


Re: Failing to include multiple JDBC drivers

2015-09-05 Thread Yana Kadiyska
If memory serves me correctly in 1.3.1 at least there was a problem with
when the driver was added -- the right classloader wasn't picking it up.
You can try searching the archives, but the issue is similar to these
threads:
http://stackoverflow.com/questions/30940566/connecting-from-spark-pyspark-to-postgresql
http://stackoverflow.com/questions/30221677/spark-sql-postgresql-jdbc-classpath-issues

I thought this was fixed in 1.4.1...but in any case, maybe try setting
SPARK_CLASSPATH explicitly if 1.4.1 is still a no-go...might be a PITA if
you're doing reads as you'd have to do it on each slave -- I'd try to run
with a single slave until you get this fixed...
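
A hedged sketch of another angle (not from the thread): with both jars on the classpath, name the driver class per source through the JDBC data source's "driver" option; hostnames, databases, and table names below are hypothetical.

```scala
val mysqlDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/db")   // hypothetical
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "some_table")
  .load()

val pgDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://pg-host:5432/db") // hypothetical
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "other_table")
  .load()
```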

On Fri, Sep 4, 2015 at 11:59 PM, Nicholas Connor <
nicholas.k.con...@gmail.com> wrote:

> So, I need to connect to multiple databases to do cool stuff with Spark.
> To do this, I need multiple database drivers: Postgres + MySQL.
>
> *Problem*: Spark fails to run both drivers
>
> This method works for one driver at a time:
>
> spark-submit  --driver-class-path="/driver.jar"
>
> These methods do not work for one driver, or many (though Spark does say
> Added "driver.jar" with timestamp *** in the log):
>
>- spark-submit --jars "driver1.jar, driver2.jar"
>- sparkContext.addJar("driver.jar")
>- echo 'spark.driver.extraClassPath="driver.jar"' >>
>spark-defaults.conf
>- echo 'spark.executor.extraClassPath="driver.jar"' >>
>spark-defaults.conf
>- sbt assembly (fat jar with drivers)
>
> *Example error:*
>
> Exception in thread "main" java.sql.SQLException: No suitable driver found
> for jdbc:mysql:// at
> com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055) at
> com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956) at
> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3491) at
> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3423) at
> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:910) at
> com.mysql.jdbc.MysqlIO.secureAuth411(MysqlIO.java:3923) at
> com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1273)
>
> *Versions Tested*: Spark 1.3.1 && 1.4.1
>
> What method can I use to load both drivers?
> Thanks,
>
> Nicholas Connor
>


Problem with repartition/OOM

2015-09-05 Thread Yana Kadiyska
Hi folks, I have a strange issue. Trying to read a 7G file and do fairly
simple stuff with it:

I can read the file/do simple operations on it. However, I'd prefer to
increase the number of partitions in preparation for more memory-intensive
operations (I'm happy to wait, I just need the job to complete).
Repartition seems to cause an OOM for me?
Could someone shed light/or speculate/ why this would happen -- I thought
we repartition higher to relieve memory pressure?

I'm using Spark 1.4.1 on CDH4, if that makes a difference

This works

val res2 = sqlContext.parquetFile(lst:_*).where($"customer_id"===lit(254))
res2.count
res1: Long = 77885925

scala> res2.explain
== Physical Plan ==
Filter (customer_id#314 = 254)
 PhysicalRDD [4], MapPartitionsRDD[11] at

scala> res2.rdd.partitions.size
res3: Int = 59

​


This doesnt:

scala> res2.repartition(60).count
[Stage 2:>(1 +
45) / 59]15/09/05 10:17:21 WARN TaskSetManager: Lost task 2.0 in stage
2.0 (TID 62, fqdn): java.lang.OutOfMemoryError: Java heap space
at 
parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:729)
at 
parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:490)
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:116)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:207)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

​


[SQL/Hive] Trouble with refreshTable

2015-08-25 Thread Yana Kadiyska
I'm having trouble with refreshTable, I suspect because I'm using it
incorrectly.

I am doing the following:

1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet
2. use registerTempTable to register my dataframe
3. A new file is dropped under  /foo/bar/
4. Call hiveContext.refreshTable in the hope that the paths for the
Dataframe are re-evaluated

Step 4 does not work as I imagine -- if I have 1 file in step 1, and 2
files in step 3, I still get the same count when I query the table

So I have 2 questions

1). Is there a way to see the files that a Dataframe/RDD is underpinned by
2). What is a reasonable way to refresh the table with newcomer data --
I'm suspecting I have to start over from step 1 to force the Dataframe to
re-see new files, but am hoping there is a simpler way (I know frames are
immutable but they are also lazy so I'm thinking paths with wildcards
evaluated per call might be possible?)

Thanks for any insights.


Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Yana Kadiyska
The PermGen space error is controlled with the MaxPermSize parameter. I run
with this in my pom, I think copied pretty literally from Spark's own
tests... I don't know what the sbt equivalent is but you should be able to
pass it...possibly via SBT_OPTS?


<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <version>1.0</version>
  <configuration>
    <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
    <parallel>false</parallel>
    <junitxml>.</junitxml>
    <filereports>SparkTestSuite.txt</filereports>
    <argLine>-Xmx3g -XX:MaxPermSize=256m -XX:ReservedCodeCacheSize=512m</argLine>
    <stderr/>
    <systemProperties>
      <java.awt.headless>true</java.awt.headless>
      <spark.testing>1</spark.testing>
      <spark.ui.enabled>false</spark.ui.enabled>
      <spark.driver.allowMultipleContexts>true</spark.driver.allowMultipleContexts>
    </systemProperties>
  </configuration>
  <executions>
    <execution>
      <id>test</id>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>
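
A rough sbt equivalent, assuming forked tests, would be the following build.sbt
settings (a sketch, not verified against this setup):

// build.sbt sketch: fork the test JVM and pass the same memory/PermGen flags
fork in Test := true
javaOptions in Test ++= Seq("-Xmx3g", "-XX:MaxPermSize=256m", "-XX:ReservedCodeCacheSize=512m")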


On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis mike.trie...@orcsol.com
wrote:

 Hello,

 I am using sbt and created a unit test where I create a `HiveContext` and
 execute some query and then return. Each time I run the unit test the JVM
 will increase it's memory usage until I get the error:

 Internal error when running tests: java.lang.OutOfMemoryError: PermGen
 space
 Exception in thread Thread-2 java.io.EOFException

 As a work-around, I can fork a new JVM each time I run the unit test,
 however, it seems like a bad solution as takes a while to run the unit
 test.

 By the way, I tried to importing the TestHiveContext:

- import org.apache.spark.sql.hive.test.TestHiveContext

 However, it suffers from the same memory issue. Has anyone else suffered
 from the same problem? Note that I am running these unit tests on my mac.

 Cheers, Mike.




Re: spark-submit and spark-shell behaviors mismatch.

2015-07-24 Thread Yana Kadiyska
that is pretty odd -- toMap not being there would be from scala...but what
is even weirder is that toMap is positively executed on the driver machine,
which is the same when you do spark-shell and spark-submit...does it work
if you run with --master local[*]?

Also, you can try to put a set -x in bin/spark-class right before the
RUNNER gets invoked -- this will show you the exact java command that is
being run, classpath and all

On Thu, Jul 23, 2015 at 3:14 PM, Dan Dong dongda...@gmail.com wrote:

 The problem should be toMap, as I tested that val maps2=maps.collect
 runs ok. When I run spark-shell, I run with --master
 mesos://cluster-1:5050 parameter which is the same with spark-submit.
 Confused here.



 2015-07-22 20:01 GMT-05:00 Yana Kadiyska yana.kadiy...@gmail.com:

 Is it complaining about collect or toMap? In either case this error
 is indicative of an old version usually -- any chance you have an old
 installation of Spark somehow? Or scala? You can try running spark-submit
 with --verbose. Also, when you say it runs with spark-shell do you run
 spark shell in local mode or with --master? I'd try with --master whatever
 master you use for spark-submit

 Also, if you're using standalone mode I believe the worker log contains
 the launch command for the executor -- you probably want to examine that
 classpath carefully

 On Wed, Jul 22, 2015 at 5:25 PM, Dan Dong dongda...@gmail.com wrote:

 Hi,

   I have a simple test spark program as below, the strange thing is that
 it runs well under a spark-shell, but will get a runtime error of

 java.lang.NoSuchMethodError:

 in spark-submit, which indicate the line of:

 val maps2=maps.collect.toMap

 has problem. But why the compilation has no problem and it works well
 under spark-shell (==> maps2: scala.collection.immutable.Map[Int,String] =
 Map(269953 -> once, 97 -> a, 451002 -> upon, 117481 -> was, 226916 ->
 there, 414413 -> time, 146327 -> king))? Thanks!

 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf
 import org.apache.spark.mllib.feature.HashingTF
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD
 import org.apache.spark.SparkContext
 import org.apache.spark._
 import SparkContext._


 val docs=sc.parallelize(Array(Array("once", "upon", "a", "time"),
 Array("there", "was", "a", "king")))

 val hashingTF = new HashingTF()

 val maps=docs.flatMap{term => term.map(ele => (hashingTF.indexOf(ele), ele))}

 val maps2=maps.collect.toMap


 Cheers,

 Dan






Help with Dataframe syntax ( IN / COLLECT_SET)

2015-07-23 Thread Yana Kadiyska
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In
other words, I'd like to figure out how to write the following query:

select collect_set(b),a from mytable where c in (1,2,3) group by a

I've started with

  someDF
  .where( -- not sure what to do for c here ---
  .groupBy($"a")
  .agg( -- collect_set is not part of sql functions as far as I see... --)

​

I know I can register a table and do raw sql but I'm trying to figure out
the DF route...

Help much appreciated.
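
(For reference: on later versions this becomes direct -- isin arrived in 1.5 and
collect_set as a DataFrame function in 1.6 -- so a sketch assuming those versions
would be:)

import org.apache.spark.sql.functions.collect_set

// select collect_set(b), a from mytable where c in (1,2,3) group by a
val result = someDF
  .where($"c".isin(1, 2, 3))
  .groupBy($"a")
  .agg(collect_set($"b").as("bs"))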


Re: spark-submit and spark-shell behaviors mismatch.

2015-07-22 Thread Yana Kadiyska
Is it complaining about collect or toMap? In either case this error is
indicative of an old version usually -- any chance you have an old
installation of Spark somehow? Or scala? You can try running spark-submit
with --verbose. Also, when you say it runs with spark-shell do you run
spark shell in local mode or with --master? I'd try with --master whatever
master you use for spark-submit

Also, if you're using standalone mode I believe the worker log contains the
launch command for the executor -- you probably want to examine that
classpath carefully

On Wed, Jul 22, 2015 at 5:25 PM, Dan Dong dongda...@gmail.com wrote:

 Hi,

   I have a simple test spark program as below, the strange thing is that
 it runs well under a spark-shell, but will get a runtime error of

 java.lang.NoSuchMethodError:

 in spark-submit, which indicate the line of:

 val maps2=maps.collect.toMap

 has problem. But why the compilation has no problem and it works well
 under spark-shell (==> maps2: scala.collection.immutable.Map[Int,String] =
 Map(269953 -> once, 97 -> a, 451002 -> upon, 117481 -> was, 226916 ->
 there, 414413 -> time, 146327 -> king))? Thanks!

 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf
 import org.apache.spark.mllib.feature.HashingTF
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD
 import org.apache.spark.SparkContext
 import org.apache.spark._
 import SparkContext._


 val docs=sc.parallelize(Array(Array("once", "upon", "a", "time"),
 Array("there", "was", "a", "king")))

 val hashingTF = new HashingTF()

 val maps=docs.flatMap{term => term.map(ele => (hashingTF.indexOf(ele), ele))}

 val maps2=maps.collect.toMap


 Cheers,

 Dan




PairRDDFunctions and DataFrames

2015-07-16 Thread Yana Kadiyska
Hi, could someone point me to the recommended way of using
countApproxDistinctByKey with DataFrames?

I know I can map to pair RDD but I'm wondering if there is a simpler
method? If someone knows if this operations is expressible in SQL that
information would be most appreciated as well.
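
(One note for reference: the DataFrame API does expose an approximate distinct
aggregate, approxCountDistinct, so a per-key approximate count can be expressed
without dropping to a pair RDD; a sketch, assuming columns named key and value:)

import org.apache.spark.sql.functions.approxCountDistinct

// approximate number of distinct values per key
val counts = df.groupBy($"key").agg(approxCountDistinct($"value").as("approx_distinct"))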


Re: Select all columns except some

2015-07-16 Thread Yana Kadiyska
Have you tried to examine what clean_cols contains? I'm suspicious of this
part: mkString(", ").
Try this:
val clean_cols : Seq[String] = df.columns...

if you get a type error you need to work on clean_cols (I suspect yours is
of type String at the moment and presents itself to Spark as a single
column names with commas embedded).

Not sure why the .drop call hangs but in either case drop returns a new
dataframe -- it's not a setter call
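
A minimal sketch of both routes, assuming the STATE_ prefix from your snippet
(untested):

// keep only the columns that don't start with STATE_
val cleanCols: Seq[String] = df.columns.filterNot(_.startsWith("STATE_")).toSeq
val cleaned = df.select(cleanCols.head, cleanCols.tail: _*)

// or, since drop returns a new DataFrame, fold it over the unwanted columns
val cleaned2 = df.columns.filter(_.startsWith("STATE_")).foldLeft(df)((d, c) => d.drop(c))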

On Thu, Jul 16, 2015 at 10:57 AM, saif.a.ell...@wellsfargo.com wrote:

  Hi,

 In a hundred columns dataframe, I wish to either *select all of them
 except* or *drop the ones I dont want*.

 I am failing in doing such simple task, tried two ways

 val clean_cols = df.columns.filterNot(col_name =>
 col_name.startWith("STATE_").mkString(", ")
 df.select(clean_cols)

 But this throws exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'asd_dt,
 industry_area,...’
 at
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 at
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
 at
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
 at
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
 at
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)

 The other thing I tried is

 df.columns.filter(col_name => col_name.startWith("STATE_"))
 for (col <- cols) df.drop(col)

 But this other thing doesn’t do anything or hangs up.

 Saif






Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Yana Kadiyska
Have you seen this SO thread:
http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin

This seems to be more related to the plugin than Spark, looking at the
stack trace

On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:

 I m still looking forward for the answer. I want to know how to properly
 close everything about spark in java standalone app.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-ThreadException-in-Apache-Spark-standalone-Java-Application-tp23675p23821.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SparkSQL 'describe table' tries to look at all records

2015-07-13 Thread Yana Kadiyska
Have you seen https://issues.apache.org/jira/browse/SPARK-6910 ? I opened
https://issues.apache.org/jira/browse/SPARK-6984, which I think is related
to this as well. There are a bunch of issues attached to it but basically
yes, Spark interactions with a large metastore are bad...very bad if your
metastore is large.

On Sun, Jul 12, 2015 at 11:39 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:

 Sorry all for not being clear. I'm using spark 1.4 and the table is a hive
 table, and the table is partitioned.

 On Sun, Jul 12, 2015 at 6:36 PM, Yin Huai yh...@databricks.com wrote:

 Jerrick,

 Let me ask a few clarification questions. What is the version of Spark?
 Is the table a hive table? What is the format of the table? Is the table
 partitioned?

 Thanks,

 Yin

 On Sun, Jul 12, 2015 at 6:01 PM, ayan guha guha.a...@gmail.com wrote:

 Describe computes statistics, so it will try to query the table. The one
 you are looking for is df.printSchema()

 On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang jerrickho...@gmail.com
 wrote:

 Hi all,

 I'm new to Spark and this question may be trivial or has already been
 answered, but when I do a 'describe table' from SparkSQL CLI it seems to
 try looking at all records at the table (which takes a really long time for
 big table) instead of just giving me the metadata of the table. Would
 appreciate if someone can give me some pointers, thanks!




 --
 Best Regards,
 Ayan Guha






Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
It's a bit hard to tell from the snippets of code but it's likely related
to the fact that when you serialize instances the enclosing class, if any,
also gets serialized, as well as any other place where fields used in the
closure come from...e.g.check this discussion:
http://stackoverflow.com/questions/9747443/java-io-invalidclassexception-no-valid-constructor

The programming guide also has good advice on serialization issues. I would
particularly check how Shortsale/Nomatch/Foreclosure are declared (I'd
advise making them top-level case classes)...
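
To make the suggestion concrete, here's roughly the shape I mean -- top-level
(not REPL-nested), serializable definitions; names borrowed from your snippet,
and the real validation logic is elided (a sketch, not tested):

import org.apache.spark.sql.Row

abstract class Validator extends Serializable {
  def validate(row: Row): Validator
}

case object Nomatch extends Validator {
  override def validate(row: Row): Validator = this
}

case object Shortsale extends Validator {
  override def validate(row: Row): Validator = this   // real check elided
}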

On Mon, Jul 13, 2015 at 12:32 PM, saif.a.ell...@wellsfargo.com wrote:

  Hi,

 For some experiment I am doing, I am trying to do the following.

 1.Created an abstract class Validator. Created case objects from Validator
 with validate(row: Row): Boolean method.

 2. Adding in a list all case objects

 3. Each validate takes a Row into account, returns *“itself”* if validate
 returns true, so then, I do this to return an arbitrary number for each
 match
  def evaluate_paths(row: Row, validators: List[ClientPath]): Int = {

 var result: Int = -1

 for (validator <- validators) {
 validator.validate(row) match {
 case Shortsale => result = 0
 case Foreclosure => result = 1
 case Nomatch => result = 99
 //...
 }
 }
 result
 }

 val validators = List[ClientPath](
 Shortsale,
 Foreclosure)

 4. Then I run the map[Int](row => evaluate_paths(row, validators))

 But this blows up, it does not like the creation of the list of validators
 when executing an action function on the RDD such as take(1).
 I have tried also instead of list, an Iterator and Array, but no case.
 Also replaced the for loop with a while loop.
 Curiously, I tried with custom-made Rows, and the execution works
 properly, when calling evaluate_paths(some_row, validators).

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 125.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 125.0 (TID 830, localhost): java.io.InvalidClassException:
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Nomatch$;
 no valid constructor at
 java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
 at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 ...
 ...
 ...
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
 at
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at
 org.apache.spark.scheduler.Task.run(Task.scala:70) at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace: at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
 at scala.Option.foreach(Option.scala:236) at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

  --

 Any 

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
I would certainly try to mark the Validator class as Serializable...If that
doesn't do it you can also try and see if this flag sheds more light:
-Dsun.io.serialization.extendedDebugInfo=true

 By programming guide I mean this:
https://spark.apache.org/docs/latest/programming-guide.html I could almost
swear I had seen an extended section on tricky serialization issues (i.e.
scenarios where you end up serializing more than you think because of what
your closure captures) but I can't locate this section now...

On Mon, Jul 13, 2015 at 1:30 PM, saif.a.ell...@wellsfargo.com wrote:

  Thank you very much for your time, here is how I designed the case
 classes, as far as I know they apply properly.



 Ps: By the way, what do you mean by “The programming guide?”



 abstract class Validator {



 // positions to access with Row.getInt(x)

 val shortsale_in_pos = 10

 val month_pos = 11

 val foreclosure_start_dt_pos = 14

 val filemonth_dt_pos = 12

 val reo_start_dt_pos = 14

 // ..



 // redesign into Iterable of Rows --

 def validate(input: org.apache.spark.sql.Row): Validator



 }



 case object Nomatch extends Validator {

 def validate(input: Row): Validator = this

   }



 case object Shortsale extends Validator {

 def validate(input: Row): Validator = {

 var check1: Boolean = if (input.getDouble(shortsale_in_pos) 
 140.0) true else false

 if (check1) this else Nomatch

 }

 }



 Saif



 *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
 *Sent:* Monday, July 13, 2015 2:16 PM
 *To:* Ellafi, Saif A.
 *Cc:* user@spark.apache.org
 *Subject:* Re: java.io.InvalidClassException



 It's a bit hard to tell from the snippets of code but it's likely related
 to the fact that when you serialize instances the enclosing class, if any,
 also gets serialized, as well as any other place where fields used in the
 closure come from...e.g.check this discussion:
 http://stackoverflow.com/questions/9747443/java-io-invalidclassexception-no-valid-constructor



 The programming guide also has good advice on serialization issues. I
 would particularly check how Shortsale/Nomatch/Foreclosure are declared
 (I'd advise making them top-level case classes)...



 On Mon, Jul 13, 2015 at 12:32 PM, saif.a.ell...@wellsfargo.com wrote:

 Hi,



 For some experiment I am doing, I am trying to do the following.



 1.Created an abstract class Validator. Created case objects from Validator
 with validate(row: Row): Boolean method.



 2. Adding in a list all case objects



 3. Each validate takes a Row into account, returns *“itself” *if validate
 returns true, so then, I do this to return an arbitrary number for each
 match

 def evaluate_paths(row: Row, validators: List[ClientPath]): Int = {



 var result: Int = -1



 for (validator <- validators) {

 validator.validate(row) match {

 case Shortsale => result = 0

 case Foreclosure => result = 1

 case Nomatch => result = 99

 //...

 }

 }

 result

 }



 val validators = List[ClientPath](

 Shortsale,

 Foreclosure)



 4. Then I run the map[Int](row => evaluate_paths(row, validators))



 But this blows up, it does not like the creation of the list of validators
 when executing an action function on the RDD such as take(1).

 I have tried also instead of list, an Iterator and Array, but no case.
 Also replaced the for loop with a while loop.

 Curiously, I tried with custom-made Rows, and the execution works
 properly, when calling evaluate_paths(some_row, validators).



 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 125.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 125.0 (TID 830, localhost): java.io.InvalidClassException:
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$Nomatch$;
 no valid constructor at
 java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
 at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

 ...

 ...

 ...

 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-13 Thread Yana Kadiyska
Oh, this is very interesting -- can you explain about your dependencies --
I'm running Tomcat 7 and ended up using spark-assembly from WEB_INF/lib and
removing the javax/servlet package out of it...but it's a pain in the neck.
If I'm reading your first message correctly you use hadoop common and
spark-core? Although I'm not sure why spark-core shows under gson:

[INFO] +- com.google.code.gson:gson:jar:2.2.2:compile
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.4.0:compile


Do you fat jar spark-core? Do you have spark-assembly in Tomcat's
runtime classpath anywhere? Curious on what a minimal setup here is.


Thanks a lot (I'd love to see your .pom if you have it on github or
somewhere accessible).


On Fri, Jul 10, 2015 at 2:24 PM, Zoran Jeremic zoran.jere...@gmail.com
wrote:

 It looks like there is no problem with Tomcat 8.

 On Fri, Jul 10, 2015 at 11:12 AM, Zoran Jeremic zoran.jere...@gmail.com
 wrote:

 Hi Ted,

 I'm running Tomcat 7 with Java:

 java version 1.8.0_45
 Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

 Zoran


 On Fri, Jul 10, 2015 at 10:45 AM, Ted Yu yuzhih...@gmail.com wrote:

 What version of Java is Tomcat run ?

 Thanks



 On Jul 10, 2015, at 10:09 AM, Zoran Jeremic zoran.jere...@gmail.com
 wrote:

 Hi,

 I've developed maven application that uses mongo-hadoop connector to
 pull data from mongodb and process it using Apache spark. The whole process
 runs smoothly if I run it on embedded Jetty server. However, if I deploy it
 to Tomcat server 7, it's always interrupted at the line of code which
 collects data from JavaPairRDD with exception that doesn't give me any clue
 what the problem might be:

 15/07/09 20:42:05 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 
 in memory on localhost:58302 (size: 6.3 KB, free: 946.6 MB)
  15/07/09 20:42:05 INFO spark.SparkContext: Created broadcast 0 from 
 newAPIHadoopRDD at SparkCategoryRecommender.java:106Exception in thread 
 Thread-6 java.lang.IncompatibleClassChangeError: Implementing class
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
 at 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at 
 org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2918)
 at 
 org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1174)
 at 
 org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1669)
 at 
 org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1547)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:264)
 at 
 org.apache.spark.mapreduce.SparkHadoopMapReduceUtil$class.firstAvailableClass(SparkHadoopMapReduceUtil.scala:74)
 at 
 org.apache.spark.mapreduce.SparkHadoopMapReduceUtil$class.newJobContext(SparkHadoopMapReduceUtil.scala:28)
 at 
 org.apache.spark.rdd.NewHadoopRDD.newJobContext(NewHadoopRDD.scala:66)
 at 
 org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1779)
 at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
 at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
 at 
 org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:335)
 at 
 org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:47)
 at 
 com.warrantylife.iqmetrix.spark.SparkCategoryRecommender.buildDataModelOnProductid(SparkCategoryRecommender.java:149)
 at 
 com.warrantylife.iqmetrix.spark.SparkCategoryRecommender.computeAndStoreRecommendationsForProductsInCategory(SparkCategoryRecommender.java:179)
 at 
 com.warrantylife.recommendation.impl.CategoryRecommendationManagerImpl.computeRecommendationsForCategory(CategoryRecommendationManagerImpl.java:105)
 at 
 com.warrantylife.testdata.TestDataGeneratorService$1.run(TestDataGeneratorService.java:55)
 at java.lang.Thread.run(Thread.java:745)


 I guess there is some library 

[SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
Hi folks, I just re-wrote a query from using UNION ALL to use with rollup
and I'm seeing some unexpected behavior. I'll open a JIRA if needed but
wanted to check if this is user error. Here is my code:

case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF

df.registerTempTable("foo")

sqlContext.sql("select count(*) as cnt, value as key, GROUPING__ID from
foo group by value with rollup").show(100)


sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID
from foo group by key%100 with rollup").show(100)

​

Grouping by value does the right thing, I get one group 0 with the overall
count. But grouping by expression (key%100) produces weird results --
appears that group 1 results are replicated as group 0. Am I doing
something wrong or is this a bug?


Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
+---+---+---+
|cnt|_c1|grp|
+---+---+---+
|  1| 31|  0|
|  1| 31|  1|
|  1|  4|  0|
|  1|  4|  1|
|  1| 42|  0|
|  1| 42|  1|
|  1| 15|  0|
|  1| 15|  1|
|  1| 26|  0|
|  1| 26|  1|
|  1| 37|  0|
|  1| 10|  0|
|  1| 37|  1|
|  1| 10|  1|
|  1| 48|  0|
|  1| 21|  0|
|  1| 48|  1|
|  1| 21|  1|
|  1| 32|  0|
|  1| 32|  1|
+---+---+---+

​

On Thu, Jul 9, 2015 at 11:54 AM, ayan guha guha.a...@gmail.com wrote:

 Can you please post result of show()?
 On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote:

 Hi folks, I just re-wrote a query from using UNION ALL to use with
 rollup and I'm seeing some unexpected behavior. I'll open a JIRA if needed
 but wanted to check if this is user error. Here is my code:

 case class KeyValue(key: Int, value: String)
 val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF

 df.registerTempTable("foo")

 sqlContext.sql("select count(*) as cnt, value as key, GROUPING__ID from foo 
 group by value with rollup").show(100)


 sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID from 
 foo group by key%100 with rollup").show(100)

 ​

 Grouping by value does the right thing, I get one group 0 with the
 overall count. But grouping by expression (key%100) produces weird results
 -- appears that group 1 results are replicated as group 0. Am I doing
 something wrong or is this a bug?




How to debug java.io.OptionalDataException issues

2015-07-06 Thread Yana Kadiyska
Hi folks, suffering from a pretty strange issue:

Is there a way to tell what object is being successfully
serialized/deserialized? I have a maven-installed jar that works well when
fat jarred within another, but shows the following stack when marked as
provided and copied to the runtime classpath...I'm pretty puzzled but can't
find any good way to debug what is causing unhappiness?

15/07/07 00:24:23 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
0, osd04.shaka.rum.tn.akamai.com): java.io.OptionalDataException
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1370)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)

 at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

​


Difference between spark-defaults.conf and SparkConf.set

2015-06-30 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue:

I'm setting
spark.executor.extraClassPath
spark.driver.extraClassPath

to point to some external JARs. If I set them in spark-defaults.conf
everything works perfectly.
However, if I remove spark-defaults.conf and just create a SparkConf and
call
.set("spark.executor.extraClassPath", ...)
.set("spark.driver.extraClassPath", ...)

I get ClassNotFound exceptions from Hadoop Conf:

Caused by: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.ceph.CephFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1585)

​

This seems like a bug to me -- or does spark-defaults.conf somehow get
processed differently?

I have dumped out sparkConf.toDebugString and in both cases
(spark-defaults.conf/in code sets) it seems to have the same values in it...


Re: Debugging Apache Spark clustered application from Eclipse

2015-06-25 Thread Yana Kadiyska
Pass that debug string to your executor like this: --conf
"spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761".
When your executor is launched it will send debug information on
port 7761. When you attach the Eclipse debugger, you need to have the
IP/host of that executor as the remote process to attach to. This is pretty
standard remote debugging procedure .

In the interest of sanity I'd recommend you run with just 1 executor when
you're debugging...makes life easier (of course the easiest thing is if you
can do the debugging in local[*] mode but some bugs don't manifest that way)
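
The same option can also be set programmatically if that's easier to manage than
the command line -- a sketch, reusing the port number from your example:

import org.apache.spark.SparkConf

// attach the JDWP agent to every executor JVM on port 7761
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761")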

On Thu, Jun 25, 2015 at 1:17 AM, nitinkalra2000 nitinkalra2...@gmail.com
wrote:

 I am trying to debug Spark application running on eclipse in
 clustered/distributed environment but not able to succeed. Application is
 java based and I am running it through Eclipse. Configurations to spark for
 Master/worker is provided through Java only.

 Though I can debug the code on driver side but as the code flow moves in
 Spark(i.e call to .map(..)), the debugger doesn't stop. Because that code
 is
 running in Workers JVM.

 Is there anyway I can achieve this ?

 I have tried giving following configurations in Tomcat through Eclipse :
 -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=7761,suspend=n

 and setting respective port in Debug-remote java application.

 But after these settings I get the error: Failed to connect to remote VM.
 Connection Refused

 Note: I have tried this on Windows as well as on Linux(CentOs) environment.

 If anybody has any solution to this, please help.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Debugging-Apache-Spark-clustered-application-from-Eclipse-tp23483.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark stream test throw org.apache.spark.SparkException: Task not serializable when execute in spark shell

2015-06-24 Thread Yana Kadiyska
I can't tell immediately, but you might be able to get more info with the
hint provided here:
http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator
(short version, set -Dsun.io.serialization.extendedDebugInfo=true)

Also, unless you're simplifying your example a lot, you only have 2
regexes, so I'm not quite sure why you want to broadcast them, as opposed
to just having an object that holds them on each executor, or just create
them at the start of mapPartitions (outside of iter.hasNext as shown in
your second snippet). Broadcasting seems overcomplicated, but maybe you
just showed a simplified example...
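
For illustration, a rough sketch of the "create them at the start of
mapPartitions" variant I mentioned (regexes copied from your snippet, untested):

val cdd = lines.filter(testFun).mapPartitions { iter =>
  // build the regexes once per partition, outside the per-line loop
  val p_date = "^\\w+ \\w+ \\d+ \\d{2}:\\d{2}:\\d{2} \\d{4}".r
  val p_ORA  = "^ORA-\\d+.+".r
  var lasttime = ""
  iter.flatMap { line =>
    p_ORA.findFirstIn(line) match {
      case Some(code) => Some(lasttime + " | " + code)
      case None =>
        p_date.findFirstIn(line).foreach(d => lasttime = d)
        None
    }
  }
}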

On Wed, Jun 24, 2015 at 8:41 AM, yuemeng (A) yueme...@huawei.com wrote:

  hi ,all

 there two examples one is throw Task not serializable when execute in
 spark shell,the other one is ok,i am very puzzled,can anyone give what's
 different about this two code and why the other is ok

 1.The one which throw Task not serializable :

 import org.apache.spark._
 import SparkContext._
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.broadcast._



 @transient val ssc = new StreamingContext(sc, Seconds(5))
 val lines = ssc.textFileStream("/a.log")





 val testFun = (line: String) => {
 if ((line.contains(" ERROR")) || (line.startsWith("Spark"))){
 true
 }
 else{
 false
 }
 }

 val p_date_bc = sc.broadcast("^\\w+ \\w+ \\d+ \\d{2}:\\d{2}:\\d{2}
 \\d{4}".r)
 val p_ORA_bc = sc.broadcast("^ORA-\\d+.+".r)
 val A = (iter:
 Iterator[String], data_bc: Broadcast[scala.util.matching.Regex], ORA_bc: Broadcast[scala.util.matching.Regex])
 => {
 val p_date = data_bc.value
 val p_ORA = ORA_bc.value
 var res = List[String]()
 var lasttime = ""

 while (iter.hasNext) {
 val line = iter.next.toString
 val currentcode = p_ORA findFirstIn line getOrElse null
 if (currentcode != null){
 res ::= lasttime + " | " + currentcode
 }else{
 val currentdate = p_date findFirstIn line getOrElse
 null
 if (currentdate != null){
 lasttime = currentdate
 }
 }
 }
 res.iterator
 }

 val cdd = lines.filter(testFun).mapPartitions(x =>
 A(x,p_date_bc,p_ORA_bc))  // org.apache.spark.SparkException: Task not
 serializable



 2.The other one is ok:



 import org.apache.spark._
 import SparkContext._
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.broadcast._



 val ssc = new StreamingContext(sc, Seconds(5))
 val lines = ssc.textFileStream("/a.log")





 val testFun = (line: String) => {
 if ((line.contains(" ERROR")) || (line.startsWith("Spark"))){
 true
 }
 else{
 false
 }
 }



 val A = (iter: Iterator[String]) => {

 var res = List[String]()
 var lasttime = ""
 while (iter.hasNext) {
 val line = iter.next.toString
 val currentcode = "^\\w+ \\w+ \\d+ \\d{2}:\\d{2}:\\d{2}
 \\d{4}".r.findFirstIn(line).getOrElse(null)
 if (currentcode != null){
 res ::= lasttime + " | " + currentcode
 }else{
  val currentdate =
 "^ORA-\\d+.+".r.findFirstIn(line).getOrElse(null)
 if (currentdate != null){
 lasttime = currentdate
 }
 }
 }
 res.iterator
 }



 val cdd= lines.filter(testFun).mapPartitions(A)











Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Yana Kadiyska
Thanks, that did seem to make a difference. I am a bit scared of this
approach as spark itself has a different guava dependency but the error
does go away this way

On Wed, Jun 24, 2015 at 10:04 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you try to add those jars in the SPARK_CLASSPATH and give it a try?

 Thanks
 Best Regards

 On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Hi folks, I have been using Spark against an external Metastore service
 which runs Hive with Cdh 4.6

 In Spark 1.2, I was able to successfully connect by building with the
 following:

 ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0
 -Phive-thriftserver -Phive-0.12.0

 I see that in Spark 1.4 the Hive 0.12.0 profile is deprecated in favor of
 spark.sql.hive.metastore.version/spark.sql.hive.metastore.jars

 When I tried to use this setup spark-shell fails for me with the
 following error:

 15/06/23 18:18:07 INFO hive.HiveContext: Initializing 
 HiveMetastoreConnection version 0.12.0 using [Ljava.net.URL;@7b7a9a6c
 java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
 com/google/common/base/Preconditions when creating Hive client using 
 classpath: file:/hive/lib/guava-11.0.2.jar, 
 file:/hive/lib/hive-exec-0.10.0-cdh4.6.0.jar, 
 file:/hive/lib/hive-metastore-0.10.0-cdh4.6.0.jar, 
 file:/hadoop/share/hadoop/mapreduce1/lib/hadoop-common-2.0.0-cdh4.6.0.jar, 
 file:/hive/lib/commons-logging-1.0.4.jar

 ​

 I don't know why it's not seeing the class -- it's in the guava jar. If
 anyone has had success with 0.12 version please let me know what jars need
 to be on the classpath. I think my Hive version might be too outdated but I
 don't control the metastore and I had success with Spark1.2 so I'm hoping...






Can Spark1.4 work with CDH4.6

2015-06-23 Thread Yana Kadiyska
Hi folks, I have been using Spark against an external Metastore service
which runs Hive with Cdh 4.6

In Spark 1.2, I was able to successfully connect by building with the
following:

./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0
-Phive-thriftserver -Phive-0.12.0

I see that in Spark 1.4 the Hive 0.12.0 profile is deprecated in favor of
spark.sql.hive.metastore.version/spark.sql.hive.metastore.jars

When I tried to use this setup spark-shell fails for me with the following
error:

15/06/23 18:18:07 INFO hive.HiveContext: Initializing
HiveMetastoreConnection version 0.12.0 using [Ljava.net.URL;@7b7a9a6c
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
com/google/common/base/Preconditions when creating Hive client using
classpath: file:/hive/lib/guava-11.0.2.jar,
file:/hive/lib/hive-exec-0.10.0-cdh4.6.0.jar,
file:/hive/lib/hive-metastore-0.10.0-cdh4.6.0.jar,
file:/hadoop/share/hadoop/mapreduce1/lib/hadoop-common-2.0.0-cdh4.6.0.jar,
file:/hive/lib/commons-logging-1.0.4.jar

​

I don't know why it's not seeing the class -- it's in the guava jar. If
anyone has had success with 0.12 version please let me know what jars need
to be on the classpath. I think my Hive version might be too outdated but I
don't control the metastore and I had success with Spark1.2 so I'm hoping...


Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Yana Kadiyska
Can you show some code how you're doing the reads? Have you successfully
read other stuff from Cassandra (i.e. do you have a lot of experience with
this path and this particular table is causing issues or are you trying to
figure out the right way to do a read).

What version of Spark and Cassandra-connector are you using?
Also, what do you get for select count(*) from foo -- is that just as bad?

On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak serega.shey...@gmail.com
wrote:

 Hi, can somebody suggest me the way to reduce quantity of task?

 2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com:

 Hi, I'm running spark sql against Cassandra table. I have 3 C* nodes,
 Each of them has spark worker.
 The problem is that spark runs 869 task to read 3 lines: select bar from
 foo.
 I've tried these properties:

 #try to avoid 769 tasks per dummy select foo from bar qeury
 spark.cassandra.input.split.size_in_mb=32mb
 spark.cassandra.input.fetch.size_in_rows=1000
 spark.cassandra.input.split.size=1

 but it doesn't help.

 Here are  mean metrics for the job :
 input1= 8388608.0 TB
 input2 = -320 B
 input3 = -400 B

 I'm confused with input, there are only 3 rows in C* table.
 Definitely, I don't have 8388608.0 TB of data :)







ClassNotFound exception from closure

2015-06-16 Thread Yana Kadiyska
Hi folks,

running into a pretty strange issue -- I have a ClassNotFound exception
from a closure?! My code looks like this:

 val jRdd1 = table.map(cassRow => {
  val lst = List(cassRow.get[Option[Any]](0),cassRow.get[Option[Any]](1))
  Row.fromSeq(lst)
})
println(s"This one worked ..."+jRdd1.first.toString())

println("SILLY ---")
val sillyRDD=sc.parallelize(1 to 100)
val jRdd2 = sillyRDD.map(value => {
  val cols = (0 to 2).map(i => "foo").toList //3 foos per row
  println(s"Values "+cols.mkString("|"))
  Row.fromSeq(cols)
})
println(s"This one worked too "+jRdd2.first.toString())

​
and the exception I see goes:

This one worked ...[Some(1234),Some(1434123162)]
SILLY ---
Exception in thread main java.lang.ClassNotFoundException:
HardSparkJob$anonfun$3$anonfun$4
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.util.InnerClosureFinder$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
at 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
Source)
at 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
Source)
at 
org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$clean(ClosureCleaner.scala:197)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$anonfun$map$1.apply(RDD.scala:293)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at HardSparkJob$.testUnionViaRDD(SparkTest.scala:127)
at HardSparkJob$.main(SparkTest.scala:104)
at HardSparkJob.main(SparkTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

​

I don't quite know what to make of this error. The stacktrace shows a
problem with my code at sillyRDD.map(SparkTest.scala:127)

I'm running Spark 1.4 CDH prebuilt with

bin/spark-submit --class HardSparkJob --master mesos://$MESOS_MASTER
../MyJar.jar

Any insight much appreciated


Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Yana Kadiyska
When all else fails look at the source ;)

Looks like createJDBCTable is deprecated, but otherwise goes to the same
implementation as insertIntoJDBC...
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

You can also look at DataFrameWriter in the same package...Looks like all
that code will eventually write via JDBCWriteDetails in
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala...if
I'm reading this correctly you'll have simultaneous writes from each
partition but they don't appear to be otherwise batched (if you were
thinking bulk inserts)

On Mon, Jun 15, 2015 at 1:20 PM, Mohammad Tariq donta...@gmail.com wrote:

 Hello list,

 The method *insertIntoJDBC(url: String, table: String, overwrite:
 Boolean)* provided by Spark DataFrame allows us to copy a DataFrame into
 a JDBC DB table. Similar functionality is provided by the 
 *createJDBCTable(url:
 String, table: String, allowExisting: Boolean) *method. But if you look
 at the docs it says that *createJDBCTable *runs a *bunch of Insert
 statements* in order to copy the data. While the docs of *insertIntoJDBC 
 *doesn't
 have any such statement.

 Could someone please shed some light on this? How exactly data gets
 inserted using *insertIntoJDBC *method?

 And if it is same as *createJDBCTable *method, then what exactly does *bunch
 of Insert statements* mean? What's the criteria to decide the number
 *inserts/bunch*? How are these bunches generated?

 *An example* could be creating a DataFrame by reading all the files
 stored in a given directory. If I just do *DataFrame.save()*, it'll
 create the same number of output files as the input files. What'll happen
 in case of *DataFrame.df.insertIntoJDBC()*?

 I'm really sorry to be pest of questions, but I could net get much help by
 Googling about this.

 Thank you so much for your valuable time. really appreciate it.

 Tariq, Mohammad
 about.me/mti




Re: Reopen Jira or New Jira

2015-06-11 Thread Yana Kadiyska
John, I took the liberty of reopening because I have sufficient JIRA
permissions (not sure if you do). It would be good if you can add relevant
comments/investigations there.

On Thu, Jun 11, 2015 at 8:34 AM, John Omernik j...@omernik.com wrote:

 Hey all, from my other post on Spark 1.3.1 issues, I think we found an
 issue related to a previous closed Jira (
 https://issues.apache.org/jira/browse/SPARK-1403)  Basically it looks
 like the threat context class loader is NULL which is causing the NPE in
 MapR and that's similar to posted Jira. New comments have been added to
 that Jira, but I am not sure how to trace back changes to determine why it
 was NULL in 0.9 apparently fixed in 1.0 working in 1.2 and then broken from
 1.2.2 onward.

 Is it possible to open a closed Jira? Should I open another? I think MapR
 is working to handle in their code, but I think someone (with more
 knowledge than I) should probably look into this on Spark as well due it
 appearing to have changed behavior between versions.

 Thoughts?

 John


 Previous Post

 All -

 I am facing an odd issue and I am not really sure where to go for support
 at this point.  I am running MapR which complicates things as it relates to
 Mesos, however this HAS worked in the past with no issues so I am stumped
 here.

 So for starters, here is what I am trying to run. This is a simple show
 tables using the Hive Context:

 from pyspark import SparkContext, SparkConf
 from pyspark.sql import SQLContext, Row, HiveContext
 sparkhc = HiveContext(sc)
 test = sparkhc.sql("show tables")
 for r in test.collect():
   print r

 When I run it on 1.3.1 using ./bin/pyspark --master local  This works with
 no issues.

 When I run it using Mesos with all the settings configured (as they had
 worked in the past) I get lost tasks and when I zoom in them, the error
 that is being reported is below.  Basically it's a NullPointerException on
 the com.mapr.fs.ShimLoader.  What's weird to me is is I took each instance
 and compared both together, the class path, everything is exactly the same.
 Yet running in local mode works, and running in mesos fails.  Also of note,
 when the task is scheduled to run on the same node as when I run locally,
 that fails too! (Baffling).

 Ok, for comparison, how I configured Mesos was to download the mapr4
 package from spark.apache.org.  Using the exact same configuration file
 (except for changing the executor tgz from 1.2.0 to 1.3.1) from the 1.2.0.
 When I run this example with the mapr4 for 1.2.0 there is no issue in
 Mesos, everything runs as intended. Using the same package for 1.3.1 then
 it fails.

 (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
 well).

 So basically, when I used 1.2.0 and followed a set of steps, it worked on
 Mesos, and with 1.3.1 it fails.  Since this is a current version of Spark, MapR
 supports 1.2.1 only.  (Still working on that).

 I guess I am at a loss right now on why this would be happening, any
 pointers on where I could look or what I could tweak would be greatly
 appreciated. Additionally, if there is something I could specifically draw
 to the attention of MapR on this problem please let me know, I am perplexed
 on the change from 1.2.0 to 1.3.1.

 Thank you,

 John




 Full Error on 1.3.1 on Mesos:
 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
 1060.3 MB java.lang.NullPointerException at
 com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
 com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
 com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
 org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
 at java.lang.Class.forName0(Native Method) at
 java.lang.Class.forName(Class.java:274) at
 org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
 at
 org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
 at
 org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
 at
 org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
 at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
 org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
 org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
 at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
 org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220) at
 org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) at
 org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959) at
 org.apache.spark.storage.BlockManager.(BlockManager.scala:104) at
 org.apache.spark.storage.BlockManager.(BlockManager.scala:179) at
 org.apache.spark.SparkEnv$.create(SparkEnv.scala:310) at
 org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:186) at
 

Re: Cassandra Submit

2015-06-10 Thread Yana Kadiyska
Do you build via maven or sbt? How do you submit your application -- do you
use local, standalone or mesos/yarn? Your jars as you originally listed
them seem right to me. Try this, from your ${SPARK_HOME}:

SPARK_CLASSPATH=spark-cassandra-connector_2.10-1.3.0-M1.jar:guava-jdk5-14.0.1.jar:cassandra-driver-core-2.1.5.jar:cassandra-thrift-2.1.3.jar:joda-time-2.3.jar
bin/spark-shell  --conf spark.cassandra.connection.host=127.0.0.1

​

where you'd have to provide the correct paths to the jars you're using.
This will drop you in a spark-shell

import com.datastax.spark.connector._

val test = sc.cassandraTable("your_keyspace", "your_columnfamily")

test.first



I would first try to get this running in local mode, and if all works well
start looking at the jar you're distributing via spark-submit and the
classpaths of your executors (this collection of jars does work for me, by
the way, so the Cassandra jars shown above definitely work with Spark 1.3.1).

On Wed, Jun 10, 2015 at 2:53 AM, Yasemin Kaya godo...@gmail.com wrote:

 It is really hell. How can I know which jars match? Which version of
 assembly fits me?

 2015-06-10 0:59 GMT+03:00 Mohammed Guller moham...@glassbeam.com:

  Looks like the real culprit is a library version mismatch:



 Caused by: java.lang.NoSuchMethodError:
 org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(Ljava/lang/String;I)Lorg/apache/thrift/transport/TTransport;

  at
 com.datastax.spark.connector.cql.DefaultConnectionFactory$.createThriftClient(CassandraConnectionFactory.scala:41)

  at
 com.datastax.spark.connector.cql.CassandraConnector.createThriftClient(CassandraConnector.scala:134)

  ... 28 more



 The Spark Cassandra Connector is trying to use a method which does not
 exist. That means your assembly jar has the wrong version of the library
 that SCC is trying to use. Welcome to jar hell!



 Mohammed



 *From:* Yasemin Kaya [mailto:godo...@gmail.com]
 *Sent:* Tuesday, June 9, 2015 12:24 PM
 *To:* Mohammed Guller
 *Cc:* Yana Kadiyska; Gerard Maas; user@spark.apache.org
 *Subject:* Re: Cassandra Submit



 My code https://gist.github.com/yaseminn/d77dd9baa6c3c43c7594 and
 exception https://gist.github.com/yaseminn/fdd6e5a6efa26219b4d3.



 and

 ~/cassandra/apache-cassandra-2.1.5$ *bin/cqlsh*

 Connected to Test Cluster at 127.0.0.1:9042.

 [cqlsh 5.0.1 | Cassandra 2.1.5 | CQL spec 3.2.0 | Native protocol v3]

 Use HELP for help.

 cqlsh use test;

 cqlsh:test select * from people;



 * id | name*

 *+-*

 *  5 |   eslem*

 *  1 | yasemin*

 *  8 | ali*

 *  2 |   busra*

 *  4 |   ilham*

 *  7 |   kubra*

 *  6 |tuba*

 *  9 |aslı*

 *  3 |  Andrew*



 (9 rows)

 cqlsh:test



 *bin/cassandra-cli -h 127.0.0.1 -p 9160*

 Connected to: Test Cluster on 127.0.0.1/9160

 Welcome to Cassandra CLI version 2.1.5



 The CLI is deprecated and will be removed in Cassandra 3.0.  Consider
 migrating to cqlsh.

 CQL is fully backwards compatible with Thrift data; see
 http://www.datastax.com/dev/blog/thrift-to-cql3



 Type 'help;' or '?' for help.

 Type 'quit;' or 'exit;' to quit.



 [default@unknown]





 yasemin



 2015-06-09 22:03 GMT+03:00 Mohammed Guller moham...@glassbeam.com:

 It is strange that writes works but read does not. If it was a Cassandra
 connectivity issue, then neither write or read would work. Perhaps the
 problem is somewhere else.



 Can you send the complete exception trace?



 Also, just to make sure that there is no DNS issue, try this:

 ~/cassandra/apache-cassandra-2.1.5$ bin/cassandra-cli -h 127.0.0.1 -p 9160



 Mohammed



 *From:* Yasemin Kaya [mailto:godo...@gmail.com]
 *Sent:* Tuesday, June 9, 2015 11:32 AM
 *To:* Yana Kadiyska
 *Cc:* Gerard Maas; Mohammed Guller; user@spark.apache.org
 *Subject:* Re: Cassandra Submit



 I removed core and streaming jar. And the exception still same.



 I tried what you said then results:



 ~/cassandra/apache-cassandra-2.1.5$ bin/cassandra-cli -h localhost -p 9160

 Connected to: Test Cluster on localhost/9160

 Welcome to Cassandra CLI version 2.1.5



 The CLI is deprecated and will be removed in Cassandra 3.0.  Consider
 migrating to cqlsh.

 CQL is fully backwards compatible with Thrift data; see
 http://www.datastax.com/dev/blog/thrift-to-cql3



 Type 'help;' or '?' for help.

 Type 'quit;' or 'exit;' to quit.



 [default@unknown]



 and



 ~/cassandra/apache-cassandra-2.1.5$ bin/cqlsh

 Connected to Test Cluster at 127.0.0.1:9042.

 [cqlsh 5.0.1 | Cassandra 2.1.5 | CQL spec 3.2.0 | Native protocol v3]

 Use HELP for help.

 cqlsh



 Thank you for your kind responses ...





 2015-06-09 20:59 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 Hm, jars look ok, although it's a bit of a mess -- you have
 spark-assembly 1.3.0 but then core and streaming 1.3.1...It's generally a
 bad idea to mix versions. Spark-assembly bundles all spark packages, so
 either do them separately or use spark-assembly but don't mix like you've
 shown

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
hm. Yeah, your port is good...have you seen this thread:
http://stackoverflow.com/questions/27288380/fail-to-use-spark-cassandra-connector
? It seems that you might be running into version mis-match issues?

What versions of Spark/Cassandra-connector are you trying to use?

On Tue, Jun 9, 2015 at 10:18 AM, Yasemin Kaya godo...@gmail.com wrote:

 Sorry, to answer your question, I ran lsof -i:9160 in the terminal; the result is

 lsof -i:9160
 COMMAND  PIDUSER   FD   TYPE DEVICE SIZE/OFF NODE NAME
 java7597 inosens  101u  IPv4  85754  0t0  TCP localhost:9160
 (LISTEN)

 so 9160 port is available or not ?

 2015-06-09 17:16 GMT+03:00 Yasemin Kaya godo...@gmail.com:

 Yes my cassandra is listening on 9160 I think. Actually I know from yaml
 file. The file includes :

 rpc_address: localhost
 # port for Thrift to listen for clients on
 rpc_port: 9160

 I check the port nc -z localhost 9160; echo $? it returns me 0. I
 think it close, should I open this port ?

 2015-06-09 16:55 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 Is your cassandra installation actually listening on 9160?

 lsof -i :9160COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
 java29232 ykadiysk   69u  IPv4 42152497  0t0  TCP localhost:9160 
 (LISTEN)

 ​
 I am running an out-of-the box cassandra conf where

 rpc_address: localhost
 # port for Thrift to listen for clients on
 rpc_port: 9160



 On Tue, Jun 9, 2015 at 7:36 AM, Yasemin Kaya godo...@gmail.com wrote:

 I couldn't find any solution. I can write but I can't read from
 Cassandra.

 2015-06-09 8:52 GMT+03:00 Yasemin Kaya godo...@gmail.com:

 Thanks alot Mohammed, Gerard and Yana.
 I can write to the table, but I get an exception back. It says *Exception
 in thread main java.io.IOException: Failed to open thrift connection to
 Cassandra at 127.0.0.1:9160 http://127.0.0.1:9160*

 In yaml file :
 rpc_address: localhost
 rpc_port: 9160

 And at project :

 .set("spark.cassandra.connection.host", "127.0.0.1")
 .set("spark.cassandra.connection.rpc.port", "9160");

 or

 .set("spark.cassandra.connection.host", "localhost")
 .set("spark.cassandra.connection.rpc.port", "9160");

 whatever I write setting,  I get same exception. Any help ??


 2015-06-08 18:23 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 yes, whatever you put for listen_address in cassandra.yaml. Also, you
 should try to connect to your cassandra cluster via bin/cqlsh to make sure
 you have connectivity before you try to make a connection via spark.

 On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com
 wrote:

 Hi,
 I run my project on local. How can find ip address of my cassandra
 host ? From cassandra.yaml or ??

 yasemin

 2015-06-08 11:27 GMT+03:00 Gerard Maas gerard.m...@gmail.com:

 ? = ip address of your cassandra host

 On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com
 wrote:

 Hi ,

 How can I find spark.cassandra.connection.host? And what should I
 change ? Should I change cassandra.yaml ?

 Error says me *Exception in thread main java.io.IOException:
 Failed to open native connection to Cassandra at {127.0.1.1}:9042*

 What should I add? *SparkConf sparkConf = new
 SparkConf().setAppName("JavaApiDemo").set("spark.driver.allowMultipleContexts",
 "true").set("spark.cassandra.connection.host", "?");*

 Best
 yasemin

 2015-06-06 3:04 GMT+03:00 Mohammed Guller moham...@glassbeam.com
 :

  Check your spark.cassandra.connection.host setting. It should
 be pointing to one of your Cassandra nodes.



 Mohammed



 *From:* Yasemin Kaya [mailto:godo...@gmail.com]
 *Sent:* Friday, June 5, 2015 7:31 AM
 *To:* user@spark.apache.org
 *Subject:* Cassandra Submit



 Hi,



 I am using cassandraDB in my project. I had that error *Exception
 in thread main java.io.IOException: Failed to open native 
 connection to
 Cassandra at {127.0.1.1}:9042*



 I think I have to modify the submit line. What should I add or
 remove when I submit my project?



 Best,

 yasemin





 --

 hiç ender hiç




 --
 hiç ender hiç





 --
 hiç ender hiç





 --
 hiç ender hiç




 --
 hiç ender hiç





 --
 hiç ender hiç




 --
 hiç ender hiç



Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
Hm, jars look ok, although it's a bit of a mess -- you have spark-assembly
1.3.0 but then core and streaming 1.3.1...It's generally a bad idea to mix
versions. Spark-assembly bundles all spark packages, so either do them
separately or use spark-assembly but don't mix like you've shown.

As to the port issue -- what about this:

$bin/cassandra-cli -h localhost -p 9160
Connected to: Test Cluster on localhost/9160
Welcome to Cassandra CLI version 2.1.5


On Tue, Jun 9, 2015 at 1:29 PM, Yasemin Kaya godo...@gmail.com wrote:

 My jar files are:

 cassandra-driver-core-2.1.5.jar
 cassandra-thrift-2.1.3.jar
 guava-18.jar
 jsr166e-1.1.0.jar
 spark-assembly-1.3.0.jar
 spark-cassandra-connector_2.10-1.3.0-M1.jar
 spark-cassandra-connector-java_2.10-1.3.0-M1.jar
 spark-core_2.10-1.3.1.jar
 spark-streaming_2.10-1.3.1.jar

 And my code from datastax spark-cassandra-connector
 https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-demos/simple-demos/src/main/java/com/datastax/spark/connector/demo/JavaApiDemo.java
 .

 Thanx alot.
 yasemin

 2015-06-09 18:58 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 hm. Yeah, your port is good...have you seen this thread:
 http://stackoverflow.com/questions/27288380/fail-to-use-spark-cassandra-connector
 ? It seems that you might be running into version mis-match issues?

 What versions of Spark/Cassandra-connector are you trying to use?

 On Tue, Jun 9, 2015 at 10:18 AM, Yasemin Kaya godo...@gmail.com wrote:

 Sorry, to answer your question, I ran lsof -i:9160 in the terminal; the result is

 lsof -i:9160
 COMMAND  PIDUSER   FD   TYPE DEVICE SIZE/OFF NODE NAME
 java7597 inosens  101u  IPv4  85754  0t0  TCP localhost:9160
 (LISTEN)

 so 9160 port is available or not ?

 2015-06-09 17:16 GMT+03:00 Yasemin Kaya godo...@gmail.com:

 Yes my cassandra is listening on 9160 I think. Actually I know from
 yaml file. The file includes :

 rpc_address: localhost
 # port for Thrift to listen for clients on
 rpc_port: 9160

 I check the port nc -z localhost 9160; echo $? it returns me 0. I
 think it close, should I open this port ?

 2015-06-09 16:55 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 Is your cassandra installation actually listening on 9160?

 lsof -i :9160COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE 
 NAME
 java29232 ykadiysk   69u  IPv4 42152497  0t0  TCP localhost:9160 
 (LISTEN)

 ​
 I am running an out-of-the box cassandra conf where

 rpc_address: localhost
 # port for Thrift to listen for clients on
 rpc_port: 9160



 On Tue, Jun 9, 2015 at 7:36 AM, Yasemin Kaya godo...@gmail.com
 wrote:

 I couldn't find any solution. I can write but I can't read from
 Cassandra.

 2015-06-09 8:52 GMT+03:00 Yasemin Kaya godo...@gmail.com:

 Thanks alot Mohammed, Gerard and Yana.
 I can write to the table, but I get an exception back. It says *Exception
 in thread main java.io.IOException: Failed to open thrift connection 
 to
 Cassandra at 127.0.0.1:9160 http://127.0.0.1:9160*

 In yaml file :
 rpc_address: localhost
 rpc_port: 9160

 And at project :

 .set("spark.cassandra.connection.host", "127.0.0.1")
 .set("spark.cassandra.connection.rpc.port", "9160");

 or

 .set("spark.cassandra.connection.host", "localhost")
 .set("spark.cassandra.connection.rpc.port", "9160");

 whatever I write setting,  I get same exception. Any help ??


 2015-06-08 18:23 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 yes, whatever you put for listen_address in cassandra.yaml. Also,
 you should try to connect to your cassandra cluster via bin/cqlsh to make
 sure you have connectivity before you try to make a connection via
 spark.

 On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com
 wrote:

 Hi,
 I run my project on local. How can find ip address of my
 cassandra host ? From cassandra.yaml or ??

 yasemin

 2015-06-08 11:27 GMT+03:00 Gerard Maas gerard.m...@gmail.com:

 ? = ip address of your cassandra host

 On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com
 wrote:

 Hi ,

 How can I find spark.cassandra.connection.host? And what should
 I change ? Should I change cassandra.yaml ?

 Error says me *Exception in thread main java.io.IOException:
 Failed to open native connection to Cassandra at {127.0.1.1}:9042*

 What should I add? *SparkConf sparkConf = new
 SparkConf().setAppName("JavaApiDemo").set("spark.driver.allowMultipleContexts",
 "true").set("spark.cassandra.connection.host", "?");*

 Best
 yasemin

 2015-06-06 3:04 GMT+03:00 Mohammed Guller 
 moham...@glassbeam.com:

  Check your spark.cassandra.connection.host setting. It should
 be pointing to one of your Cassandra nodes.



 Mohammed



 *From:* Yasemin Kaya [mailto:godo...@gmail.com]
 *Sent:* Friday, June 5, 2015 7:31 AM
 *To:* user@spark.apache.org
 *Subject:* Cassandra Submit



 Hi,



 I am using cassandraDB in my project. I had that error *Exception
 in thread main java.io.IOException: Failed to open native 
 connection to
 Cassandra at {127.0.1.1}:9042*



 I think I have

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
Is your cassandra installation actually listening on 9160?

lsof -i :9160COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java29232 ykadiysk   69u  IPv4 42152497  0t0  TCP
localhost:9160 (LISTEN)

​
I am running an out-of-the box cassandra conf where

rpc_address: localhost
# port for Thrift to listen for clients on
rpc_port: 9160



On Tue, Jun 9, 2015 at 7:36 AM, Yasemin Kaya godo...@gmail.com wrote:

 I couldn't find any solution. I can write but I can't read from Cassandra.

 2015-06-09 8:52 GMT+03:00 Yasemin Kaya godo...@gmail.com:

 Thanks alot Mohammed, Gerard and Yana.
 I can write to the table, but I get an exception back. It says *Exception in
 thread main java.io.IOException: Failed to open thrift connection to
 Cassandra at 127.0.0.1:9160 http://127.0.0.1:9160*

 In yaml file :
 rpc_address: localhost
 rpc_port: 9160

 And at project :

 .set("spark.cassandra.connection.host", "127.0.0.1")
 .set("spark.cassandra.connection.rpc.port", "9160");

 or

 .set("spark.cassandra.connection.host", "localhost")
 .set("spark.cassandra.connection.rpc.port", "9160");

 whatever I write setting,  I get same exception. Any help ??


 2015-06-08 18:23 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com:

 yes, whatever you put for listen_address in cassandra.yaml. Also, you
 should try to connect to your cassandra cluster via bin/cqlsh to make sure
 you have connectivity before you try to make a connection via spark.

 On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com wrote:

 Hi,
 I run my project on local. How can find ip address of my cassandra
 host ? From cassandra.yaml or ??

 yasemin

 2015-06-08 11:27 GMT+03:00 Gerard Maas gerard.m...@gmail.com:

 ? = ip address of your cassandra host

 On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com
 wrote:

 Hi ,

 How can I find spark.cassandra.connection.host? And what should I
 change ? Should I change cassandra.yaml ?

 Error says me *Exception in thread main java.io.IOException:
 Failed to open native connection to Cassandra at {127.0.1.1}:9042*

 What should I add? *SparkConf sparkConf = new
 SparkConf().setAppName("JavaApiDemo").set("spark.driver.allowMultipleContexts",
 "true").set("spark.cassandra.connection.host", "?");*

 Best
 yasemin

 2015-06-06 3:04 GMT+03:00 Mohammed Guller moham...@glassbeam.com:

  Check your spark.cassandra.connection.host setting. It should be
 pointing to one of your Cassandra nodes.



 Mohammed



 *From:* Yasemin Kaya [mailto:godo...@gmail.com]
 *Sent:* Friday, June 5, 2015 7:31 AM
 *To:* user@spark.apache.org
 *Subject:* Cassandra Submit



 Hi,



 I am using cassandraDB in my project. I had that error *Exception
 in thread main java.io.IOException: Failed to open native connection 
 to
 Cassandra at {127.0.1.1}:9042*



 I think I have to modify the submit line. What should I add or
 remove when I submit my project?



 Best,

 yasemin





 --

 hiç ender hiç




 --
 hiç ender hiç





 --
 hiç ender hiç





 --
 hiç ender hiç




 --
 hiç ender hiç



Re: Cassandra Submit

2015-06-08 Thread Yana Kadiyska
yes, whatever you put for listen_address in cassandra.yaml. Also, you
should try to connect to your cassandra cluster via bin/cqlsh to make sure
you have connectivity before you try to make a connection via spark.

On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com wrote:

 Hi,
 I run my project on local. How can find ip address of my cassandra host ?
 From cassandra.yaml or ??

 yasemin

 2015-06-08 11:27 GMT+03:00 Gerard Maas gerard.m...@gmail.com:

 ? = ip address of your cassandra host

 On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com wrote:

 Hi ,

 How can I find spark.cassandra.connection.host? And what should I change
 ? Should I change cassandra.yaml ?

 Error says me *Exception in thread main java.io.IOException: Failed
 to open native connection to Cassandra at {127.0.1.1}:9042*

 What should I add? *SparkConf sparkConf = new
 SparkConf().setAppName("JavaApiDemo").set("spark.driver.allowMultipleContexts",
 "true").set("spark.cassandra.connection.host", "?");*

 Best
 yasemin

 2015-06-06 3:04 GMT+03:00 Mohammed Guller moham...@glassbeam.com:

  Check your spark.cassandra.connection.host setting. It should be
 pointing to one of your Cassandra nodes.



 Mohammed



 *From:* Yasemin Kaya [mailto:godo...@gmail.com]
 *Sent:* Friday, June 5, 2015 7:31 AM
 *To:* user@spark.apache.org
 *Subject:* Cassandra Submit



 Hi,



 I am using cassandraDB in my project. I had that error *Exception in
 thread main java.io.IOException: Failed to open native connection to
 Cassandra at {127.0.1.1}:9042*



 I think I have to modify the submit line. What should I add or remove
 when I submit my project?



 Best,

 yasemin





 --

 hiç ender hiç




 --
 hiç ender hiç





 --
 hiç ender hiç



Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
Can you run using spark-submit? What is happening is that you are running a
simple java program -- you've wrapped spark-core in your fat jar but at
runtime you likely need the whole Spark system in order to run your
application. I would mark the spark-core dependency as provided (so you don't wrap it
in your fat jar) and run via spark-submit. If you insist on running via
java for whatever reason, see the runtime classpath that spark-submit sets up
and make sure you include all of those jars when you run your app.
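
For reference, a minimal sketch of the two pieces -- the provided scope on the
spark-core dependency (coordinates taken from the pom quoted below) and the
spark-submit invocation (the master URL here is just an example):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0-cdh5.2.5</version>
      <scope>provided</scope>
    </dependency>

    bin/spark-submit --master local[4] \
      --class mgm.tp.bigdata.ma_spark.SparkMain \
      ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar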

On Tue, Jun 2, 2015 at 9:57 AM, Pa Rö paul.roewer1...@googlemail.com
wrote:

 okay, but how can I compile my app to run this without
 -Dconfig.file=alt_reference1.conf?

 2015-06-02 15:43 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com:

 This looks like your app is not finding your Typesafe config. The config
 should usually be placed in a particular folder under your app to be seen
 correctly. If it's in a non-standard location you can
 pass -Dconfig.file=alt_reference1.conf to java to tell it where to look.
 If this is a config that belongs to Spark and not your app, I'd recommend
 running your jar via spark-submit (that should run) and dump out the
 classpath/variables that spark-submit sets up...

 On Tue, Jun 2, 2015 at 6:58 AM, Pa Rö paul.roewer1...@googlemail.com
 wrote:

 hello community,

 i have build a jar file from my spark app with maven (mvn clean compile
 assembly:single) and the following pom file:

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>mgm.tp.bigdata</groupId>
   <artifactId>ma-spark</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <packaging>jar</packaging>

   <name>ma-spark</name>
   <url>http://maven.apache.org</url>

   <properties>
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   </properties>

   <repositories>
     <repository>
       <id>cloudera</id>
       <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
     </repository>
   </repositories>

   <dependencies>
     <dependency>
       <groupId>junit</groupId>
       <artifactId>junit</artifactId>
       <version>3.8.1</version>
       <scope>test</scope>
     </dependency>
     <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.10</artifactId>
       <version>1.1.0-cdh5.2.5</version>
     </dependency>
     <dependency>
       <groupId>mgm.tp.bigdata</groupId>
       <artifactId>ma-commons</artifactId>
       <version>0.0.1-SNAPSHOT</version>
     </dependency>
   </dependencies>

   <build>
     <plugins>
       <plugin>
         <artifactId>maven-assembly-plugin</artifactId>
         <configuration>
           <archive>
             <manifest>
               <mainClass>mgm.tp.bigdata.ma_spark.SparkMain</mainClass>
             </manifest>
           </archive>
           <descriptorRefs>
             <descriptorRef>jar-with-dependencies</descriptorRef>
           </descriptorRefs>
         </configuration>
       </plugin>
     </plugins>
   </build>
 </project>

 if i run my app with  java -jar
 ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar on terminal, i get the
 following error message:

 proewer@proewer-VirtualBox:~/Schreibtisch$ java -jar
 ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 2015-Jun-02 12:53:36,348 [main] org.apache.spark.util.Utils
  WARN  - Your hostname, proewer-VirtualBox resolves to a loopback
 address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
 2015-Jun-02 12:53:36,350 [main] org.apache.spark.util.Utils
  WARN  - Set SPARK_LOCAL_IP if you need to bind to another address
 2015-Jun-02 12:53:36,401 [main] org.apache.spark.SecurityManager
  INFO  - Changing view acls to: proewer
 2015-Jun-02 12:53:36,402 [main] org.apache.spark.SecurityManager
  INFO  - Changing modify acls to: proewer
 2015-Jun-02 12:53:36,403 [main] org.apache.spark.SecurityManager
  INFO  - SecurityManager: authentication disabled; ui acls disabled;
 users with view permissions: Set(proewer); users with modify permissions:
 Set(proewer)
 Exception in thread main com.typesafe.config.ConfigException$Missing:
 No configuration setting found for key 'akka.version'
 at
 com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
 at
 com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:197)
 at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:136)
 at akka.actor.ActorSystemImpl.init(ActorSystem.scala:470)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
 at
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
This looks like your app is not finding your Typesafe config. The config
should usually be placed in a particular folder under your app to be seen
correctly. If it's in a non-standard location you can
pass -Dconfig.file=alt_reference1.conf to java to tell it where to look.
If this is a config that belongs to Spark and not your app, I'd recommend
running your jar via spark-submit (that should run) and dump out the
classpath/variables that spark-submit sets up...
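
Concretely, a sketch of both options (the jar and main class names are the ones
from the mail quoted below; adjust the config path to wherever yours actually lives):

    java -Dconfig.file=alt_reference1.conf -jar ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar

    # or, when going through spark-submit, pass it as a driver JVM option
    bin/spark-submit --driver-java-options "-Dconfig.file=alt_reference1.conf" \
      --class mgm.tp.bigdata.ma_spark.SparkMain \
      ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar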

On Tue, Jun 2, 2015 at 6:58 AM, Pa Rö paul.roewer1...@googlemail.com
wrote:

 hello community,

 i have build a jar file from my spark app with maven (mvn clean compile
 assembly:single) and the following pom file:

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>mgm.tp.bigdata</groupId>
   <artifactId>ma-spark</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <packaging>jar</packaging>

   <name>ma-spark</name>
   <url>http://maven.apache.org</url>

   <properties>
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   </properties>

   <repositories>
     <repository>
       <id>cloudera</id>
       <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
     </repository>
   </repositories>

   <dependencies>
     <dependency>
       <groupId>junit</groupId>
       <artifactId>junit</artifactId>
       <version>3.8.1</version>
       <scope>test</scope>
     </dependency>
     <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.10</artifactId>
       <version>1.1.0-cdh5.2.5</version>
     </dependency>
     <dependency>
       <groupId>mgm.tp.bigdata</groupId>
       <artifactId>ma-commons</artifactId>
       <version>0.0.1-SNAPSHOT</version>
     </dependency>
   </dependencies>

   <build>
     <plugins>
       <plugin>
         <artifactId>maven-assembly-plugin</artifactId>
         <configuration>
           <archive>
             <manifest>
               <mainClass>mgm.tp.bigdata.ma_spark.SparkMain</mainClass>
             </manifest>
           </archive>
           <descriptorRefs>
             <descriptorRef>jar-with-dependencies</descriptorRef>
           </descriptorRefs>
         </configuration>
       </plugin>
     </plugins>
   </build>
 </project>

 if i run my app with  java -jar
 ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar on terminal, i get the
 following error message:

 proewer@proewer-VirtualBox:~/Schreibtisch$ java -jar
 ma-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 2015-Jun-02 12:53:36,348 [main] org.apache.spark.util.Utils
  WARN  - Your hostname, proewer-VirtualBox resolves to a loopback address:
 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
 2015-Jun-02 12:53:36,350 [main] org.apache.spark.util.Utils
  WARN  - Set SPARK_LOCAL_IP if you need to bind to another address
 2015-Jun-02 12:53:36,401 [main] org.apache.spark.SecurityManager
  INFO  - Changing view acls to: proewer
 2015-Jun-02 12:53:36,402 [main] org.apache.spark.SecurityManager
  INFO  - Changing modify acls to: proewer
 2015-Jun-02 12:53:36,403 [main] org.apache.spark.SecurityManager
  INFO  - SecurityManager: authentication disabled; ui acls disabled; users
 with view permissions: Set(proewer); users with modify permissions:
 Set(proewer)
 Exception in thread main com.typesafe.config.ConfigException$Missing: No
 configuration setting found for key 'akka.version'
 at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:142)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
 at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
 at
 com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:197)
 at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:136)
 at akka.actor.ActorSystemImpl.init(ActorSystem.scala:470)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
 at
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
 at
 org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1454)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1450)
 at
 org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:156)
 at org.apache.spark.SparkContext.init(SparkContext.scala:203)
 at
 org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53)
 at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:38)

 what i do 

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this...sqlContext should be a HiveContext instance

case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key, 0.5) from table").show()
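
On a big table the approximate variant should also work from a HiveContext
(same temp table as above):

sqlContext.sql("select percentile_approx(key, 0.5) from table").show()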

​

On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi everyone,
 Is there any way to compute a median on a column using Spark's Dataframe.
 I know you can use stats in a RDD but I'd rather stay within a dataframe.
 Hive seems to imply that using ntile one can compute percentiles,
 quartiles and therefore a median.
 Does anyone have experience with this ?

 Regards,

 Olivier.



Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Yana Kadiyska
are you able to connect to your cassandra installation via

cassandra_home$ ./bin/cqlsh

This exception generally means that your cassandra instance is not
reachable/accessible
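
A quick way to rule out basic connectivity from the Spark box -- 9042 being the
native port your conf points at -- is something like:

nc -z 127.0.0.1 9042; echo $?   # 0 means something is listening on the native port
./bin/cqlsh 127.0.0.1           # from your cassandra install dir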

On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco antogia...@gmail.com
wrote:

 Hi all,
 I have in a single server installed spark 1.3.1 and cassandra 2.0.14
 I'm coding a simple java class for Spark Streaming as follow:

- reading header events from flume sink
- based on header I write the event body on navigation or transaction
table (cassandra)

 unfortunately I get a NoHostAvailableException; if I comment out the code for
 saving to one of the two tables, everything works


 *here the code*

  public static void main(String[] args) {

      // Create a local StreamingContext with two working threads and a batch interval of 1 second
      SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("DWXNavigationApp");

      conf.set("spark.cassandra.connection.host", "127.0.0.1");
      conf.set("spark.cassandra.connection.native.port", "9042");
      conf.set("spark.cassandra.output.batch.size.rows", "1");
      conf.set("spark.cassandra.output.concurrent.writes", "1");

      final JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

      JavaReceiverInputDStream<SparkFlumeEvent> flumeStreamNavig =
          FlumeUtils.createPollingStream(jssc, "127.0.0.1", );

      JavaDStream<String> logRowsNavig = flumeStreamNavig.map(
          new Function<SparkFlumeEvent, String>() {
              @Override
              public String call(SparkFlumeEvent arg0) throws Exception {
                  Map<CharSequence, CharSequence> headers = arg0.event().getHeaders();
                  ByteBuffer bytePayload = arg0.event().getBody();
                  String s = headers.get("source_log").toString() + "#" + new String(bytePayload.array());
                  System.out.println("RIGA: " + s);
                  return s;
              }
          });

      logRowsNavig.foreachRDD(
          new Function<JavaRDD<String>, Void>() {
              @Override
              public Void call(JavaRDD<String> rows) throws Exception {

                  if (!rows.isEmpty()) {

                      // String header = getHeaderFronRow(rows.collect());

                      List<Navigation> listNavigation = new ArrayList<Navigation>();
                      List<Transaction> listTransaction = new ArrayList<Transaction>();

                      for (String row : rows.collect()) {

                          String header = row.substring(0, row.indexOf("#"));

                          if (header.contains("controller_log")) {
                              listNavigation.add(createNavigation(row));
                              System.out.println("Added Element in Navigation List");
                          } else if (header.contains("business_log")) {
                              listTransaction.add(createTransaction(row));
                              System.out.println("Added Element in Transaction List");
                          }
                      }

                      if (!listNavigation.isEmpty()) {
                          JavaRDD<Navigation> navigationRows =
                              jssc.sparkContext().parallelize(listNavigation);

                          javaFunctions(navigationRows).writerBuilder("cassandrasink", "navigation",
                              mapToRow(Navigation.class)).saveToCassandra();
                      }

                      if (!listTransaction.isEmpty()) {
                          JavaRDD<Transaction> transactionRows =
                              jssc.sparkContext().parallelize(listTransaction);

                          javaFunctions(transactionRows).writerBuilder("cassandrasink", "transaction",
                              mapToRow(Transaction.class)).saveToCassandra();
                      }
                  }
                  return null;
              }
          });

      jssc.start();             // Start the computation
      jssc.awaitTermination();  // Wait for the computation to terminate
  }


 *here the exception*


 15/05/29 11:19:29 ERROR QueryExecutor: Failed to execute:

 com.datastax.spark.connector.writer.RichBatchStatement@ab76b83

 com.datastax.driver.core.exceptions.NoHostAvailableException: All

 host(s) tried for query failed (no host was tried)

  at


 com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:107)

  at

 com.datastax.driver.core.SessionManager.execute(SessionManager.java:538)

  at


 com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:577)

  at


 com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:119)

  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

  at


 

Re: hive external metastore connection timeout

2015-05-27 Thread Yana Kadiyska
I have not run into this particular issue but I'm not using latest bits in
production. However, testing your theory should be easy -- MySQL is just a
database, so you should be able to use a regular mysql client and see how
many connections are active. You can then compare to the maximum allowed
connections, and/or up the maximum allowed and see if you still hit the
ceiling. Hopefully someone will have a more direct answer for you...

also, not sure what query you're executing to reconnect, but try something
inexpensive like show tables -- I had connection timeout issues with the
actual query when pulling data...that was helped by setting
hive.metastore.client.socket.timeout higher...
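
Concretely -- from any mysql client, to test the connection-pool theory
(the statements are standard MySQL; the timeout value below is just a placeholder):

SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';

and in hive-site.xml, to raise the client socket timeout:

<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>300</value>
</property>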

On Wed, May 27, 2015 at 11:03 AM, jamborta jambo...@gmail.com wrote:

 Hi all,

 I am setting up a spark standalone server with an external hive metastore
 (using mysql), there is an issue that after 5 minutes inactivity, if I try
 to reconnect to the metastore (i.e. by executing a new query), it hangs for
 about 10 mins then times out. My guess is that datanucleus does not close
 the existing connections from the pool, but still tries to create new ones
 for some reason.

 Tried different type of connection pools, didn't help either.

 thanks,





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/hive-external-metastore-connection-timeout-tp23052.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Yana Kadiyska
:7077/user/Master...
 15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to
 akka.tcp://sparkMaster@mellyrn.local:7077:
 akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@mellyrn.local:7077
 15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
 15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been
 killed. Reason: All masters are unresponsive! Giving up.
 15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not
 initialized yet.
 1


 Even when successful, the time for the Master to come up has a
 surprisingly high variance. I am running on a single machine for which
 there is plenty of RAM. Note that was one problem before the present series
 :  if RAM is tight then the failure modes can be unpredictable. But now the
 RAM is not an issue: plenty available for both Master and Worker.

 Within the same hour period and starting/stopping maybe a dozen times,
 the startup time for the Master may be a few seconds up to  a couple to
 several minutes.

 2015-05-20 7:39 GMT-07:00 Yana Kadiyska yana.kadiy...@gmail.com:

 But if I'm reading his email correctly he's saying that:

 1. The master and slave are on the same box (so network hiccups are
 unlikely culprit)
 2. The failures are intermittent -- i.e program works for a while then
 worker gets disassociated...

 Is it possible that the master restarted? We used to have problems like
 this where we'd restart the master process, it won't be listening on 7077
 for some time, but the worker process is trying to connect and by the time
 the master is up the worker has given up...


 On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 Check whether the name can be resolved in the /etc/hosts file (or DNS)
 of the worker



 (the same btw applies for the Node where you run the driver app – all
 other nodes must be able to resolve its name)



 *From:* Stephen Boesch [mailto:java...@gmail.com]
 *Sent:* Wednesday, May 20, 2015 10:07 AM
 *To:* user
 *Subject:* Intermittent difficulties for Worker to contact Master on
 same machine in standalone





 What conditions would cause the following delays / failure for a
 standalone machine/cluster to have the Worker contact the Master?



 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
 http://10.0.0.3:8081

 15/05/20 02:02:53 INFO Worker: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...

 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077

 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt #
 1)

 ..

 ..

 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt #
 3)

 15/05/20 02:03:26 INFO Worker: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...

 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077







Need some Cassandra integration help

2015-05-26 Thread Yana Kadiyska
Hi folks, for those of you working with Cassandra, wondering if anyone has
been successful processing a mix of Cassandra and hdfs data. I have a
dataset which is stored partially in HDFS and partially in Cassandra
(schema is the same in both places)

I am trying to do the following:

val dfHDFS = sqlContext.parquetFile("foo.parquet")
val cassDF = cassandraContext.sql("SELECT * FROM keyspace.user")

 dfHDFS.unionAll(cassDF).count

​

This is failing for me with the following -

Exception in thread main java.lang.AssertionError: assertion failed:
No plan for CassandraRelation
TableDef(yana_test,test_connector,ArrayBuffer(ColumnDef(key,PartitionKeyColumn,IntType,false)),ArrayBuff
er(),ArrayBuffer(ColumnDef(value,RegularColumn,VarCharType,false))), None, None

at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:278)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$anonfun$14.apply(SparkStrategies.scala:292)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$anonfun$14.apply(SparkStrategies.scala:292)
at 
scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:292)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:152)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1092)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1090)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1096)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1096)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)
at org.apache.spark.sql.DataFrame.count(DataFrame.scala:899)
at com.akamai.rum.test.unit.SparkTest.sparkPerfTest(SparkTest.java:123)
at com.akamai.rum.test.unit.SparkTest.main(SparkTest.java:37)

​

Is there a way to pull the data out of cassandra on each executor but not
try to push logic down into cassandra?


Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Yana Kadiyska
Todd, I don't have any answers for you...other than the file is actually
named spark-defaults.conf (not sure if you made a typo in the email or
misnamed the file...). Do any other options from that file get read?
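
(For reference, the same setting can be given either in conf/spark-defaults.conf --
note the "s" -- or, this should also work, programmatically on the SparkConf in
your driver; the path below is just the one from your mail:)

spark.executor.extraClassPath /projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar

conf.set("spark.executor.extraClassPath", "/projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar")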

I also wanted to ask if you built the
spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar from trunk or if they
published a 1.3 drop somewhere -- I'm
just starting out with Cassandra and discovered
https://datastax-oss.atlassian.net/browse/SPARKC-98 is still open...

On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote:

 I'm using the spark-cassandra-connector from DataStax in a spark streaming
 job launched from my own driver.  It is connecting a a standalone cluster
 on my local box which has two worker running.

 This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT.  I have
 added the following entry to my $SPARK_HOME/conf/spark-default.conf:

 spark.executor.extraClassPath 
 /projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar


 When I start the master with, $SPARK_HOME/sbin/start-master.sh, it comes
 up just fine.  As do the two workers with the following command:

 Worker 1, port 8081:

 radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
 spark://radtech.io:7077 --webui-port 8081 --cores 2

 Worker 2, port 8082

 radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
 spark://radtech.io:7077 --webui-port 8082 --cores 2

 When I execute the Driver connecting the the master:

 sbt app/run -Dspark.master=spark://radtech.io:7077

 It starts up, but when the executors are launched they do not include the
 entry in the spark.executor.extraClassPath:

 15/05/22 17:35:26 INFO Worker: Asked to launch executor 
 app-20150522173526-/0 for KillrWeatherApp$15/05/22 17:35:26 INFO 
 ExecutorRunner: Launch command: java -cp 
 /usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar
  -Dspark.driver.port=55932 -Xms512M -Xmx512M 
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
 akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler 
 --executor-id 0 --hostname 192.168.1.3 --cores 2 --app-id 
 app-20150522173526- --worker-url 
 akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker



 which will then cause the executor to fail with a ClassNotFoundException,
 which I would expect:

 [WARN] [2015-05-22 17:38:18,035] [org.apache.spark.scheduler.TaskSetManager]: 
 Lost task 0.0 in stage 2.0 (TID 23, 192.168.1.3): 
 java.lang.ClassNotFoundException: 
 com.datastax.spark.connector.rdd.partitioner.CassandraPartition
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:344)
 at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
 at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
 at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 I also notice that some of the entries on the executor classpath are
 duplicated?  This is a newly installed spark-1.3.1-bin-hadoop2.6
  standalone 

Re: Unable to use hive queries with constants in predicates

2015-05-21 Thread Yana Kadiyska
I have not seen this error but have seen another user have weird parser
issues before:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E

I would attach a debugger and see what is going on -- if I'm looking at the
right place (
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/0.13.1/org/apache/hadoop/hive/ql/parse/HiveParser.java#HiveParser)
token 294 is RCURLY...which doesn't make much sense...

On Thu, May 21, 2015 at 2:10 AM, Devarajan Srinivasan 
devathecool1...@gmail.com wrote:

 Hi,

I was testing spark to read data from hive using HiveContext. I got the
 following error, when I used a simple query with constants in predicates.

    I am using Spark 1.3. Has anyone encountered an error like this?


 *Error:*


 Exception in thread main org.apache.spark.sql.AnalysisException:
 Unsupported language features in query: SELECT * from test_table where
 daily_partition='20150101'
 TOK_QUERY 1, 0,20, 81
   TOK_FROM 1, 10,14, 81
 TOK_TABREF 1, 12,14, 81
   TOK_TABNAME 1, 12,14, 81
 everest_marts_test 1, 12,12, 81
 voice_cdr 1, 14,14, 100
   TOK_INSERT 0, -1,-1, 0
 TOK_DESTINATION 0, -1,-1, 0
   TOK_DIR 0, -1,-1, 0
 TOK_TMP_FILE 0, -1,-1, 0
 TOK_SELECT 1, 0,8, 7
   TOK_SELEXPR 1, 2,2, 7
 TOK_TABLE_OR_COL 1, 2,2, 7
   callingpartynumber 1, 2,2, 7
   TOK_SELEXPR 1, 4,4, 26
 TOK_TABLE_OR_COL 1, 4,4, 26
   calledpartynumber 1, 4,4, 26
   TOK_SELEXPR 1, 6,6, 44
 TOK_TABLE_OR_COL 1, 6,6, 44
   chargingtime 1, 6,6, 44
   TOK_SELEXPR 1, 8,8, 57
 TOK_TABLE_OR_COL 1, 8,8, 57
   call_direction_key 1, 8,8, 57
 TOK_WHERE 1, 16,20, 131
   = 1, 18,20, 131
 TOK_TABLE_OR_COL 1, 18,18, 116
   daily_partition 1, 18,18, 116
 '20150101' 1, 20,20, 132

 scala.NotImplementedError: No parse rules for ASTNode type: 294, text:
 '20150101' :
 '20150101' 1, 20,20, 132
  +
 org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1261)
   ;
 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:261)
 at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$
 hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
 at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$
 hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.
 scala:136)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.
 scala:135)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.
 apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.
 apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.
 scala:222)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$
 append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$
 append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$Failure.append(
 Parsers.scala:202)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$
 append$1.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$
 append$1.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.
 scala:222)
 at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$
 apply$14.apply(Parsers.scala:891)
 at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$
 apply$14.apply(Parsers.scala:891)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.
 scala:890)
 at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(
 PackratParsers.scala:110)
 at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(
 AbstractSparkSQLParser.scala:38)
 at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
 at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
 at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$
 SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
 at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$
 SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.
 scala:136)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.
 scala:135)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.
 apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.
 apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.
 scala:222)
 at scala.util.parsing.combinator.Parsers$Parser$$anonfun$
 append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at 

Re: Storing data in MySQL from spark hive tables

2015-05-20 Thread Yana Kadiyska
I'm afraid you misunderstand the purpose of hive-site.xml. It configures
access to the Hive metastore. You can read more here:
http://www.hadoopmaterial.com/2013/11/metastore.html.

So the MySQL DB in hive-site.xml would be used to store hive-specific data
such as schema info, partition info, etc.

Now, for what you want to do, you can search the user list -- I know there
have been posts about Postgres but you can do the same with MySQL. The idea
is to create an object holding a connection pool (so each of your executors
would have its own instance), or alternately, to open a connection within
mapPartitions (so you don't end up with a ton of connections). But the
write to a DB is largely a manual process -- open a connection, create a
statement, sync the data. If your data is small enough you probably could
just collect on the driver and write...though that would certainly be
slower than writing in parallel from each executor.
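
A rough sketch of the per-partition write (plain JDBC from Scala; the JDBC URL,
credentials, table and column names below are made up for illustration, and the
MySQL driver jar has to be on the executor classpath):

import java.sql.DriverManager

df.rdd.foreachPartition { rows =>
  // one connection per partition rather than per row
  val conn = DriverManager.getConnection(
    "jdbc:mysql://your-mysql-host:3306/yourdb", "youruser", "yourpassword")
  val stmt = conn.prepareStatement("INSERT INTO your_table (col1, col2) VALUES (?, ?)")
  try {
    rows.foreach { row =>
      stmt.setString(1, row.getString(0))
      stmt.setString(2, row.getString(1))
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}

where df would be the DataFrame you get back from querying your hive table.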

On Wed, May 20, 2015 at 5:48 PM, roni roni.epi...@gmail.com wrote:

 Hi ,
 I am trying to setup the hive metastore and mysql DB connection.
  I have a spark cluster and I ran some programs and I have data stored in
 some hive tables.
 Now I want to store this data into Mysql  so that it is available for
 further processing.

 I setup the hive-site.xml file.

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <configuration>

   <property>
     <name>hive.semantic.analyzer.factory.impl</name>
     <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
   </property>

   <property>
     <name>hive.metastore.sasl.enabled</name>
     <value>false</value>
   </property>

   <property>
     <name>hive.server2.authentication</name>
     <value>NONE</value>
   </property>

   <property>
     <name>hive.server2.enable.doAs</name>
     <value>true</value>
   </property>

   <property>
     <name>hive.warehouse.subdir.inherit.perms</name>
     <value>true</value>
   </property>

   <property>
     <name>hive.metastore.schema.verification</name>
     <value>false</value>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://*ip address*:3306/metastore_db?createDatabaseIfNotExist=true</value>
     <description>metadata is stored in a MySQL server</description>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
     <description>MySQL JDBC driver class</description>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>root</value>
   </property>

   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value></value>
   </property>

   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/user/${user.name}/hive-warehouse</value>
     <description>location of default database for the warehouse</description>
   </property>

 </configuration>
  --
 My mysql server is on a separate server than where my spark server is . If
 I use mySQLWorkbench , I use a SSH connection  with a certificate file to
 connect .
 How do I specify all that information from spark to the DB ?
 I want to store the data generated by my spark program into mysql.
 Thanks
 _R



Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Yana Kadiyska
But if I'm reading his email correctly he's saying that:

1. The master and slave are on the same box (so network hiccups are
unlikely culprit)
2. The failures are intermittent -- i.e program works for a while then
worker gets disassociated...

Is it possible that the master restarted? We used to have problems like
this: we'd restart the master process, it wouldn't be listening on 7077 for
some time, the worker would keep trying to connect, and by the time the
master was up the worker had given up...


On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Check whether the name can be resolved in the /etc/hosts file (or DNS) of
 the worker



 (the same btw applies for the Node where you run the driver app – all
 other nodes must be able to resolve its name)



 *From:* Stephen Boesch [mailto:java...@gmail.com]
 *Sent:* Wednesday, May 20, 2015 10:07 AM
 *To:* user
 *Subject:* Intermittent difficulties for Worker to contact Master on same
 machine in standalone





 What conditions would cause the following delays / failure for a
 standalone machine/cluster to have the Worker contact the Master?



 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at
 http://10.0.0.3:8081

 15/05/20 02:02:53 INFO Worker: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...

 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077

 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt # 1)

 ..

 ..

 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt # 3)

 15/05/20 02:03:26 INFO Worker: Connecting to master
 akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...

 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable
 remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is
 now gated for 5000 ms, all messages to this address will be delivered to
 dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077



Re: store hive metastore on persistent store

2015-05-16 Thread Yana Kadiyska
oh... the metastore_db location is not controlled by
hive.metastore.warehouse.dir -- one is the location of your metastore DB,
the other is the physical location of your stored data. Check out this SO
thread:
http://stackoverflow.com/questions/13624893/metastore-db-created-wherever-i-run-hive
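
For illustration (not from the original thread), the two settings sit side by side
in hive-site.xml but control different things -- the embedded Derby URL below is
just an example path:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- where the metastore DB itself (metastore_db) is created -->
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore_db;create=true</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <!-- where the actual table data files are stored -->
  <value>hdfs://namenode:8020/user/hive/warehouse</value>
</property>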


On Sat, May 16, 2015 at 9:07 AM, Tamas Jambor jambo...@gmail.com wrote:

 Gave it another try - it seems that it picks up the variable and prints
 out the correct value, but still puts the metatore_db folder in the current
 directory, regardless.

 On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor jambo...@gmail.com wrote:

 Thank you for the reply.

 I have tried your experiment, it seems that it does not print the
 settings out in spark-shell (I'm using 1.3 by the way).

 Strangely I have been experimenting with an SQL connection instead, which
 works after all (still if I go to spark-shell and try to print out the SQL
 settings that I put in hive-site.xml, it does not print them).


 On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 My point was more to how to verify that properties are picked up from
 the hive-site.xml file. You don't really need hive.metastore.uris if
 you're not running against an external metastore.  I just did an
 experiment with warehouse.dir.

 My hive-site.xml looks like this:

 <configuration>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/home/ykadiysk/Github/warehouse_dir</value>
     <description>location of default database for the warehouse</description>
   </property>
 </configuration>

 ​

 and spark-shell code:

 scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f

 scala> hc.sql("show tables").collect
 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with 
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 
 unknown - will be ignored
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes 
 with 
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore
 15/05/15 14:13:04 WARN ObjectStore: Version information not found in 
 metastore. hive.metastore.schema.verification is not enabled so recording 
 the schema version 0.12.0-protobuf-2.5
 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO audit: ugi=ykadiysk  ip=unknown-ip-addr  
 cmd=get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as 
 embedded-only so does not have its own datastore table.
 res0: Array[org.apache.spark.sql.Row] = Array()

 scala> hc.getConf("hive.metastore.warehouse.dir")
 res1: String = /home/ykadiysk/Github/warehouse_dir

 ​

 I have not tried an HDFS path but you should be at least able to verify
 that the variable is being read. It might be that your value is read but is
 otherwise not liked...

 On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com
 wrote:

 thanks for the reply. I am trying to use it without hive setup
 (spark-standalone), so it prints something like this:

 hive_ctx.sql(show tables).collect()
 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2
 unknown - will be ignored
 15/05/15 17:59:04 INFO Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block
 manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819)

   [0/1844]
 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin
 classes with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming
 we are not on mysql: Lexical error at line 1, column 5.  Encountered: @
 (64), after : .
 15/05/15 17:59:20 INFO Datastore

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
This should work. Which version of Spark are you using? Here is what I do
-- make sure hive-site.xml is in the conf directory of the machine you're
using the driver from. Now let's run spark-shell from that machine:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6e9f8f26

scala> hc.sql("show tables").collect
15/05/15 09:34:17 INFO metastore: Trying to connect to metastore with URI thrift://hostname.com:9083  -- here should be a value from your hive-site.xml
15/05/15 09:34:17 INFO metastore: Waiting 1 seconds before next connection attempt.
15/05/15 09:34:18 INFO metastore: Connected to metastore.
res0: Array[org.apache.spark.sql.Row] = Array([table1,false],

scala> hc.getConf("hive.metastore.uris")
res13: String = thrift://hostname.com:9083

scala> hc.getConf("hive.metastore.warehouse.dir")
res14: String = /user/hive/warehouse

​

The first line tells you which metastore it's trying to connect to -- this
should be the string specified under hive.metastore.uris property in your
hive-site.xml file. I have not mucked with warehouse.dir too much but I
know that the value of the metastore URI is in fact picked up from there as
I regularly point to different systems...
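
As a point of reference, the relevant hive-site.xml fragment would look roughly
like this (the thrift URI below is just the illustrative value from the log
output above):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hostname.com:9083</value>
</property>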


On Thu, May 14, 2015 at 6:26 PM, Tamas Jambor jambo...@gmail.com wrote:

 I have tried to put the hive-site.xml file in the conf/ directory with,
 seems it is not picking up from there.


 On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com
 wrote:

 You can configure Spark SQLs hive interaction by placing a hive-site.xml
 file in the conf/ directory.

 On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote:

 Hi all,

 is it possible to set hive.metastore.warehouse.dir, that is internally
 create by spark, to be stored externally (e.g. s3 on aws or wasb on
 azure)?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/store-hive-metastore-on-persistent-store-tp22891.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






SPARK-4412 regressed?

2015-05-15 Thread Yana Kadiyska
Hi, two questions

1. Can regular JIRA users reopen bugs -- I can open a new issue but it does
not appear that I can reopen issues. What is the proper protocol to follow
if we discover regressions?

2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO
thread possibly even in 1.3.0
http://stackoverflow.com/questions/30052889/how-to-suppress-parquet-log-messages-in-spark


Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
: azure-file-system metrics system
 started
 15/05/15 17:59:33 INFO HiveMetaStore: Added admin role in metastore
 15/05/15 17:59:34 INFO HiveMetaStore: Added public role in metastore
 15/05/15 17:59:34 INFO HiveMetaStore: No user is added in admin role,
 since config is empty
 15/05/15 17:59:35 INFO SessionState: No Tez session required at this
 point. hive.execution.engine=mr.
 15/05/15 17:59:37 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
 15/05/15 17:59:37 INFO audit: ugi=testuser ip=unknown-ip-addr
  cmd=get_tables: db=default pat=.*

 not sure what to put in hive.metastore.uris in this case?


 On Fri, May 15, 2015 at 2:52 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 This should work. Which version of Spark are you using? Here is what I do
 -- make sure hive-site.xml is in the conf directory of the machine you're
 using the driver from. Now let's run spark-shell from that machine:

 scala val hc= new org.apache.spark.sql.hive.HiveContext(sc)
 hc: org.apache.spark.sql.hive.HiveContext = 
 org.apache.spark.sql.hive.HiveContext@6e9f8f26

 scala hc.sql(show tables).collect
 15/05/15 09:34:17 INFO metastore: Trying to connect to metastore with URI 
 thrift://hostname.com:9083  -- here should be a value from your 
 hive-site.xml
 15/05/15 09:34:17 INFO metastore: Waiting 1 seconds before next connection 
 attempt.
 15/05/15 09:34:18 INFO metastore: Connected to metastore.
 res0: Array[org.apache.spark.sql.Row] = Array([table1,false],

 scala hc.getConf(hive.metastore.uris)
 res13: String = thrift://hostname.com:9083

 scala hc.getConf(hive.metastore.warehouse.dir)
 res14: String = /user/hive/warehouse

 ​

 The first line tells you which metastore it's trying to connect to --
 this should be the string specified under hive.metastore.uris property in
 your hive-site.xml file. I have not mucked with warehouse.dir too much but
 I know that the value of the metastore URI is in fact picked up from there
 as I regularly point to different systems...


 On Thu, May 14, 2015 at 6:26 PM, Tamas Jambor jambo...@gmail.com wrote:

 I have tried to put the hive-site.xml file in the conf/ directory with,
 seems it is not picking up from there.


 On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust 
 mich...@databricks.com wrote:

 You can configure Spark SQLs hive interaction by placing a
 hive-site.xml file in the conf/ directory.

 On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote:

 Hi all,

 is it possible to set hive.metastore.warehouse.dir, that is internally
 create by spark, to be stored externally (e.g. s3 on aws or wasb on
 azure)?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/store-hive-metastore-on-persistent-store-tp22891.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








[SparkSQL] Partition Autodiscovery (Spark 1.3)

2015-05-12 Thread Yana Kadiyska
Hi folks, I'm trying to use Automatic partition discovery as described here:

https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

/data/year=2014/file.parquet
/data/year=2015/file.parquet
…
SELECT * FROM table WHERE year = 2015



I have an official 1.3.1 CDH4 build and did the following:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2564dce9

scala> val df = hc.parquetFile("/r/warehouse/hive/pkey=-2013-12/another_path_part/part-0-r-0.snappy.parquet")

scala> df.columns
res0: Array[String] = Array(but does not contain a pkey column)

df.registerTempTable("table")
scala> hc.sql("SELECT count(*) FROM table WHERE pkey='-2013-12'")
15/05/12 16:27:32 INFO ParseDriver: Parsing command: SELECT count(*) FROM table WHERE pkey='-2013-12'
15/05/12 16:27:33 INFO ParseDriver: Parse Completed
org.apache.spark.sql.AnalysisException: cannot resolve 'pkey' given input columns

So in my case the dataframe that resulted from the parquet file did not
result in an added, filterable column. Am I using this wrong? Or is my
on-disk structure not correct?

Any insight appreciated.


Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA:  https://issues.apache.org/jira/browse/SPARK-3928
Looks like for now you'd have to list the full paths...I don't see a
comment from an official spark committer so still not sure if this is a bug
or design, but it seems to be the current state of affairs.

On Thu, May 7, 2015 at 8:43 AM, yana yana.kadiy...@gmail.com wrote:

 I believe this is a regression. Does not work for me either. There is a
 Jira on parquet wildcards which is resolved, I'll see about getting it
 reopened


 Sent on the new Sprint Network from my Samsung Galaxy S®4.


  Original message 
 From: Vaxuki
 Date:05/07/2015 7:38 AM (GMT-05:00)
 To: Olivier Girardot
 Cc: user@spark.apache.org
 Subject: Re: Spark 1.3.1 and Parquet Partitions

 Olivier
 Nope. Wildcard extensions don't work I am debugging the code to figure out
 what's wrong I know I am using 1.3.1 for sure

 Pardon typos...

 On May 7, 2015, at 7:06 AM, Olivier Girardot ssab...@gmail.com wrote:

 hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you ?

 Le jeu. 7 mai 2015 à 03:32, vasuki vax...@gmail.com a écrit :

 Spark 1.3.1 -
 i have a parquet file on hdfs partitioned by some string looking like this
 /dataset/city=London/data.parquet
 /dataset/city=NewYork/data.parquet
 /dataset/city=Paris/data.paruqet
 ….

 I am trying to get to load it using sqlContext using
 sqlcontext.parquetFile(
 hdfs://some ip:8029/dataset/ what do i put here 

 No leads so far. is there i can load the partitions ? I am running on
 cluster and not local..
 -V



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using a JDBC connection to Spark's Thrift Server
so far, and using JDBC prepared statements to escape potentially malicious
user input.

I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to generate a
properly escaped sql statement...

Wondering if someone has ideas on proper way to do this?

To be concrete, I would love to issue this statement

 val df = myHiveCtxt.sql(sqlText)

​
but I would like to defend against potential SQL injection.
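
For what it's worth, one possible way to keep untrusted input out of the SQL text
entirely (a sketch, not from the original thread; the table and column names are
illustrative) is to build the predicate with Column expressions, so the value is
passed as a literal rather than spliced into a query string:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// userInput is the untrusted value; it travels as a literal Column,
// never concatenated into SQL text.
def lookupCustomer(hiveCtxt: org.apache.spark.sql.hive.HiveContext, userInput: String): DataFrame =
  hiveCtxt.table("mytable").filter(col("customer_id") === lit(userInput))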


[ThriftServer] Urgent -- very slow Metastore query from Spark

2015-04-16 Thread Yana Kadiyska
Hi Sparkers,

hoping for insight here:
running a simple describe mytable here where mytable is a partitioned Hive
table.

Spark produces the following times:

Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02,
SQL query: 72.831, Reading results: 0.189

​

Whereas Hive over the same metastore shows:

Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL
query: 0.204, Reading results: 0.236

​

I am looking at the metastore as Thriftserver couldn't start up at all
until I increased

hive.metastore.client.socket.timeout to 600


Why would metastore access from Spark's Thriftserver be so much worse than
from Hive?


The issue is pretty urgent for me as I ran into this problem during a push
to a production cluster (QA metastore table is smaller and it's a different
cluster that didn't show this).


Is there a known issue with metastore access -- I only see
https://issues.apache.org/jira/browse/SPARK-5923 but I'm using Postgres. We
are upgrading from Shark and both Hive and Shark process this a lot faster.


Describe table in itself is not a critical query for me but I am
experiencing performance hit in other queries and I'm suspecting the
metastore interaction (e.g.
https://www.mail-archive.com/user@spark.apache.org/msg26242.html)


[SparkSQL; Thriftserver] Help tracking missing 5 minutes

2015-04-15 Thread Yana Kadiyska
Hi Spark users,

Trying to upgrade to Spark 1.2 and running into the following:

I'm seeing some very slow queries and wondering if someone can point me in the
right direction for debugging. My Spark UI shows a job with a duration of 15s
(see attached screenshot), which would be great, but client-side measurement
shows the query takes just over 4 min.

15/04/15 16:34:59 INFO ParseDriver: Parsing command: ***my query here***
15/04/15 16:38:28 INFO ParquetTypesConverter: Falling back to schema
conversion from Parquet types;
15/04/15 16:38:29 INFO DAGScheduler: Got job 6 (collect at
SparkPlan.scala:84) with 401 output partitions (allowLocal=false)

​

So there is a gap of almost 4 min between the parse and the next line I can
identify as relating to this job. Can someone shed some light on what
happens between the parse and the DAG scheduler? The spark UI also shows
the submitted time as 16:38 which leads me to believe it counts time from
when the scheduler gets the job...but what happens before then?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

[ThriftServer] User permissions warning

2015-04-08 Thread Yana Kadiyska
Hi folks, I am noticing a pesky and persistent warning in my logs (this is
from Spark 1.2.1):


15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user

at org.apache.hadoop.util.Shell.runCommand(Shell.java:261)
at org.apache.hadoop.util.Shell.run(Shell.java:188)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:467)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:450)
at 
org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:86)
at 
org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:55)
at 
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:50)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:89)
at 
org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1292)
at 
org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:62)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
at 
org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:365)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:278)

​

I cannot figure out what I might be missing -- the thrift server is started
via sbin/start-thriftserver --master ..., I can see that the process is
running under my user. I don't have any functional issues but this is
annoying (filling up my logs/making it hard to read). Can someone give me
pointers on what to check?
Things I've tried:

1. hive.server2.enable.doAs is NOT set in hive-site.xml so I expect user
should at least show up as my id, not anonymous
2.export HADOOP_USER_NAME=someusername -- error still shows up about
anonymous

Curious if anyone has solved this


DataFrame -- help with encoding factor variables

2015-04-06 Thread Yana Kadiyska
Hi folks, currently have a DF that has a factor variable -- say gender.

I am hoping to use the RandomForest algorithm on this data and it appears
that this needs to be converted to RDD[LabeledPoint] first -- i.e. all
features need to be double-encoded.

I see https://issues.apache.org/jira/browse/SPARK-5888 is still open but
was wondering what is the recommended way to add a column? I can think of

featuresDF.map { case Row(f1, f2, f3) => (f1, f2, if (f3 == "male") 0 else 1,
if (f3 == "female") 0 else 1) }.toDF("f1", "f2", "f3_dummy", "f3_dummy2")

​

but that isn't ideal as I already have 80+ features in that dataframe so
the matching itself is a pain -- thinking there's got to be a better way to
append |levels| number of columns and select all columns but f3?

I see a withColumn method but no constructor to create a column...should I
be creating the dummy features in a new dataframe and then select them out
of there to get a Column?

Any pointers are appreciated -- I'm sure I'm not the first person to
attempt this, just unsure of the least painful way to achieve.
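
For what it's worth, a rough sketch of appending a dummy column with a UDF and
withColumn, so the other 80+ columns never have to be pattern-matched (not from
the original thread; the column names "gender" and "gender_dummy" are made up):

import org.apache.spark.sql.functions.udf

// 0.0/1.0 encoding for a two-level factor; for more levels you would add
// one such column per level (or per level minus one).
val toDummy = udf { (g: String) => if (g == "male") 0.0 else 1.0 }

val withDummy = featuresDF.withColumn("gender_dummy", toDummy(featuresDF("gender")))
// then select every column except the original "gender" before building LabeledPoints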


Re: Spark Avarage

2015-04-06 Thread Yana Kadiyska
If you're going to do it this way, I would output dayOfdate.substring(0,7),
i.e. the month part, and instead of weatherCond you can use
(month, (minDeg, maxDeg, meanDeg)) -- i.e. a PairRDD. So weathersRDD:
RDD[(String, (Double, Double, Double))]. Then use a reduceByKey as shown in
multiple Spark examples... You'd end up with the sum for each metric, and at
the end divide by the count to get the avg of each column. If you want to
use Algebird you can output (month, (Avg(minDeg), Avg(maxDeg), Avg(meanDeg)))
and then all your reduce operations would be _+_.
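
Something along these lines (a rough sketch, using the weatherCond RDD from the
question quoted below):

val monthly = weathersRDD
  .map(w => (w.dayOfdate.substring(0, 7), (w.minDeg, w.maxDeg, w.meanDeg, 1)))
  .reduceByKey { case ((min1, max1, mean1, n1), (min2, max2, mean2, n2)) =>
    // sum each metric plus a running count per month
    (min1 + min2, max1 + max2, mean1 + mean2, n1 + n2)
  }
  .mapValues { case (minSum, maxSum, meanSum, n) =>
    // divide by the count at the end to get the averages
    (minSum.toDouble / n, maxSum.toDouble / n, meanSum.toDouble / n)
  }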

With that said, if you're using Spark 1.3 check out
https://github.com/databricks/spark-csv (you should likely use the CSV
package anyway, even with a lower version of Spark) and
https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
(esp. the example at the top of the file). You'd just need .groupBy and .agg
if you set up the dataframe column that you're grouping by to contain just
the yyyy-MM portion of your date string.
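
Roughly (a sketch, assuming the CSV has been loaded into a DataFrame named
weatherDF with columns day/min/max/mean as in the sample data below):

import org.apache.spark.sql.functions.{avg, udf}

val toMonth = udf { (d: String) => d.substring(0, 7) }   // "2014-03-17" -> "2014-03"

// spark-csv may read the numeric columns as strings, hence the explicit casts
val monthlyDF = weatherDF
  .withColumn("month", toMonth(weatherDF("day")))
  .groupBy("month")
  .agg(avg(weatherDF("min").cast("double")),
       avg(weatherDF("max").cast("double")),
       avg(weatherDF("mean").cast("double")))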

On Mon, Apr 6, 2015 at 10:50 AM, barisak baris.akg...@gmail.com wrote:

 Hi

 I have a class in above desc.

 case class weatherCond(dayOfdate: String, minDeg: Int, maxDeg: Int,
 meanDeg:
 Int)

 I am reading the data from csv file and I put this data into weatherCond
 class with this code

 val weathersRDD = sc.textFile("weather.csv").map { line =>
   val Array(dayOfdate, minDeg, maxDeg, meanDeg) = line.replaceAll("\"", "").trim.split(",")
   weatherCond(dayOfdate, minDeg.toInt, maxDeg.toInt, meanDeg.toInt)
 }

 the question is ; how can I average the minDeg, maxDeg and meanDeg values
 for each month ;

 The data set example

 day, min, max , mean
 2014-03-17,-3,5,5
 2014-03-18,6,7,7
 2014-03-19,6,14,10

 result has to be (2014-03,   3,   8.6   ,7.3) -- (Average for 2014 - 03
 )

 Thanks





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Avarage-tp22391.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the dataframe API.

I have a DF which came from the csv package.

1. What would be an easy way to cast a column to a given type -- my DF
columns are all typed as strings coming from a csv. I see a schema getter
but not setter on DF

2. I am trying to use the syntax used in various blog posts but can't
figure out how to reference a column by name:

scala> df.filter("customer_id" != "")
<console>:23: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
  df.filter("customer_id" != "")

​
3. what would be the recommended way to drop a row containing a null value
-- is it possible to do this:
scala> df.filter("customer_id IS NOT NULL")
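
For reference, a sketch of the three operations using the Column API (not from the
original thread; the column name is illustrative):

import org.apache.spark.sql.functions.col

// 1. cast a string column to a typed one (added under a new name)
val typed = df.withColumn("customer_id_int", col("customer_id").cast("int"))

// 2. reference a column by name in a filter
val nonEmpty = df.filter(col("customer_id") !== "")

// 3. drop rows where that column is null
val noNulls = df.filter(col("customer_id").isNotNull)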


Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if TaskScheduler helps with that. I have not
used it with Windows 2008 R2 but it generally does allow you to schedule a
bat file to run on startup

On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

 Thanks for the suggestion. I will try that.



 Ningjun





 From: Silvio Fiorito [mailto:silvio.fior...@granturing.com]
 Sent: Wednesday, March 11, 2015 12:40 AM
 To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
 Subject: Re: Is it possible to use windows service to start and stop
spark standalone cluster



 Have you tried Apache Daemon?
http://commons.apache.org/proper/commons-daemon/procrun.html



 From: Wang, Ningjun (LNG-NPV)
 Date: Tuesday, March 10, 2015 at 11:47 PM
 To: user@spark.apache.org
 Subject: Is it possible to use windows service to start and stop spark
standalone cluster



 We are using spark stand alone cluster on Windows 2008 R2. I can start
spark clusters by open an command prompt and run the following



 bin\spark-class.cmd org.apache.spark.deploy.master.Master



 bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://
mywin.mydomain.com:7077



 I can stop spark cluster by pressing Ctril-C.



 The problem is that if the machine is reboot, I have to manually start
the spark cluster again as above. Is it possible to use a windows service
to start cluster? This way when the machine is reboot, the windows service
will automatically restart spark cluster. How to stop spark cluster using
windows service is also a challenge.



 Please advise.



 Thanks



 Ningjun


Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I was actually just able to reproduce the issue. I do wonder if this is a
bug -- the docs say "When not configured by the hive-site.xml, the context
automatically creates metastore_db and warehouse in the current directory."
But as you can see from the message, warehouse is not in the current
directory, it is under /user/hive. In my case this directory was owned by
'root' and no one else had write permissions. Changing the permissions works
if you need to get unblocked quickly... but it does seem like a bug to me...


On Fri, Feb 27, 2015 at 11:21 AM, sandeep vura sandeepv...@gmail.com
wrote:

 Hi yana,

 I have removed hive-site.xml from spark/conf directory but still getting
 the same errors. Anyother way to work around.

 Regards,
 Sandeep

 On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 I think you're mixing two things: the docs say "When not configured by
 the hive-site.xml, the context automatically creates metastore_db and
 warehouse in the current directory." AFAIK if you want a local
 metastore, you don't put hive-site.xml anywhere. You only need the file if
 you're going to point to an external metastore. If you're pointing to an
 external metastore, in my experience I've also had to copy core-site.xml
 into conf in order to specify this property: <name>fs.defaultFS</name>

 On Fri, Feb 27, 2015 at 10:39 AM, sandeep vura sandeepv...@gmail.com
 wrote:

 Hi Sparkers,

 I am using hive version - hive 0.13 and copied hive-site.xml in
 spark/conf and using default derby local metastore .

 While creating a table in spark shell getting the following error ..Can
 any one please look and give solution asap..

 sqlContext.sql("CREATE TABLE IF NOT EXISTS sandeep (key INT, value STRING)")
 15/02/27 23:06:13 ERROR RetryingHMSHandler:
 MetaException(message:file:/user/hive/warehouse_1/sandeep is not a
 directory or unable to create one)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1239)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:622)
 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
 at
 com.sun.proxy.$Proxy12.create_table_with_environment_context(Unknown Source)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:622)
 at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
 at com.sun.proxy.$Proxy13.createTable(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613)
 at
 org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
 at
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
 at
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
 at
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
 at $line9.$read$$iwC$$iwC$$iwC

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I think you're mixing two things: the docs say "When not configured by
the hive-site.xml, the context automatically creates metastore_db and
warehouse in the current directory." AFAIK if you want a local metastore,
you don't put hive-site.xml anywhere. You only need the file if you're
going to point to an external metastore. If you're pointing to an external
metastore, in my experience I've also had to copy core-site.xml into conf
in order to specify this property: <name>fs.defaultFS</name>

On Fri, Feb 27, 2015 at 10:39 AM, sandeep vura sandeepv...@gmail.com
wrote:

 Hi Sparkers,

 I am using hive version - hive 0.13 and copied hive-site.xml in spark/conf
 and using default derby local metastore .

 While creating a table in spark shell getting the following error ..Can
 any one please look and give solution asap..

 sqlContext.sql("CREATE TABLE IF NOT EXISTS sandeep (key INT, value STRING)")
 15/02/27 23:06:13 ERROR RetryingHMSHandler:
 MetaException(message:file:/user/hive/warehouse_1/sandeep is not a
 directory or unable to create one)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1239)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:622)
 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
 at
 com.sun.proxy.$Proxy12.create_table_with_environment_context(Unknown Source)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:622)
 at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
 at com.sun.proxy.$Proxy13.createTable(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613)
 at
 org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
 at
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
 at
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
 at $line9.$read$$iwC$$iwC$$iwC$$iwC.init(console:15)
 at $line9.$read$$iwC$$iwC$$iwC.init(console:20)
 at $line9.$read$$iwC$$iwC.init(console:22)
 at $line9.$read$$iwC.init(console:24)
 at $line9.$read.init(console:26)
 at $line9.$read$.init(console:30)
 at $line9.$read$.clinit(console)
 at $line9.$eval$.init(console:7)
 at $line9.$eval$.clinit(console)
 at $line9.$eval.$print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:622)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
 at
 

[SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yana Kadiyska
Can someone confirm if they can run UDFs in group by in spark1.2?

I have two builds running -- one from a custom build from early December
(commit 4259ca8dd12) which works fine, and Spark1.2-RC2.

On the latter I get:

 jdbc:hive2://XXX.208:10001> select from_unixtime(epoch,'yyyy-MM-dd-HH'),count(*) count
. . . . . . . . . . . . . . . . . . > from tbl
. . . . . . . . . . . . . . . . . . > group by from_unixtime(epoch,'yyyy-MM-dd-HH');
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Expression not in GROUP BY:
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,yyyy-MM-dd-HH) AS _c0#1004, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,yyyy-MM-dd-HH)],
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(epoch#1049L,yyyy-MM-dd-HH) AS _c0#1004,COUNT(1) AS count#1003L]
 MetastoreRelation default, tbl, None (state=,code=0)

​

This worked fine on my older build. I don't see a JIRA on this but maybe
I'm not looking right. Can someone please advise?


Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory. But we don't get
to specify _how_ the cores are spread among executors.

Is it possible that with 24G memory and 4 cores we get a spread of 1 core
per executor thus ending up with 24G for the task, but with 24G memory and
10 cores some executor ends up with 3 cores on the same machine and thus we
have only 8G per task?

On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:

 Hi Yong,

 mostly correct except for:


- Since we are doing reduceByKey, shuffling will happen. Data will be
shuffled into 1000 partitions, as we have 1000 unique keys.

 no, you will not get 1000 partitions.  Spark has to decide how many
 partitions to use before it even knows how many unique keys there are.  If
 you have 200 as the default parallelism (or you just explicitly make it the
 second parameter to reduceByKey()), then you will get 200 partitions.  The
 1000 unique keys will be distributed across the 200 partitions.  ideally
 they will be distributed pretty equally, but how they get distributed
 depends on the partitioner (by default you will have a HashPartitioner, so
 it depends on the hash of your keys).

 Note that this is more or less the same as in Hadoop MapReduce.

 the amount of parallelism matters b/c there are various places in spark
 where there is some overhead proportional to the size of a partition.  So
 in your example, if you have 1000 unique keys in 200 partitions, you expect
 about 5 unique keys per partitions -- if instead you had 10 partitions,
 you'd expect 100 unique keys per partitions, and thus more data and you'd
 be more likely to hit an OOM.  But there are many other possible sources of
 OOM, so this is definitely not the *only* solution.

 Sorry I can't comment in particular about Spark SQL -- hopefully somebody
 more knowledgeable can comment on that.



 On Wed, Feb 25, 2015 at 8:58 PM, java8964 java8...@hotmail.com wrote:

 Hi, Sparkers:

 I come from the Hadoop MapReducer world, and try to understand some
 internal information of spark. From the web and this list, I keep seeing
 people talking about increase the parallelism if you get the OOM error. I
 tried to read document as much as possible to understand the RDD partition,
 and parallelism usage in the spark.

 I understand that for RDD from HDFS, by default, one partition will be
 one HDFS block, pretty straightforward. I saw that lots of RDD operations
 support 2nd parameter of parallelism. This is the part confuse me. From my
 understand, the parallelism is totally controlled by how many cores you
 give to your job. Adjust that parameter, or spark.default.parallelism
 shouldn't have any impact.

 For example, if I have a 10G data in HDFS, and assume the block size is
 128M, so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to
 a Pair RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey
 action, using 200 as the default parallelism. Here is what I assume:


- We have 100 partitions, as the data comes from 100 blocks. Most
likely the spark will generate 100 tasks to read and shuffle them?
- The 1000 unique keys mean the 1000 reducer group, like in MR
- If I set the max core to be 50, so there will be up to 50 tasks can
be run concurrently. The rest tasks just have to wait for the core, if
there are 50 tasks are running.
- Since we are doing reduceByKey, shuffling will happen. Data will be
shuffled into 1000 partitions, as we have 1000 unique keys.
- I don't know these 1000 partitions will be processed by how many
tasks, maybe this is the parallelism parameter comes in?
- No matter what parallelism this will be, there are ONLY 50 task can
be run concurrently. So if we set more cores, more partitions' data will 
 be
processed in the executor (which runs more thread in this case), so more
memory needs. I don't see how increasing parallelism could help the OOM in
this case.
- In my test case of Spark SQL, I gave 24G as the executor heap, my
join between 2 big datasets keeps getting OOM. I keep increasing the
spark.default.parallelism, from 200 to 400, to 2000, even to 4000, no
help. What really makes the query finish finally without OOM is after I
change the --total-executor-cores from 10 to 4.


 So my questions are:
 1) What is the parallelism really mean in the Spark? In the simple
 example above, for reduceByKey, what difference it is between parallelism
 change from 10 to 20?
 2) When we talk about partition in the spark, for the data coming from
 HDFS, I can understand the partition clearly. For the intermediate data,
 the partition will be same as key, right? For group, reducing, join action,
 uniqueness of the keys will be partition. Is that correct?
 3) Why increasing parallelism could help OOM? 

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the
shuffle setting: spark.sql.shuffle.partitions
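
For example (the value 400 is just illustrative), it can be changed per
SQLContext/HiveContext with:

sqlContext.setConf("spark.sql.shuffle.partitions", "400")
// or equivalently: sqlContext.sql("SET spark.sql.shuffle.partitions=400")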

On Thu, Feb 26, 2015 at 5:51 PM, java8964 java8...@hotmail.com wrote:

 Imran, thanks for your explaining about the parallelism. That is very
 helpful.

 In my test case, I am only use one box cluster, with one executor. So if I
 put 10 cores, then 10 concurrent task will be run within this one executor,
 which will handle more data than 4 core case, then leaded to OOM.

 I haven't setup Spark on our production cluster yet, but assume that we
 have 100 nodes cluster, if I guess right, set up to 1000 cores mean that on
  average, each box's executor will run 10 threads to process data. So
 lowering core will reduce the speed of spark, but can help to avoid the
 OOM, as less data to be processed in the memory.

 My another guess is that each partition will be processed by one core
 eventually. So make bigger partition count can decrease partition size,
 which should help the memory footprint. In my case, I guess that Spark SQL
 in fact doesn't use the spark.default.parallelism setting, or at least in
 my query, it is not used. So no matter what I changed, it doesn't matter.
 The reason I said that is that there is always 200 tasks in stage 2 and 3
 of my query job, no matter what I set the spark.default.parallelism.

 I think lowering the core is to exchange lower memory usage vs speed. Hope
 my understanding is correct.

 Thanks

 Yong

 --
 Date: Thu, 26 Feb 2015 17:03:20 -0500
 Subject: Re: Help me understand the partition, parallelism in Spark
 From: yana.kadiy...@gmail.com
 To: iras...@cloudera.com
 CC: java8...@hotmail.com; user@spark.apache.org


 Imran, I have also observed the phenomenon of reducing the cores helping
 with OOM. I wanted to ask this (hopefully without straying off topic): we
 can specify the number of cores and the executor memory. But we don't get
 to specify _how_ the cores are spread among executors.

 Is it possible that with 24G memory and 4 cores we get a spread of 1 core
 per executor thus ending up with 24G for the task, but with 24G memory and
 10 cores some executor ends up with 3 cores on the same machine and thus we
 have only 8G per task?

 On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com
 wrote:

 Hi Yong,

 mostly correct except for:


- Since we are doing reduceByKey, shuffling will happen. Data will be
shuffled into 1000 partitions, as we have 1000 unique keys.

 no, you will not get 1000 partitions.  Spark has to decide how many
 partitions to use before it even knows how many unique keys there are.  If
 you have 200 as the default parallelism (or you just explicitly make it the
 second parameter to reduceByKey()), then you will get 200 partitions.  The
 1000 unique keys will be distributed across the 200 partitions.  ideally
 they will be distributed pretty equally, but how they get distributed
 depends on the partitioner (by default you will have a HashPartitioner, so
 it depends on the hash of your keys).

 Note that this is more or less the same as in Hadoop MapReduce.

 the amount of parallelism matters b/c there are various places in spark
 where there is some overhead proportional to the size of a partition.  So
 in your example, if you have 1000 unique keys in 200 partitions, you expect
 about 5 unique keys per partitions -- if instead you had 10 partitions,
 you'd expect 100 unique keys per partitions, and thus more data and you'd
 be more likely to hit an OOM.  But there are many other possible sources of
 OOM, so this is definitely not the *only* solution.

 Sorry I can't comment in particular about Spark SQL -- hopefully somebody
 more knowledgeable can comment on that.



 On Wed, Feb 25, 2015 at 8:58 PM, java8964 java8...@hotmail.com wrote:

 Hi, Sparkers:

 I come from the Hadoop MapReducer world, and try to understand some
 internal information of spark. From the web and this list, I keep seeing
 people talking about increase the parallelism if you get the OOM error. I
 tried to read document as much as possible to understand the RDD partition,
 and parallelism usage in the spark.

 I understand that for RDD from HDFS, by default, one partition will be one
 HDFS block, pretty straightforward. I saw that lots of RDD operations
 support 2nd parameter of parallelism. This is the part confuse me. From my
 understand, the parallelism is totally controlled by how many cores you
 give to your job. Adjust that parameter, or spark.default.parallelism
 shouldn't have any impact.

 For example, if I have a 10G data in HDFS, and assume the block size is
 128M, so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to
 a Pair RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey
 action, using 200 as the default parallelism. Here is what I assume:


- We have 100 partitions, as the data comes from 100 blocks. Most
likely the spark will generate 100 tasks to 

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Yana Kadiyska
I am not sure if your issue is setting the Fair mode correctly or something
else so let's start with the FAIR mode.

Do you see scheduler mode actually being set to FAIR:

I have this line in spark-defaults.conf
spark.scheduler.allocation.file=/spark/conf/fairscheduler.xml

Then, when I start my application, I can see that it is using that
scheduler in the UI -- go to master UI and click on your application. Then
you should see this (note the scheduling mode is shown as Fair):





On Wed, Feb 25, 2015 at 4:06 AM, Harika Matha matha.har...@gmail.com
wrote:

 Hi Yana,

 I tried running the program after setting the property
 spark.scheduler.mode to FAIR. But the result is same as previous. Are
 there any other properties that have to be set?


 On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 It's hard to tell. I have not run this on EC2 but this worked for me:

 The only thing that I can think of is that the scheduling mode is set to

- *Scheduling Mode:* FAIR


 val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
 while_loop to get curr_job
  pool.execute(new ReportJob(sqlContext, curr_job, i))

 class ReportJob(sqlContext:org.apache.spark.sql.hive.HiveContext,query: 
 String,id:Int) extends Runnable with Logging {
   def threadId = (Thread.currentThread.getName() + \t)

   def run() {
 logInfo(s* Running ${threadId} ${id})
 val startTime = Platform.currentTime
 val hiveQuery=query
 val result_set = sqlContext.sql(hiveQuery)
 result_set.repartition(1)
 result_set.saveAsParquetFile(shdfs:///tmp/${id})
 logInfo(s* DONE ${threadId} ${id} time: 
 +(Platform.currentTime-startTime))
   }
 }

 ​

 On Tue, Feb 24, 2015 at 4:04 AM, Harika matha.har...@gmail.com wrote:

 Hi all,

 I have been running a simple SQL program on Spark. To test the
 concurrency,
 I have created 10 threads inside the program, all threads using same
 SQLContext object. When I ran the program on my EC2 cluster using
 spark-submit, only 3 threads were running in parallel. I have repeated
 the
 test on different EC2 clusters (containing different number of cores) and
 found out that only 3 threads are running in parallel on every cluster.

 Why is this behaviour seen? What does this number 3 specify?
 Is there any configuration parameter that I have to set if I want to run
 more threads concurrently?

 Thanks
 Harika



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Running-multiple-threads-with-same-Spark-Context-tp21784.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Executor size and checkpoints

2015-02-24 Thread Yana Kadiyska
Tathagata, yes, I was using StreamingContext.getOrCreate. My question is
about the design decision here. I was expecting that if I have a streaming
application that say crashed, and I wanted to give the executors more
memory, I would be able to restart, using the checkpointed RDD but with
more memory.

I thought deleting the checkpoints in a checkpointed application is the
last thing that you want to do (as you lose all state). Seems a bit harsh
to have to do this just to increase the amount of memory?

On Mon, Feb 23, 2015 at 11:12 PM, Tathagata Das t...@databricks.com wrote:

 Hey Yana,

 I think you posted screenshots, but they are not visible in the email.
 Probably better to upload them and post links.

 Are you using StreamingContext.getOrCreate? If that is being used, then it
 will recreate the SparkContext with SparkConf having whatever configuration
 is present in the existing checkpoint files. It may so happen that the
 existing checkpoint files were from an old run which had 512 configured. So
 the SparkConf in the restarted SparkContext/StremingContext is accidentally
 picking up the old configuration. Deleting the checkpoint files avoided a
 restart, and the new config took affect. Maybe. :)

 TD

 On Sat, Feb 21, 2015 at 7:30 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Hi all,

 I had a streaming application and midway through things decided to up the
 executor memory. I spent a long time launching like this:

 ~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
 --executor-memory 2G --master...

 and observing the executor memory is still at old 512 setting

 I was about to ask if this is a bug when I decided to delete the
 checkpoints. Sure enough the setting took after that.

 So my question is -- why is it required to remove checkpoints to increase
 memory allowed on an executor? This seems pretty un-intuitive to me.

 Thanks for any insights.





[SparkSQL] Number of map tasks in SparkSQL

2015-02-24 Thread Yana Kadiyska
Shark used to have shark.map.tasks variable. Is there an equivalent for
Spark SQL?

We are trying a scenario with heavily partitioned Hive tables. We end up
with a UnionRDD with a lot of partitions underneath and hence too many
tasks:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L202

is there a good way to tell SQL to coalesce these?

thanks for any pointers


Re: Running multiple threads with same Spark Context

2015-02-24 Thread Yana Kadiyska
It's hard to tell. I have not run this on EC2 but this worked for me:

The only thing that I can think of is that the scheduling mode is set to

   - *Scheduling Mode:* FAIR


val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
// while loop to get curr_job
  pool.execute(new ReportJob(sqlContext, curr_job, i))

class ReportJob(sqlContext: org.apache.spark.sql.hive.HiveContext, query: String, id: Int) extends Runnable with Logging {
  def threadId = (Thread.currentThread.getName() + "\t")

  def run() {
    logInfo(s"* Running ${threadId} ${id}")
    val startTime = Platform.currentTime
    val hiveQuery = query
    val result_set = sqlContext.sql(hiveQuery)
    result_set.repartition(1)
    result_set.saveAsParquetFile(s"hdfs:///tmp/${id}")
    logInfo(s"* DONE ${threadId} ${id} time: " + (Platform.currentTime - startTime))
  }
}

​

On Tue, Feb 24, 2015 at 4:04 AM, Harika matha.har...@gmail.com wrote:

 Hi all,

 I have been running a simple SQL program on Spark. To test the concurrency,
 I have created 10 threads inside the program, all threads using same
 SQLContext object. When I ran the program on my EC2 cluster using
 spark-submit, only 3 threads were running in parallel. I have repeated the
 test on different EC2 clusters (containing different number of cores) and
 found out that only 3 threads are running in parallel on every cluster.

 Why is this behaviour seen? What does this number 3 specify?
 Is there any configuration parameter that I have to set if I want to run
 more threads concurrently?

 Thanks
 Harika



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Running-multiple-threads-with-same-Spark-Context-tp21784.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Executor size and checkpoints

2015-02-21 Thread Yana Kadiyska
Hi all,

I had a streaming application and midway through things decided to up the
executor memory. I spent a long time launching like this:

~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
--executor-memory 2G --master...

and observed that the executor memory was still at the old 512 setting.

I was about to ask if this was a bug when I decided to delete the
checkpoints. Sure enough, the setting took effect after that.

So my question is -- why is it required to remove checkpoints to increase
memory allowed on an executor? This seems pretty un-intuitive to me.

Thanks for any insights.


textFile partitions

2015-02-09 Thread Yana Kadiyska
Hi folks, puzzled by something pretty simple:

I have a standalone cluster with default parallelism of 2, spark-shell
running with 2 cores

sc.textFile("README.md").partitions.size returns 2 (this makes sense)
sc.textFile("README.md").coalesce(100, true).partitions.size returns 100,
also makes sense

but

sc.textFile("README.md", 100).partitions.size
 gives 102 -- I was expecting this to be equivalent to the last statement
(i.e. result in 100 partitions)

I'd appreciate it if someone could enlighten me as to why I end up with 102.
This is on Spark 1.2

thanks


Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-23 Thread Yana Kadiyska
If you're running the test via sbt you can examine the classpath that sbt
uses for the test (show runtime:full-classpath or last run) -- I find this
helps once too many includes and excludes start to interact.

On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu amoc...@verticalscope.com
wrote:


 I use spark 1.1.0-SNAPSHOT and the test I'm running is in local mode. My
 test case uses org.apache.spark.streaming.TestSuiteBase

  val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
  val sparkStreaming = "org.apache.spark" % "spark-streaming_2.10" % "1.1.0-SNAPSHOT" % "provided" excludeAll(
  val sparkCassandra = "com.tuplejump" % "calliope_2.10" % "0.9.0-C2-EA"
    exclude("org.apache.cassandra", "cassandra-all")
    exclude("org.apache.cassandra", "cassandra-thrift")
  val casAll = "org.apache.cassandra" % "cassandra-all" % "2.0.3" intransitive()
  val casThrift = "org.apache.cassandra" % "cassandra-thrift" % "2.0.3" intransitive()
  val sparkStreamingFromKafka = "org.apache.spark" % "spark-streaming-kafka_2.10" % "0.9.1" excludeAll(


 -Original Message-
 From: Sean Owen [mailto:so...@cloudera.com]
 Sent: January-22-15 11:39 AM
 To: Adrian Mocanu
 Cc: u...@spark.incubator.apache.org
 Subject: Re: Exception: NoSuchMethodError:
 org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

 NoSuchMethodError almost always means that you have compiled some code
 against one version of a library but are running against another. I wonder
 if you are including different versions of Spark in your project, or
 running against a cluster on an older version?
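
  A hedged sketch of the usual guard against this kind of drift (the version
  is hypothetical and should match the cluster): pin every Spark artifact to
  a single value so core, streaming, and the Kafka connector cannot diverge.

  // build.sbt fragment (sketch)
  val sparkVersion = "1.1.0"  // hypothetical; keep in sync with the cluster

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"            % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming"       % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
  )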

 On Thu, Jan 22, 2015 at 3:57 PM, Adrian Mocanu amoc...@verticalscope.com
 wrote:
  Hi
 
  I get this exception when I run a Spark test case on my local machine:
 
 
 
  An exception or error caused a run to abort:
  org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lo
  rg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lsca
  la/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/
  dstream/PairDStreamFunctions;
 
  java.lang.NoSuchMethodError:
  org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lo
  rg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lsca
  la/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/streaming/
  dstream/PairDStreamFunctions;
 
 
 
  In my test case I have these Spark related imports imports:
 
  import org.apache.spark.streaming.StreamingContext._
 
  import org.apache.spark.streaming.TestSuiteBase
 
  import org.apache.spark.streaming.dstream.DStream
 
  import
  org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
 
 
 
  -Adrian
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Results never return to driver | Spark Custom Reader

2015-01-23 Thread Yana Kadiyska
It looks to me like your executor actually crashed and didn't just finish
properly.

Can you check the executor log?

It is available in the UI, or on the worker machine, under $SPARK_HOME/work/
 app-20150123155114-/6/stderr  (unless you manually changed the work
directory location but in that case I'd assume you know where to find the
log)

On Thu, Jan 22, 2015 at 10:54 PM, Harihar Nahak hna...@wynyardgroup.com
wrote:

 Hi All,

 I wrote a custom reader to read a DB, and it is able to return key and
 value
 as expected, but after it finished it never returned to the driver.

 Here is the output of the worker log:
 15/01/23 15:51:38 INFO worker.ExecutorRunner: Launch command: java -cp

 ::/usr/local/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/usr/local/hadoop/etc/hadoop
 -XX:MaxPermSize=128m -Dspark.driver.port=53484 -Xms1024M -Xmx1024M
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@VM90:53484/user/CoarseGrainedScheduler 6 VM99
 4 app-20150123155114-
 akka.tcp://sparkWorker@VM99:44826/user/Worker
 15/01/23 15:51:47 INFO worker.Worker: Executor app-20150123155114-/6
 finished with state EXITED message Command exited with code 1 exitStatus 1
 15/01/23 15:51:47 WARN remote.ReliableDeliverySupervisor: Association with
 remote system [akka.tcp://sparkExecutor@VM99:57695] has failed, address is
 now gated for [5000] ms. Reason is: [Disassociated].
 15/01/23 15:51:47 INFO actor.LocalActorRef: Message
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
 Actor[akka://sparkWorker/deadLetters] to

 Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40143.96.25.29%3A35065-4#-915179653]
 was not delivered. [3] dead letters encountered. This logging can be turned
 off or adjusted with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.
 15/01/23 15:51:49 INFO worker.Worker: Asked to kill unknown executor
 app-20150123155114-/6

 If someone notices any clue to fixing this, I will really appreciate it.



 -
 --Harihar
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Results-never-return-to-driver-Spark-Custom-Reader-tp21328.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark-shell has syntax error on windows.

2015-01-23 Thread Yana Kadiyska
https://issues.apache.org/jira/browse/SPARK-5389

I marked it as minor since I also just discovered that I can run it under
PowerShell just fine. Vladimir, feel free to change the bug if you're
getting a different message or a more serious issue.

On Fri, Jan 23, 2015 at 4:44 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Do you mind filing a JIRA issue for this which includes the actual error
 message string that you saw?  https://issues.apache.org/jira/browse/SPARK

 On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 I am not sure if you get the same exception as I do -- spark-shell2.cmd
 works fine for me. Windows 7 as well. I've never bothered looking to fix it
 as it seems spark-shell just calls spark-shell2 anyway...

 On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko 
 protsenk...@gmail.com wrote:

 I have a problem with running spark shell in windows 7. I made the
 following
 steps:

 1. downloaded and installed Scala 2.11.5
 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git
 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests
 clean
 package (in git bash)

  After installation I tried to run spark-shell.cmd in a cmd shell and it says
  there is a syntax error in the file. What can I do to fix the problem?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: spark-shell has syntax error on windows.

2015-01-22 Thread Yana Kadiyska
I am not sure if you get the same exception as I do -- spark-shell2.cmd
works fine for me. Windows 7 as well. I've never bothered looking to fix it
as it seems spark-shell just calls spark-shell2 anyway...

On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com
wrote:

 I have a problem with running spark shell in windows 7. I made the
 following
 steps:

 1. downloaded and installed Scala 2.11.5
 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git
 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests
 clean
 package (in git bash)

 After installation I tried to run spark-shell.cmd in a cmd shell and it says
 there is a syntax error in the file. What can I do to fix the problem?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-has-syntax-error-on-windows-tp21313.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Installing Spark Standalone to a Cluster

2015-01-22 Thread Yana Kadiyska
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're
missing --master. In addition, it's a good idea to pass as --master
exactly the Spark master's endpoint as shown on your UI at
http://localhost:8080. That should do it. If it's not working, you
can look at the Worker log and see where it's trying to connect and whether
it's getting any errors.

On Thu, Jan 22, 2015 at 12:06 PM, riginos samarasrigi...@gmail.com wrote:

 I have downloaded spark-1.2.0.tgz on each of my nodes and executed ./sbt/sbt
 assembly on each of them. Then I execute ./sbin/start-master.sh on my master
 and ./bin/spark-class org.apache.spark.deploy.worker.Worker
 spark://IP:PORT.
 However, when I go to http://localhost:8080 I cannot see any worker. Why
 is that? Did I do something wrong with the installation/deployment of Spark?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21319.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
Thanks for looking, Cheng. Just to clarify in case other people need this
sooner, setting sc.hadoopConfiguration.set("parquet.task.side.metadata",
"false") did work well in terms of dropping row groups/showing a small input
size. What was odd is that the overall time wasn't much
better... but maybe that was the overhead of sending the metadata client side.

Thanks again and looking forward to your fix

On Tue, Jan 20, 2015 at 9:07 PM, Cheng Lian lian.cs@gmail.com wrote:

  Hey Yana,

 Sorry for the late reply, missed this important thread somehow. And many
 thanks for reporting this. It turned out to be a bug — filter pushdown is
 only enabled when using client side metadata, which is not expected,
 because task side metadata code path is more performant. And I guess that
 the reason why setting parquet.task.side.metadata to false didn’t reduce
 input size for you is because you set the configuration with Spark API, or
 put it into spark-defaults.conf. This configuration goes to Hadoop
 Configuration, and Spark only merge properties whose names start with
 spark.hadoop into Hadoop Configuration instances. You may try to put
 parquet.task.side.metadata config into Hadoop core-site.xml, and then
 re-run the query. I can see significant differences by doing so.
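
 To make the working propagation paths concrete, a small sketch (values taken
 from this thread; the spark-defaults form relies on the spark.hadoop. prefix
 described above):

 // Option 1: set it programmatically on the SparkContext's Hadoop configuration
 sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")

 // Option 2: in spark-defaults.conf (or with --conf), add the spark.hadoop. prefix,
 // which Spark merges into the Hadoop Configuration handed to the input format:
 //   spark.hadoop.parquet.task.side.metadata  false

 // Option 3: put the property into Hadoop's core-site.xml, as suggested above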

 I’ll open a JIRA and deliver a fix for this ASAP. Thanks again for
 reporting all the details!

 Cheng

 On 1/13/15 12:56 PM, Yana Kadiyska wrote:

   Attempting to bump this up in case someone can help out after all. I
 spent a few good hours stepping through the code today, so I'll summarize
 my observations both in hope I get some help and to help others that might
 be looking into this:

   1. I am setting *spark.sql.parquet.filterPushdown=true*
  2. I can see by stepping through the driver debugger that
  ParquetTableOperations.execute sets the filters via
  ParquetInputFormat.setFilterPredicate (I checked the conf object, things
  appear OK there)
  3. In FilteringParquetRowInputFormat, I get through the codepath for
  getTaskSideSplits. It seems that the codepath for getClientSideSplits would
  try to drop row groups but I don't see anything similar in getTaskSideSplits.

  Does anyone have pointers on where to look after this? Where is rowgroup
 filtering happening in the case of getTaskSideSplits? I can attach to the
 executor but am not quite sure what code related to Parquet gets called
 executor side...also don't see any messages in the executor logs related to
 Filtering predicates.

 For comparison, I went through the getClientSideSplits and can see that
 predicate pushdown works OK:


 sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")

 15/01/13 20:04:49 INFO FilteringParquetRowInputFormat: Using Client Side 
 Metadata Split Strategy
 15/01/13 20:05:13 INFO FilterCompat: Filtering using predicate: eq(epoch, 
 1417384800)
 15/01/13 20:06:45 INFO FilteringParquetRowInputFormat: Dropping 572 row 
 groups that do not pass filter predicate (28 %) !


  Is it possible that this is just a UI bug? I can see Input=4G when using
 (parquet.task.side.metadata,false) and Input=140G when using
 (parquet.task.side.metadata,true) but the runtimes are very comparable?

  [image: Inline image 1]


  JobId 4 is the ClientSide split, JobId 5 is the TaskSide split.



  On Fri, Jan 9, 2015 at 2:56 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 I am running the following (connecting to an external Hive Metastore)

   /a/shark/spark/bin/spark-shell --master spark://ip:7077  --conf
 spark.sql.parquet.filterPushdown=true

  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

  and then ran two queries:

 sqlContext.sql("select count(*) from table where partition='blah'")
 and
 sqlContext.sql("select count(*) from table where partition='blah' and epoch=1415561604")


  According to the Input tab in the UI both scan about 140G of data which
 is the size of my whole partition. So I have two questions --

  1. is there a way to tell from the plan if a predicate pushdown is
 supposed to happen?
 I see this for the second query

 res0: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[0] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 Aggregate false, [], [Coalesce(SUM(PartialCount#49L),0) AS _c0#0L]
  Exchange SinglePartition
   Aggregate true, [], [COUNT(1) AS PartialCount#49L]
OutputFaker []
 Project []
  ParquetTableScan [epoch#139L], (ParquetRelation list of hdfs files

   2. am I doing something obviously wrong that this is not working? (I'm
  guessing it's not working because the input size for the second query shows
  unchanged and the execution time is almost 2x as long)

  thanks in advance for any insights





Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-19 Thread Yana Kadiyska
If you're talking about filter pushdown for parquet files, this also has to
be turned on explicitly. Try *spark.sql.parquet.filterPushdown=true*. It's
off by default.
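
A minimal sketch of turning the flag on from a SQLContext (either form should
work):

// enable Parquet filter pushdown (off by default in Spark 1.2)
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

// or, equivalently, via the conf API
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")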

On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wangxy...@gmail.com wrote:

 Yes, it works!
 But the filter can't be pushed down!!!

 Should the custom ParquetInputFormat just implement the data source API instead?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

 2015-01-16 21:51 GMT+08:00 Xiaoyu Wang wangxy...@gmail.com:

 Thanks yana!
 I will try it!

 On Jan 16, 2015, at 20:51, yana yana.kadiy...@gmail.com wrote:

 I think you might need to set
 spark.sql.hive.convertMetastoreParquet to false if I understand that flag
 correctly

 Sent on the new Sprint Network from my Samsung Galaxy S®4.


  Original message 
 From: Xiaoyu Wang
 Date:01/16/2015 5:09 AM (GMT-05:00)
 To: user@spark.apache.org
 Subject: Why custom parquet format hive table execute ParquetTableScan
 physical plan, not HiveTableScan?

 Hi all!

 In the Spark SQL1.2.0.
 I create a hive table with custom parquet inputformat and outputformat.
 like this :
 CREATE TABLE test(
   id string,
   msg string)
 CLUSTERED BY (
   id)
 SORTED BY (
   id ASC)
 INTO 10 BUCKETS
 ROW FORMAT SERDE
   'com.a.MyParquetHiveSerDe'
 STORED AS INPUTFORMAT
   'com.a.MyParquetInputFormat'
 OUTPUTFORMAT
   'com.a.MyParquetOutputFormat';

 And in the spark shell, the plan of select * from test is:

 [== Physical Plan ==]
 [!OutputFaker [id#5,msg#6]]
 [ *ParquetTableScan* [id#12,msg#13], (ParquetRelation
 hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration:
 core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
 yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
 org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

 *Not HiveTableScan*!!!
 *So it doesn't execute my custom inputformat!*
 Why? How can I get it to execute my custom inputformat?

 Thanks!






Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-17 Thread Yana Kadiyska
Just wondering if you've made any progress on this -- I'm having the same
issue. My attempts to help myself are documented here
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E
.

I don't believe I have the "value scattered through all blocks" issue either,
as running with
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false") shows a
much smaller Input size for me, and it is the exact same parquet files being
scanned.
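
For anyone trying to reproduce the comparison, a rough sketch (the query is
borrowed from the original post below; any query over the affected table
should do):

// client-side Parquet metadata: row groups can be dropped up front
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")
sqlContext.sql("select adId, adTitle from ad where groupId=10113000").count()

// task-side metadata (the default)
sc.hadoopConfiguration.set("parquet.task.side.metadata", "true")
sqlContext.sql("select adId, adTitle from ad where groupId=10113000").count()

// then compare the Input column for the two jobs in the web UI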

On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao xuelincao2...@gmail.com wrote:


 Yes, the problem is, I've turned the flag on.

 One possible reason for this is that the parquet file supports predicate
 pushdown by storing statistical min/max values for each column in each parquet
 block (row group). If, in my test, groupID=10113000 is scattered across all
 parquet blocks, then the predicate pushdown cannot skip any blocks.

 But, I'm not quite sure about that. I don't know whether there is any
 other reason that can lead to this.


 On Wed, Jan 7, 2015 at 10:14 PM, Cody Koeninger c...@koeninger.org
 wrote:

 But Xuelin already posted in the original message that the code was using

 SET spark.sql.parquet.filterPushdown=true

 On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv danielru...@gmail.com
 wrote:

 Quoting Michael:
 Predicate push down into the input format is turned off by default
 because there is a bug in the current parquet library that causes null pointer
 exceptions when there are full row groups that are null.

 https://issues.apache.org/jira/browse/SPARK-4258

 You can turn it on if you want:
 http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

 Daniel

 On 7 בינו׳ 2015, at 08:18, Xuelin Cao xuelin...@yahoo.com.INVALID
 wrote:


 Hi,

I'm testing parquet file format, and the predicate pushdown is a
 very useful feature for us.

However, it looks like the predicate push down doesn't work after
 I set
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

Here is my sql:
sqlContext.sql("select adId, adTitle from ad where
 groupId=10113000").collect

Then, I checked the amount of input data on the WEB UI. But the
 amount of input data is ALWAYS 80.2M regardless of whether I turn the 
 spark.sql.parquet.filterPushdown
 flag on or off.

I'm not sure, if there is anything that I must do when *generating
 *the parquet file in order to make the predicate pushdown available.
 (Like ORC file, when creating the ORC file, I need to explicitly sort the
 field that will be used for predicate pushdown)

Anyone have any idea?

And, anyone knows the internal mechanism for parquet predicate
 pushdown?

Thanks








Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
yeah, that makes sense. Pala, are you on a prebuilt version of Spark -- I
just tried the CDH4 prebuilt...Here is what I get for the = token:

[image: Inline image 1]

The literal type shows as 290, not 291, and 290 is numeric. According to
this
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/0.13.1/org/apache/hadoop/hive/ql/parse/HiveParser.java#HiveParser
291 is token PLUS which is really weird...


On Wed, Jan 14, 2015 at 7:47 PM, Cheng, Hao hao.ch...@intel.com wrote:

  The log showed it failed in parsing, so the typo stuff shouldn’t be the
 root cause. BUT I couldn’t reproduce that with master branch.



 I did the test as follow:



 sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.13.1 hive/console

 scala> sql("SELECT user_id FROM actions where
 conversion_aciton_id=20141210")



 sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.12.0 hive/console

 scala> sql("SELECT user_id FROM actions where
 conversion_aciton_id=20141210")





 *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
 *Sent:* Wednesday, January 14, 2015 11:12 PM
 *To:* Pala M Muthaia
 *Cc:* user@spark.apache.org
 *Subject:* Re: Issues with constants in Spark HiveQL queries



 Just a guess, but what is the type of conversion_aciton_id? I do queries
 over an epoch all the time with no issues (where epoch's type is bigint).
 You can see the source here
 https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
  --
 not sure what ASTNode type: 291 is, but it sounds like the literal is not considered
 numeric? If it's a string it should be conversion_aciton_id='20141210'
 (single quotes around the string)



 On Tue, Jan 13, 2015 at 5:25 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote:

  Hi,



 We are testing Spark SQL-Hive QL, on Spark 1.2.0. We have run some simple
 queries successfully, but we hit the following issue whenever we attempt to
 use a constant in the query predicate.



 It seems like an issue with parsing constant.



 Query: SELECT user_id FROM actions where conversion_aciton_id=20141210



 Error:

 scala.NotImplementedError: No parse rules for ASTNode type: 291, text:
 20141210 :

 20141210



 Any ideas? This seems very basic, so we may be missing something basic,
  but I haven't figured out what it is.



 ---



 Full shell output below:



 scala> sqlContext.sql("SELECT user_id FROM actions where
 conversion_aciton_id=20141210")

 15/01/13 16:55:54 INFO ParseDriver: Parsing command: SELECT user_id FROM
 actions where conversion_aciton_id=20141210

 15/01/13 16:55:54 INFO ParseDriver: Parse Completed

 15/01/13 16:55:54 INFO ParseDriver: Parsing command: SELECT user_id FROM
 actions where conversion_aciton_id=20141210

 15/01/13 16:55:54 INFO ParseDriver: Parse Completed

 java.lang.RuntimeException:

 Unsupported language features in query: SELECT user_id FROM actions where
 conversion_aciton_id=20141210

 TOK_QUERY

   TOK_FROM

 TOK_TABREF

   TOK_TABNAME

 actions

   TOK_INSERT

 TOK_DESTINATION

   TOK_DIR

 TOK_TMP_FILE

 TOK_SELECT

   TOK_SELEXPR

 TOK_TABLE_OR_COL

   user_id

 TOK_WHERE

   =

 TOK_TABLE_OR_COL

   conversion_aciton_id

 20141210



 scala.NotImplementedError: No parse rules for ASTNode type: 291, text:
 20141210 :

 20141210

  +



 org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1110)



 at scala.sys.package$.error(package.scala:27)

 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:251)

 at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)

 at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)

 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)

 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)

 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)

 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)

 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)

 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)

 at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)

 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)

 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)

 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

 at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)

 at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)

 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
Just a guess, but what is the type of conversion_aciton_id? I do queries
over an epoch all the time with no issues (where epoch's type is bigint).
You can see the source here
https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
--
not sure what ASTNode type: 291 is, but it sounds like the literal is not considered
numeric? If it's a string it should be
conversion_aciton_id='20141210' (single
quotes around the string)
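
A minimal sketch of the two forms, reusing the table and column names from
the question (which one applies depends on the actual column type):

// numeric column (e.g. bigint) -- unquoted literal
sqlContext.sql("SELECT user_id FROM actions WHERE conversion_aciton_id = 20141210")

// string column -- the literal needs single quotes
sqlContext.sql("SELECT user_id FROM actions WHERE conversion_aciton_id = '20141210'")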

On Tue, Jan 13, 2015 at 5:25 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi,

 We are testing Spark SQL-Hive QL, on Spark 1.2.0. We have run some simple
 queries successfully, but we hit the following issue whenever we attempt to
 use a constant in the query predicate.

 It seems like an issue with parsing constant.

 Query: SELECT user_id FROM actions where conversion_aciton_id=20141210

 Error:
 scala.NotImplementedError: No parse rules for ASTNode type: 291, text:
 20141210 :
 20141210

 Any ideas? This seems very basic, so we may be missing something basic,
 but I haven't figured out what it is.

 ---

 Full shell output below:

 scala> sqlContext.sql("SELECT user_id FROM actions where
 conversion_aciton_id=20141210")
 15/01/13 16:55:54 INFO ParseDriver: Parsing command: SELECT user_id FROM
 actions where conversion_aciton_id=20141210
 15/01/13 16:55:54 INFO ParseDriver: Parse Completed
 15/01/13 16:55:54 INFO ParseDriver: Parsing command: SELECT user_id FROM
 actions where conversion_aciton_id=20141210
 15/01/13 16:55:54 INFO ParseDriver: Parse Completed
 java.lang.RuntimeException:
 Unsupported language features in query: SELECT user_id FROM actions where
 conversion_aciton_id=20141210
 TOK_QUERY
   TOK_FROM
 TOK_TABREF
   TOK_TABNAME
 actions
   TOK_INSERT
 TOK_DESTINATION
   TOK_DIR
 TOK_TMP_FILE
 TOK_SELECT
   TOK_SELEXPR
 TOK_TABLE_OR_COL
   user_id
 TOK_WHERE
   =
 TOK_TABLE_OR_COL
   conversion_aciton_id
 20141210

 scala.NotImplementedError: No parse rules for ASTNode type: 291, text:
 20141210 :
 20141210
  +

 org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1110)

 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:251)
 at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
 at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 at
 scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 at
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
 at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:133)
 at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:133)
 at
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
 at
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at
 

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread Yana Kadiyska
If the wildcard path you have doesn't work you should probably open a bug
-- I had a similar problem with Parquet and it was a bug which recently got
closed. Not sure if sqlContext.avroFile shares a codepath with
.parquetFile...you
can try running with bits that have the fix for .parquetFile or look at the
source...
Here was my question for reference:
http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E

On Wed, Jan 14, 2015 at 4:34 AM, David Jones letsnumsperi...@gmail.com
wrote:

 Hi,

 I have a program that loads a single avro file using spark SQL, queries
 it, transforms it and then outputs the data. The file is loaded with:

 val records = sqlContext.avroFile(filePath)
 val data = records.registerTempTable("data")
 ...


 Now I want to run it over tens of thousands of Avro files (all with
 schemas that contain the fields I'm interested in).

 Is it possible to load multiple avro files recursively from a top-level
 directory using wildcards? All my avro files are stored under
 s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
 these on EMR.

 If that's not possible, is there some way to load multiple avro files into
 the same table/RDD so the whole dataset can be processed (and in that case
 I'd supply paths to each file concretely, but I *really* don't want to have
 to do that).
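
 For what it's worth, a hedged sketch of the concrete-path fallback mentioned
 above (paths are hypothetical, it assumes the same spark-avro avroFile API as
 the snippet at the top, and it assumes the files share one schema):

 import com.databricks.spark.avro._

 // hypothetical explicit file list -- the thing I'd like to avoid building by hand
 val paths = Seq(
   "s3://my-bucket/avros/groupA/2015-01-10/part-1.avro",
   "s3://my-bucket/avros/groupB/2015-01-10/part-1.avro")

 // load each file and union the results into a single table
 val combined = paths.map(p => sqlContext.avroFile(p)).reduce((a, b) => a.unionAll(b))
 combined.registerTempTable("data")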

 Thanks
 David


