Getting : format(target_id, ".", name), value) .. error

2021-02-08 Thread shahab
Hello,

I am getting this unclear error message when I read a Parquet file. It
seems something is wrong with the data, but what? I googled a lot but did not
find any clue. I hope some Spark experts could help me with this.

best,
Shahab


Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 94, in rdd
    jrdd = self._jdf.javaToPython()
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o622.javaToPython.


Re: Error using .collect()

2019-05-13 Thread Shahab Yunus
Kumar sp, collect() brings all the data represented by the rdd/dataframe
into the memory of the single machine which is acting as the driver. You will
run out of memory if the underlying rdd/dataframe represents a large volume
of data distributed across several machines.

If your data is huge even after grouping by, can you perform a join between
this map and your other dataset rather than trying to fit the map in memory?
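For illustration, a rough sketch of the join-based alternative (hypothetical
dataframe and column names), so nothing has to fit in driver memory:

import org.apache.spark.sql.functions.{count, lit}

// Keep the per-key counts as a DataFrame instead of collect().toMap ...
val counts = df.groupBy("key").agg(count(lit(1)).as("cnt"))

// ... and join them back onto the other dataset; the join stays distributed.
val enriched = otherDf.join(counts, Seq("key"), "left")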

Regards,
Shahab

On Mon, May 13, 2019 at 3:58 PM Kumar sp  wrote:

> I have a use case where I am using collect().toMap (group by a certain
> column, finding the count, and creating a map with a key) and use that map to
> enable some further calculations.
>
> I am getting out-of-memory errors. Is there any alternative to
> .collect() to create a structure like a Map, or some better way to handle
> this?
>
> Thank you ,
> kumar
>


Spark SQL LIMIT Gets Stuck

2019-05-01 Thread Shahab Yunus
Hi there.

I have a Hive external table (storage format is ORC, data stored on S3,
partitioned on one bigint-type column) that I am trying to query through the
pyspark (or spark-shell) shell. df.count() fails for lower values of the LIMIT
clause with the following exception (seen in the Spark UI); df.show() works.

Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)

It works for higher values of LIMIT like 100+ or if I don't use the LIMIT at
all. E.g. these work:
>>> df=spark.sql("select * from schema_name.table_name")
>>> df.count()
>>> df=spark.sql("select * from schema_name.table_name limit 100")
>>> df.count()

But these don't, and they fail with the above-mentioned exception when I do the
count. The count operations also get stuck at the same number of tasks
executed every time.
>>> spark.sql("select * from schema_name.table_name limit 2")
>>> df.count()
>>> spark.sql("select * from schema_name.table_name limit 10")
>>> df.count()

What could be the reason? How can I debug this? The cluster is a 3-node
m4.2xlarge one, but it happens with all sizes of clusters, so I don't think it
is a resource issue (especially since it works fine with all of the table data).

Version info:

Hive 2.3.3

Spark 2.3.2

EMR 5.18


Thanks.


Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread Shahab Yunus
Could be a tangential idea but might help: Why not use queryExecution and
logicalPlan objects that are available when you execute a query using
SparkSession and get a DataFrame back? The Json representation contains
almost all the info that you need and you don't need to go to Hive to get
this info.

Some details here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Dataset.html#queryExecution
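For illustration, a minimal sketch of walking the plan directly (assuming
Spark 2.x, where UnresolvedRelation carries a tableIdentifier):

import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

// Collect the names of the tables a SELECT reads from, straight off the
// unanalyzed logical plan of the returned DataFrame.
val df = spark.sql("SELECT a.id FROM db.t1 a JOIN db.t2 b ON a.id = b.id")
val inputTables = df.queryExecution.logical.collect {
  case r: UnresolvedRelation => r.tableIdentifier.unquotedString
}.distinct
inputTables.foreach(println)   // db.t1, db.t2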

On Wed, Jan 23, 2019 at 5:35 PM Ramandeep Singh Nanda 
wrote:

> Explain extended or explain would list the plan along with the tables. Not
> aware of any statements that explicitly list dependencies or tables
> directly.
>
> Regards,
> Ramandeep Singh
>
> On Wed, Jan 23, 2019, 11:05 Tomas Bartalos  wrote:
>
>> This might help:
>>
>> show tables;
>>
>> On Wed, 23 Jan 2019 at 10:43,  wrote:
>>
>>> Hi, All,
>>>
>>> We need to get all input tables of several SPARK SQL 'select' statements.
>>>
>>> We can get those information of Hive SQL statements by using 'explain
>>> dependency select'.
>>> But I can't find the equivalent command for SPARK SQL.
>>>
>>> Does anyone know how to get this information of a SPARK SQL 'select'
>>> statement?
>>>
>>> Thanks
>>>
>>> Boying
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>>
>>>
>>> This email message may contain confidential and/or privileged
>>> information. If you are not the intended recipient, please do not read,
>>> save, forward, disclose or copy the contents of this email or open any file
>>> attached to this email. We will be grateful if you could advise the sender
>>> immediately by replying this email, and delete this email and any
>>> attachment or links to this email completely and immediately from your
>>> computer system.
>>>
>>>
>>> --
>>>
>>>


Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
Two options I can think of:

1- Can you perform a union of the dfs returned by the Elasticsearch queries?
It would still be distributed, but I don't know whether you will run into a
limit on how many union operations you can perform at a time.

2- Can you use some other API method of Elasticsearch than the one which
returns a dataframe?
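For option 1, a rough sketch (assuming Spark 2.x, the QueryBuilders / esDF API
from the quoted code, and that the list of city names itself is small enough to
hold on the driver):

// Build one DataFrame per city and union them; the result is still distributed.
val cityNames = Seq("New York", "Michigan")
val perCityDfs = cityNames.map { city =>
  val qb = QueryBuilders.matchQuery("name", city).operator(Operator.AND)
  sqlContext.esDF("cities/docs", qb.toString)
}
val combined = perCityDfs.reduce(_ union _)
combined.show()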

On Fri, Dec 28, 2018 at 10:30 PM  wrote:

> I could , but only if I had it beforehand.  I do not know what the
> dataframe is until I pass the query parameter and receive the resultant
> dataframe inside the iteration.
>
>
>
> The steps are :
>
>
>
> Original DF -> Iterate -> Pass every element to a function that takes the
> element of the original DF and returns a new dataframe including all the
> matching terms
>
>
>
>
>
> From: Andrew Melo 
> Sent: Friday, December 28, 2018 8:48 PM
> To: em...@yeikel.com
> Cc: Shahab Yunus ; user 
> Subject: Re: What are the alternatives to nested DataFrames?
>
>
>
> Could you join() the DFs on a common key?
>
>
>
> On Fri, Dec 28, 2018 at 18:35  wrote:
>
> Shahab, I am not sure what you are trying to say. Could you please give
> me an example? The result of the query is a Dataframe that is created after
> iterating, so I am not sure how I could map that to a column without
> iterating and getting the values.
>
>
>
> I have a Dataframe that contains a list of cities for which I would like
> to iterate over and search in Elasticsearch.  This list is stored in
> Dataframe because it contains hundreds of thousands of elements with
> multiple properties that would not fit in a single machine.
>
>
>
> The issue is that the elastic-spark connector returns a Dataframe as well
> which leads to a dataframe creation within a Dataframe
>
>
>
> The only solution I found is to store the list of cities in a regular
> scala Seq and iterate over that, but as far as I know this would make the Seq
> centralized instead of distributed (run at the executor only?)
>
>
>
> Full example :
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> val cities = Seq("New York","Michigan")
> cities.foreach(r => {
>   val qb = QueryBuilders.matchQuery("name", r).operator(Operator.AND)
>   print(qb.toString)
>   val dfs = sqlContext.esDF("cities/docs", qb.toString) // Returns a dataframe for each city
>   dfs.show() // Works as expected. It prints the individual dataframe with the result of the query
> })
>
>
>
>
>
> val cities = Seq("New York","Michigan").toDF()
>
> cities.foreach(r => {
>   val city = r.getString(0)
>   val qb = QueryBuilders.matchQuery("name", city).operator(Operator.AND)
>   print(qb.toString)
>   val dfs = sqlContext.esDF("cities/docs", qb.toString) // null pointer
>   dfs.show()
> })
>
>
>
>
>
> From: Shahab Yunus 
> Sent: Friday, December 28, 2018 12:34 PM
> To: em...@yeikel.com
> Cc: user 
> Subject: Re: What are the alternatives to nested DataFrames?
>
>
>
> Can you have a dataframe with a column which stores json (type string)? Or
> you can also have a column of array type in which you store all cities
> matching your query.
>
>
>
>
>
>
>
> On Fri, Dec 28, 2018 at 2:48 AM  wrote:
>
> Hi community ,
>
>
>
> As shown in other answers online , Spark does not support the nesting of
> DataFrames , but what are the options?
>
>
>
> I have the following scenario :
>
>
>
> dataFrame1 = List of Cities
>
>
>
> dataFrame2 = Created after searching in ElasticSearch for each city in
> dataFrame1
>
>
>
> I've tried :
>
>
>
>  val cities= sc.parallelize(Seq("New York")).toDF()
>
>cities.foreach(r => {
>
> val companyName = r.getString(0)
>
> println(companyName)
>
> val dfs = sqlContext.esDF("cities/docs", "?q=" + companyName)
>  //returns a DataFrame consisting of all the cities matching the entry in
> cities
>
> })
>
>
>
> Which triggers the expected null pointer exception
>
>
>
> java.lang.NullPointerException
>
> at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:53)
>
> at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:51)
>
> at
> org.elasticsearch.spark.sql.package$SQLContextFunctions.esDF(package.scala:37)
>
> at Main$$anonfun$main$1.apply(Main.scala:43)
>
> at Main$$anonfun$main$1.apply(Main.scala:39)
>
> at 

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
Can you have a dataframe with a column which stores json (type string)? Or
you can also have a column of array type in which you store all cities
matching your query.
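For the array-type column idea, a rough sketch (assuming Spark 2.x, the
elasticsearch-spark esDF API from this thread, and that the index can be read
once as a single DataFrame):

import org.apache.spark.sql.functions.collect_list
import sqlContext.implicits._

// Read the index once, join it with the cities DataFrame, and aggregate the
// matches into an array-type column: no DataFrame is created inside another.
val citiesDf = Seq("New York", "Michigan").toDF("city")
val esDf = sqlContext.esDF("cities/docs")
val matches = citiesDf
  .join(esDf, esDf("name") === citiesDf("city"), "left")
  .groupBy(citiesDf("city"))
  .agg(collect_list(esDf("name")).as("matching_docs"))
matches.show()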



On Fri, Dec 28, 2018 at 2:48 AM  wrote:

> Hi community ,
>
>
>
> As shown in other answers online , Spark does not support the nesting of
> DataFrames , but what are the options?
>
>
>
> I have the following scenario :
>
>
>
> dataFrame1 = List of Cities
>
>
>
> dataFrame2 = Created after searching in ElasticSearch for each city in
> dataFrame1
>
>
>
> I've tried :
>
>
>
>  val cities= sc.parallelize(Seq("New York")).toDF()
>
>cities.foreach(r => {
>
> val companyName = r.getString(0)
>
> println(companyName)
>
> val dfs = sqlContext.esDF("cities/docs", "?q=" + companyName)
>  //returns a DataFrame consisting of all the cities matching the entry in
> cities
>
> })
>
>
>
> Which triggers the expected null pointer exception
>
>
>
> java.lang.NullPointerException
>
> at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:53)
>
> at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:51)
>
> at
> org.elasticsearch.spark.sql.package$SQLContextFunctions.esDF(package.scala:37)
>
> at Main$$anonfun$main$1.apply(Main.scala:43)
>
> at Main$$anonfun$main$1.apply(Main.scala:39)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
> at java.lang.Thread.run(Thread.java:748)
>
> 2018-12-28 02:01:00 ERROR TaskSetManager:70 - Task 7 in stage 0.0 failed 1
> times; aborting job
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent
> failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver):
> java.lang.NullPointerException
>
>
>
> What options do I have?
>
>
>
> Thank you.
>


Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
Sorry Devender, I hit the send button sooner by mistake. I meant to add
more info.

So what I was trying to say was that you can use withColumn with
when/otherwise clauses to add a column conditionally. See an example here:
https://stackoverflow.com/questions/34908448/spark-add-column-to-dataframe-conditionally
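A minimal sketch of that approach against the ds1 from your code (written in
Scala here; the Java version is analogous):

import org.apache.spark.sql.functions.{col, when}

// One withColumn with when/otherwise replaces the two filters plus the union.
val ds4 = ds1.withColumn("is_age_null", when(col("age").isNull, true).otherwise(false))
ds4.show()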

On Tue, Dec 18, 2018 at 9:55 AM Shahab Yunus  wrote:

> Have you tried using withColumn? You can add a boolean column based on
> whether the age exists or not and then drop the older age column. You
> wouldn't need union of dataframes then
>
> On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav <
> devender.ya...@impetus.co.in> wrote:
>
>> Hi All,
>>
>>
>> useful code:
>>
>> public class EmployeeBean implements Serializable {
>>
>> private Long id;
>>
>> private String name;
>>
>> private Long salary;
>>
>> private Integer age;
>>
>> // getters and setters
>>
>> }
>>
>>
>> Relevant spark code:
>>
>> SparkSession spark =
>> SparkSession.builder().master("local[2]").appName("play-with-spark").getOrCreate();
>> List<EmployeeBean> employees1 = populateEmployees(1, 10);
>>
>> Dataset<EmployeeBean> ds1 = spark.createDataset(employees1,
>> Encoders.bean(EmployeeBean.class));
>> ds1.show();
>> ds1.printSchema();
>>
>> Dataset<Row> ds2 = ds1.where("age is null").withColumn("is_age_null",
>> lit(true));
>> Dataset<Row> ds3 = ds1.where("age is not null").withColumn("is_age_null",
>> lit(false));
>>
>> Dataset<Row> ds4 = ds2.union(ds3);
>> ds4.show();
>>
>>
>> Relevant Output:
>>
>>
>> ds1
>>
>> +----+---+----+------+
>> | age| id|name|salary|
>> +----+---+----+------+
>> |null|  1|dev1| 11000|
>> |   2|  2|dev2| 12000|
>> |null|  3|dev3| 13000|
>> |   4|  4|dev4| 14000|
>> |null|  5|dev5| 15000|
>> +----+---+----+------+
>>
>>
>> ds4
>>
>> +----+---+----+------+-----------+
>> | age| id|name|salary|is_age_null|
>> +----+---+----+------+-----------+
>> |null|  1|dev1| 11000|       true|
>> |null|  3|dev3| 13000|       true|
>> |null|  5|dev5| 15000|       true|
>> |   2|  2|dev2| 12000|      false|
>> |   4|  4|dev4| 14000|      false|
>> +----+---+----+------+-----------+
>>
>>
>> Is there any better solution to add this column in the dataset rather
>> than creating two datasets and performing union?
>>
>> <
>> https://stackoverflow.com/questions/53834286/add-column-value-in-spark-dataset-on-the-basis-of-the-condition
>> >
>>
>>
>>
>> Regards,
>> Devender
>>
>> 
>>
>>
>>
>>
>>
>>
>> NOTE: This message may contain information that is confidential,
>> proprietary, privileged or otherwise protected by law. The message is
>> intended solely for the named addressee. If received in error, please
>> destroy and notify the sender. Any use of this email is prohibited when
>> received in error. Impetus does not represent, warrant and/or guarantee,
>> that the integrity of this communication has been maintained nor that the
>> communication is free of errors, virus, interception or interference.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
Have you tried using withColumn? You can add a boolean column based on
whether the age exists or not and then drop the older age column. You
wouldn't need union of dataframes then

On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav 
wrote:

> Hi All,
>
>
> useful code:
>
> public class EmployeeBean implements Serializable {
>
> private Long id;
>
> private String name;
>
> private Long salary;
>
> private Integer age;
>
> // getters and setters
>
> }
>
>
> Relevant spark code:
>
> SparkSession spark =
> SparkSession.builder().master("local[2]").appName("play-with-spark").getOrCreate();
> List<EmployeeBean> employees1 = populateEmployees(1, 10);
>
> Dataset<EmployeeBean> ds1 = spark.createDataset(employees1,
> Encoders.bean(EmployeeBean.class));
> ds1.show();
> ds1.printSchema();
>
> Dataset<Row> ds2 = ds1.where("age is null").withColumn("is_age_null",
> lit(true));
> Dataset<Row> ds3 = ds1.where("age is not null").withColumn("is_age_null",
> lit(false));
>
> Dataset<Row> ds4 = ds2.union(ds3);
> ds4.show();
>
>
> Relevant Output:
>
>
> ds1
>
> +----+---+----+------+
> | age| id|name|salary|
> +----+---+----+------+
> |null|  1|dev1| 11000|
> |   2|  2|dev2| 12000|
> |null|  3|dev3| 13000|
> |   4|  4|dev4| 14000|
> |null|  5|dev5| 15000|
> +----+---+----+------+
>
>
> ds4
>
> +----+---+----+------+-----------+
> | age| id|name|salary|is_age_null|
> +----+---+----+------+-----------+
> |null|  1|dev1| 11000|       true|
> |null|  3|dev3| 13000|       true|
> |null|  5|dev5| 15000|       true|
> |   2|  2|dev2| 12000|      false|
> |   4|  4|dev4| 14000|      false|
> +----+---+----+------+-----------+
>
>
> Is there any better solution to add this column in the dataset rather than
> creating two datasets and performing union?
>
> <
> https://stackoverflow.com/questions/53834286/add-column-value-in-spark-dataset-on-the-basis-of-the-condition
> >
>
>
>
> Regards,
> Devender
>
> 
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Parallel read parquet file, write to postgresql

2018-12-03 Thread Shahab Yunus
Hi James.

--num-executors is used to control the number of parallel tasks (one per
executor) running for your application. For reading and writing data in
parallel, data partitioning is employed. You can look here for a quick intro
to how data partitioning works:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297

You are right that numPartitions is the parameter that could be used to
control that, though in general Spark itself decides, given the data in
each stage, how to partition (i.e. how much to parallelize the read and
write of data).
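For reference, a rough sketch with assumed paths and credentials (read
parallelism follows the number of Parquet input splits; write parallelism
follows the number of DataFrame partitions, together with batchsize):

// Read: one task per input split of the Parquet data.
val mydf = spark.read.parquet("s3://my-bucket/my-data")

// Write: repartition controls how many concurrent JDBC connections write.
mydf.repartition(8)
  .write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", url)
  .option("dbtable", table_name)
  .option("user", username)
  .option("password", password)
  .option("batchsize", "10000")
  .save()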



On Mon, Dec 3, 2018 at 8:40 AM James Starks 
wrote:

> Reading Spark doc (
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html). It's
> not mentioned how to parallel read parquet file with SparkSession. Would
> --num-executors just work? Any additional parameters needed to be added to
> SparkSession as well?
>
> Also if I want to parallel write data to database, would options
> 'numPartitions' and 'batchsize' enough to improve write performance? For
> example,
>
>  mydf.format("jdbc").
>  option("driver", "org.postgresql.Driver").
>  option("url", url).
>  option("dbtable", table_name).
>  option("user", username).
>  option("password", password).
>  option("numPartitions", N) .
>  option("batchsize", M)
>  save
>
> From Spark website (
> https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases),
> I only find these two parameters that would have impact  on db write
> performance.
>
> I appreciate any suggestions.
>


Re: Convert RDD[Iterrable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread Shahab Yunus
Curious why you think this is not smart code?

On Mon, Dec 3, 2018 at 8:04 AM James Starks 
wrote:

> By taking your advice to use flatMap, I can now convert the result from
> RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically I just perform a
> flatMap in the end before starting to convert RDD object back to DF (i.e.
> SparkSession.createDataFrame(rddRecordsOfMyCaseClass)). For instance,
>
> df.map { ... }.filter{ ... }.flatMap { records => records.flatMap { record
> => Seq(record) } }
>
> Not smart code, but it works for my case.
>
> Thanks for the advice!
>
>
>
>
> ‐‐‐ Original Message ‐‐‐
> On Saturday, December 1, 2018 12:17 PM, Chris Teoh 
> wrote:
>
> Hi James,
>
> Try flatMap (_.toList). See below example:-
>
> scala> case class MyClass(i:Int)
> defined class MyClass
>
> scala> val r = 1 to 100
> r: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7,
> 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
> 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
> 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
> 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
> 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
>
> scala> val r2 = 101 to 200
> r2: scala.collection.immutable.Range.Inclusive = Range(101, 102, 103, 104,
> 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
> 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
> 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149,
> 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
> 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179,
> 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
> 195, 196, 197, 198, 199, 200)
>
> scala> val c1 = r.map(MyClass(_)).toIterable
> c1: Iterable[MyClass] = Vector(MyClass(1), MyClass(2), MyClass(3),
> MyClass(4), MyClass(5), MyClass(6), MyClass(7), MyClass(8), MyClass(9),
> MyClass(10), MyClass(11), MyClass(12), MyClass(13), MyClass(14),
> MyClass(15), MyClass(16), MyClass(17), MyClass(18), MyClass(19),
> MyClass(20), MyClass(21), MyClass(22), MyClass(23), MyClass(24),
> MyClass(25), MyClass(26), MyClass(27), MyClass(28), MyClass(29),
> MyClass(30), MyClass(31), MyClass(32), MyClass(33), MyClass(34),
> MyClass(35), MyClass(36), MyClass(37), MyClass(38), MyClass(39),
> MyClass(40), MyClass(41), MyClass(42), MyClass(43), MyClass(44),
> MyClass(45), MyClass(46), MyClass(47), MyClass(48), MyClass(49),
> MyClass(50), MyClass(51), MyClass(52), MyClass(53), MyClass(54),
> MyClass(55), MyClass(56), MyClass(57), MyClass(58), MyClass(59), MyClass(...
>
> scala> val c2 = r2.map(MyClass(_)).toIterable
> c2: Iterable[MyClass] = Vector(MyClass(101), MyClass(102), MyClass(103),
> MyClass(104), MyClass(105), MyClass(106), MyClass(107), MyClass(108),
> MyClass(109), MyClass(110), MyClass(111), MyClass(112), MyClass(113),
> MyClass(114), MyClass(115), MyClass(116), MyClass(117), MyClass(118),
> MyClass(119), MyClass(120), MyClass(121), MyClass(122), MyClass(123),
> MyClass(124), MyClass(125), MyClass(126), MyClass(127), MyClass(128),
> MyClass(129), MyClass(130), MyClass(131), MyClass(132), MyClass(133),
> MyClass(134), MyClass(135), MyClass(136), MyClass(137), MyClass(138),
> MyClass(139), MyClass(140), MyClass(141), MyClass(142), MyClass(143),
> MyClass(144), MyClass(145), MyClass(146), MyClass(147), MyClass(148),
> MyClass(149), MyClass(150), MyClass(151), MyClass(152), MyClass(153),
> MyClass(154), MyClass(15...
> scala> val rddIt = sc.parallelize(Seq(c1,c2))
> rddIt: org.apache.spark.rdd.RDD[Iterable[MyClass]] =
> ParallelCollectionRDD[2] at parallelize at :28
>
> scala> rddIt.flatMap(_.toList)
> res4: org.apache.spark.rdd.RDD[MyClass] = MapPartitionsRDD[3] at flatMap
> at :26
>
> res4 is what you're looking for.
>
>
> On Sat, 1 Dec 2018 at 21:09, Chris Teoh  wrote:
>
>> Do you have the full code example?
>>
>> I think this would be similar to the mapPartitions code flow, something
>> like flatMap( _ =>  _.toList )
>>
>> I haven't yet tested this out but this is how I'd first try.
>>
>> On Sat, 1 Dec 2018 at 01:02, James Starks 
>> wrote:
>>
>>> When processing data, I create an instance of RDD[Iterable[MyCaseClass]]
>>> and I want to convert it to RDD[MyCaseClass] so that it can be further
>>> converted to dataset or dataframe with toDS() function. But I encounter a
>>> problem that SparkContext can not be instantiated within SparkSession.map
>>> function because it already exists, even with allowMultipleContexts set to
>>> true.
>>>
>>> val sc = new SparkConf()
>>> sc.set("spark.driver.allowMultipleContexts", "true")
>>> new SparkContext(sc).parallelize(seq)
>>>
>>> How can I fix this?
>>>
>>> Thanks.
>>>
>>
>>
>> --
>> Chris
>>
>
>
> --
> Chris
>
>
>


Re: Creating spark Row from database values

2018-09-26 Thread Shahab Yunus
Hi there. Have you seen this link?
https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393


It shows you multiple ways to manually create a dataframe.
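For your two-column table, a minimal sketch (column types assumed to be
strings, as in the Oracle description):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build Rows by hand and pair them with an explicit schema.
val schema = StructType(Seq(
  StructField("CURRENT_INSTANCE_ID", StringType, nullable = true),
  StructField("PREVIOUS_INSTANCE_ID", StringType, nullable = true)
))
val rows = Seq(Row("curInstanceId1", "prevInstanceId1"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show()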

Hope it helps.

Regards,
Shahab

On Wed, Sep 26, 2018 at 8:02 AM Kuttaiah Robin  wrote:

> Hello,
>
> Currently I have Oracle database table with description as shown below;
>
> Table INSIGHT_ID_FED_IDENTIFIERS
> ---------------------------------
> CURRENT_INSTANCE_ID   VARCHAR2(100)
> PREVIOUS_INSTANCE_ID  VARCHAR2(100)
>
>
> Sample values in the table basically output of select * from
> INSIGHT_ID_FED_IDENTIFIERS. For simplicity I have put only one row.
>
>
> CURRENT_INSTANCE_ID   PREVIOUS_INSTANCE_ID
> ---   ---
> curInstanceId1  prevInstanceId1
>
>
> I have the spark schema associated with it.
>
>
> Now I need to create a Spark row(org.apache.spark.sql.Row) out of it.
>
> Can someone help me understanding on how this can be achieved?
>
> regards,
> Robin Kuttaiah
>


Unsubscribe

2018-04-23 Thread Shahab Yunus
Unsubscribe


Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Thanks guys.

@Filipp Zhinkin
Yes, we might have a couple of string columns which will have 15 million+
unique values that need to be mapped to indices.

@Nick Pentreath
We are on 2.0.2, though I will check it out. Is it better from a hashing
collision perspective, or can it handle a large volume of data as well?
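For reference, a minimal FeatureHasher sketch (Spark 2.3+, hypothetical
column names):

import org.apache.spark.ml.feature.FeatureHasher

// Hash the high-cardinality string columns straight into a fixed-size vector,
// so no label-to-index map has to be held on the driver.
val hasher = new FeatureHasher()
  .setInputCols("col_a", "col_b")
  .setOutputCol("features")
  .setNumFeatures(1 << 20)   // a larger feature space lowers the collision rate
val hashed = hasher.transform(df)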

Regards,
Shahab

On Tue, Apr 10, 2018 at 10:05 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Also check out FeatureHasher in Spark 2.3.0 which is designed to handle
> this use case in a more natural way than HashingTF (and handles multiple
> columns at once).
>
>
>
> On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin <filipp.zhin...@gmail.com>
> wrote:
>
>> Hi Shahab,
>>
>> do you actually need to have a few columns with such a huge amount of
>> categories whose value depends on original value's frequency?
>>
>> If no, then you may use value's hash code as a category or combine all
>> columns into a single vector using HashingTF.
>>
>> Regards,
>> Filipp.
>>
>> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com>
>> wrote:
>> > Is the StringIndexer keeps all the mapped label to indices in the
>> memory of
>> > the driver machine? It seems to be unless I am missing something.
>> >
>> > What if our data that needs to be indexed is huge and columns to be
>> indexed
>> > are high cardinality (or with lots of categories) and more than one such
>> > column need to be indexed? Meaning it wouldn't fit in memory.
>> >
>> > Thanks.
>> >
>> > Regards,
>> > Shahab
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does the StringIndexer
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala>
keep all the mapped label-to-index pairs in the memory of the driver machine?
It seems to, unless I am missing something.

What if the data that needs to be indexed is huge, the columns to be indexed
are high cardinality (i.e. with lots of categories), and more than one such
column needs to be indexed? Meaning it wouldn't fit in memory.

Thanks.

Regards,
Shahab


Warnings on data insert into Hive Table using PySpark

2018-03-19 Thread Shahab Yunus
Hi there. When I try to insert data into hive tables using the following
query, I get these warnings below. The data is inserted fine (the query
works without warning directly on hive cli as well.) What is the reason for
these warnings and how can we get rid of them?

I am using pyspark interpreter.

spark_session.sql("insert into schema_name.table_name
(partition_col='JobA') values ('value1', 'value2', '2018-03-10')")

-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...

Software:
Scala version 2.11.8
(OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Spark 2.0.2
Hadoop 2.7.3-amzn-0


Thanks & Regards,
Shahab


Accessing Scala RDD from pyspark

2018-03-15 Thread Shahab Yunus
Hi there.

I am calling custom Scala code from pyspark (interpreter). The custom
Scala code is simple: it just reads a text file using sparkContext.textFile
and returns an RDD[String].

In pyspark, I am using sc._jvm to make the call to the Scala code:


s_rdd = sc._jvm.package_name.class_name.method()

It returns a py4j.JavaObject. Now I want to use this in pyspark and I am doing
the following wrapping:
py_rdd = RDD(s_rdd, sparkSession)

No error yet. But when I make a call to any RDD methods using py_rdd (e.g.
py_rdd.count()), I get the following error:
py4j.protocol.Py4JError: An error occurred while calling o50.rdd. Trace:
py4j.Py4JException: Method rdd([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)

Why is that? What I am doing wrong?

Using:
Scala version 2.11.8
(OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Spark 2.0.2
Hadoop 2.7.3-amzn-0


Thanks & Regards,
Shahab


unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-06 Thread shahab
Hi,

I am trying to use Spark 1.5 (MLlib), but I keep getting
"sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming_2.10;1.5.0: not found".

It is weird that this happens, but I could not find any solution for this.
Has anyone faced the same issue?


best,
/Shahab

Here is my SBT library dependencies:

libraryDependencies ++= Seq(

"com.google.guava" % "guava" % "16.0"  ,

"org.apache.spark" % "spark-unsafe_2.10" % "1.5.0",

"org.apache.spark" % "spark-core_2.10" % "1.5.0",

"org.apache.spark" % "spark-mllib_2.10" % "1.5.0",

"org.apache.hadoop" % "hadoop-client" % "2.6.0",

"net.java.dev.jets3t" % "jets3t" % "0.9.0" % "provided",

"com.github.nscala-time" %% "nscala-time" % "1.0.0",

"org.scalatest" % "scalatest_2.10" % "2.1.3",

"junit" % "junit" % "4.8.1" % "test",

"net.jpountz.lz4" % "lz4" % "1.2.0" % "provided",

"org.clapper" %% "grizzled-slf4j" % "1.0.2",

"net.jpountz.lz4" % "lz4" % "1.2.0" % "provided"

   )


Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

2015-09-18 Thread shahab
Hi,

Probably I have wrong zeppelin  configuration, because I get the following
error when I execute spark statements in Zeppelin:

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't
running on a cluster. Deployment to YARN is not supported directly by
SparkContext. Please use spark-submit.


Anyone knows What's the solution to this?

best,
/Shahab


Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread shahab
It works using yarn-client, but I want to make it run in cluster mode. Is
there any way to do so?

best,
/Shahab

On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Can you try yarn-client mode?
>
> On Fri, Sep 18, 2015, 3:38 PM shahab <shahab.mok...@gmail.com> wrote:
>
>> Hi,
>>
>> Probably I have wrong zeppelin  configuration, because I get the
>> following error when I execute spark statements in Zeppelin:
>>
>> org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't
>> running on a cluster. Deployment to YARN is not supported directly by
>> SparkContext. Please use spark-submit.
>>
>>
>> Anyone knows What's the solution to this?
>>
>> best,
>> /Shahab
>>
>


Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread shahab
Thank you all for the comments, but my problem still exists.
@Dean, @Ewan: yes, I do have the Hadoop file system installed and working.

@Sujit: the last version of EMR (version 4)  does not need manual copying
of jar file to the server. The blog that you pointed out refers to older
version (3.x) of EMR. But I will try your solution as well.
@Neil : I think something is wrong with my fat jar file, I think I am
missing some dependencies in my jar file !

Again thank you all

/Shahab

On Wed, Sep 9, 2015 at 11:28 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

> If you log into the cluster, do you see the file if you type:
>
> hdfs dfs
> -ls 
> hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar
>
> (with the correct server address for "ipx-x-x-x"). If not, is the server
> address correct and routable inside the cluster. Recall that EC2 instances
> have both public and private host names & IP addresses.
>
> Also, is the port number correct for HDFS in the cluster?
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Sep 9, 2015 at 9:28 AM, shahab <shahab.mok...@gmail.com> wrote:
>
>> Hi,
>> I am using Spark on Amazon EMR. So far I have not succeeded to submit
>> the application successfully, not sure what's problem. In the log file I
>> see the followings.
>> java.io.FileNotFoundException: File does not exist:
>> hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar
>>
>> However, even putting spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar in the
>> fat jar file didn't solve the problem. I am out of clue now.
>> I want to submit a spark application, using aws web console, as a step. I
>> submit the application as : spark-submit --deploy-mode cluster --class
>> mypack.MyMainClass --master yarn-cluster s3://mybucket/MySparkApp.jar Is
>> there any one who has similar problem with EMR?
>>
>> best,
>> /Shahab
>>
>
>


[Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread shahab
 Hi,
I am using Spark on Amazon EMR. So far I have not succeeded in submitting the
application; I am not sure what the problem is. In the log file I see
the following:
java.io.FileNotFoundException: File does not exist:
hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

However, even putting spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar in the
fat jar file didn't solve the problem. I am out of clues now.
I want to submit a spark application, using the AWS web console, as a step. I
submit the application as:

spark-submit --deploy-mode cluster --class mypack.MyMainClass --master yarn-cluster s3://mybucket/MySparkApp.jar

Is there anyone who has a similar problem with EMR?

best,
/Shahab


Zeppelin + Spark on EMR

2015-09-07 Thread shahab
Hi,

I am trying to use Zeppelin to work with Spark on Amazon EMR. I used the
script provided by Anders (
https://gist.github.com/andershammar/224e1077021d0ea376dd) to set up
Zeppelin. Zeppelin can connect to Spark, but when I run the tutorials I get
the following error:

...FileNotFoundException: File
file:/home/hadoop/zeppelin/interpreter/spark/dep/zeppelin-spark-dependencies-0.6.0-incubating-SNAPSHOT.jar
does not exist

However, the above file does exist in that path on the Master node.

I do appreciate if anyone has any experience to share how to setup Zeppelin
with EMR .

best,
/Shahab


Re: spark - redshift !!!

2015-07-08 Thread shahab
Sorry, I misunderstood.

best,
/Shahab

On Wed, Jul 8, 2015 at 9:52 AM, spark user spark_u...@yahoo.com wrote:

 Hi 'I am looking how to load data in redshift .
 Thanks



   On Wednesday, July 8, 2015 12:47 AM, shahab shahab.mok...@gmail.com
 wrote:


 Hi,

 I did some experiment with loading data from s3 into spark. I loaded data
 from s3 using sc.textFile(). Have a look at the following code snippet:

 val csv = sc.textFile("s3n://mybucket/myfile.csv")
 val rdd = csv.map(line => line.split(",").map(elem => elem.trim))  // my data is in CSV format, comma separated
   .map(r => MyObject(r(3), r(4).toLong, r(5).toLong, r(6)))  // just map it to the target object format

 hope this helps,
 best,
 /Shahab


 On Wed, Jul 8, 2015 at 12:57 AM, spark user spark_u...@yahoo.com.invalid
 wrote:

 Hi
 Can you help me how to load data from s3 bucket to  redshift , if you gave
 sample code can you pls send me

 Thanks
 su







Performing sc.paralleize (..) in workers not in the driver program

2015-06-25 Thread shahab
Hi,

Apparently, sc.paralleize (..)  operation is performed in the driver
program not in the workers ! Is it possible to do this in worker process
for the sake of scalability?

best
/Shahab


Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-15 Thread shahab
Thanks Akhil, it solved the problem.

best
/Shahab

On Fri, Jun 12, 2015 at 8:50 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Looks like your spark is not able to pick up the HADOOP_CONF. To fix this,
  you can actually add jets3t-0.9.0.jar to the classpath
  (sc.addJar("/path/to/jets3t-0.9.0.jar")).

 Thanks
 Best Regards

 On Thu, Jun 11, 2015 at 6:44 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I tried to read a csv file from amazon s3, but I get the following
 exception which I have no clue how to solve this. I tried both spark 1.3.1
 and 1.2.1, but no success.  Any idea how to solve this is appreciated.


 best,
 /Shahab

 the code:

  val hadoopConf = sc.hadoopConfiguration

   hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

   hadoopConf.set("fs.s3.awsAccessKeyId", aws_access_key_id)

   hadoopConf.set("fs.s3.awsSecretAccessKey", aws_secret_access_key)

   val csv = sc.textFile("s3n://mybucket/info.csv")  // original file

   val data = csv.map(line => line.split(",").map(elem => elem.trim)) // lines in rows


 Here is the exception I faced:

 Exception in thread main java.lang.NoClassDefFoundError:
 org/jets3t/service/ServiceException

 at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(
 NativeS3FileSystem.java:280)

 at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(
 NativeS3FileSystem.java:270)

 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)

 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)

 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431
 )

 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)

 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)

 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)

 at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(
 FileInputFormat.java:256)

 at org.apache.hadoop.mapred.FileInputFormat.listStatus(
 FileInputFormat.java:228)

 at org.apache.hadoop.mapred.FileInputFormat.getSplits(
 FileInputFormat.java:304)

 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
 MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
 MapPartitionsRDD.scala:32)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.count(RDD.scala:1006)





Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-11 Thread shahab
Hi,

I tried to read a csv file from amazon s3, but I get the following
exception which I have no clue how to solve this. I tried both spark 1.3.1
and 1.2.1, but no success.  Any idea how to solve this is appreciated.


best,
/Shahab

the code:

val hadoopConf = sc.hadoopConfiguration

 hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

 hadoopConf.set("fs.s3.awsAccessKeyId", aws_access_key_id)

 hadoopConf.set("fs.s3.awsSecretAccessKey", aws_secret_access_key)

 val csv = sc.textFile("s3n://mybucket/info.csv")  // original file

 val data = csv.map(line => line.split(",").map(elem => elem.trim)) // lines in rows


Here is the exception I faced:

Exception in thread main java.lang.NoClassDefFoundError:
org/jets3t/service/ServiceException

at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(
NativeS3FileSystem.java:280)

at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(
NativeS3FileSystem.java:270)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)

at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)

at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(
FileInputFormat.java:256)

at org.apache.hadoop.mapred.FileInputFormat.listStatus(
FileInputFormat.java:228)

at org.apache.hadoop.mapred.FileInputFormat.getSplits(
FileInputFormat.java:304)

at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:32)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:32)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

at org.apache.spark.rdd.RDD.count(RDD.scala:1006)


Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread shahab
Hi George,

I have same issue, did you manage to find a solution?

best,
/Shahab

On Wed, May 13, 2015 at 9:21 PM, George Adams g.w.adams...@gmail.com
wrote:

  Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my
 classpath. I’ve outlined the issue on Stack Overflow (
 http://stackoverflow.com/questions/30221677/spark-sql-postgresql-data-source-issues).
 I’m not sure how to fix this since I built the uber jar using sbt-assembly
 and the final jar does have org/postgresql/Driver.class.

 —
 George Adams, IV
 Software Craftsman
 Brand Networks, Inc.
 (585) 902-8822



Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread shahab
Thanks Tristan for sharing this. Actually this happens when I am reading a
csv file of 3.5 GB.

best,
/Shahab



On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org
wrote:

 Hi Shahab,

 I’ve seen exceptions very similar to this (it also manifests as a negative
 array size exception), and I believe it’s really a bug in Kryo.

 See this thread:

 http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E

 Manifests in all of the following situations when working with an object
 graph in excess of a few GB: Joins, Broadcasts, and when using the hadoop
 save APIs.

 Tristan


 On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote:

 Can you post your code, otherwise there's not much we can do.

 Regards,

 Olivier.

 Le sam. 2 mai 2015 à 21:15, shahab shahab.mok...@gmail.com a écrit :

 Hi,

 I am using sprak-1.2.0 and I used Kryo serialization but I get the
 following excepton.

 java.io.IOException: com.esotericsoftware.kryo.KryoException:
 java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

 I do apprecciate if anyone could tell me how I can resolve this?

 best,
 /Shahab





how to make sure data is partitioned across all workers?

2015-05-04 Thread shahab
Hi,

Is there any way to force Spark to partition cached data across all
worker nodes, so that all the data is not cached on only one of the worker nodes?

best,
/Shahab


java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-04 Thread shahab
Hi,

I am getting a "No space left on device" exception when doing repartitioning
of approx. 285 MB of data, while there is still 2 GB of space left.

Does it mean that repartitioning needs more space (more than 2 GB) for
repartitioning 285 MB of data?

best,
/Shahab

java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
at 
sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:331)
at 
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-02 Thread shahab
Hi,

I am using spark-1.2.0 and I use Kryo serialization, but I get the
following exception:

java.io.IOException: com.esotericsoftware.kryo.KryoException:
java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

I would appreciate it if anyone could tell me how I can resolve this.

best,
/Shahab


is there anyway to enforce Spark to cache data in all worker nodes (almost equally) ?

2015-04-30 Thread shahab
Hi,

I load data from Cassandra into Spark. The entire data set is around 482
MB, and it is cached as temp tables in 7 tables. How can I force Spark to
cache data in both worker nodes, not only in ONE worker (as in my case)?

I am using spark 2.1.1 with spark-connector 1.2.0-rc3. I have a small
stand-alone cluster with two nodes A, B, where node A accommodates
Cassandra, the Spark Master and a Worker, and node B contains the second Spark
worker.

best,
/Shahab


Re: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread shahab
Thanks Alex, but 482 MB was just an example size, and I am looking for a
generic approach to doing this without broadcasting.

any idea?

best,
/Shahab

On Thu, Apr 30, 2015 at 4:21 PM, Alex lxv...@gmail.com wrote:

 482 MB should be small enough to be distributed as a set of broadcast
 variables. Then you can use local features of spark to process.
 --
 From: shahab shahab.mok...@gmail.com
 Sent: ‎4/‎30/‎2015 9:42 AM
 To: user@spark.apache.org
 Subject: is there anyway to enforce Spark to cache data in all worker
 nodes(almost equally) ?

 Hi,

 I load data from Cassandra into spark The entire data is almost around 482
 MB. and it is cached as TempTable in 7 tables. How can I enforce spark to
 cache data in both worker nodes not only in ONE worker (as in my case)?

 I am using spark 2.1.1 with spark-connector 1.2.0-rc3. I have small
 stand-alone cluster with two  nodes A, B. Where node A accommodates
 Cassandra,  Spark Master and Worker and node B contains the second spark
 worker.

 best,
 /Shahab



why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Hi,

I was looking at the Spark UI, Executors tab, and I noticed that I have 597 MB
for Shuffle while I am using a cached temp table and Spark had 2 GB of free
memory (the number under Memory Used is 597 MB / 2.6 GB)?!

Shouldn't Shuffle Write be zero, with all the (map/reduce) tasks being
done in memory?

best,

/Shahab


Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Thanks Saisai. I will try your solution, but I still don't understand why the
filesystem should be used when there is plenty of memory available!
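For reference, a minimal sketch of Jerry's suggestion (assumed mount point):

import org.apache.spark.{SparkConf, SparkContext}

// After mounting a tmpfs ramdisk, e.g. at /mnt/spark-ramdisk, point
// spark.local.dir at it so shuffle spill files land on memory-backed storage.
val conf = new SparkConf()
  .setAppName("shuffle-on-ramdisk")
  .set("spark.local.dir", "/mnt/spark-ramdisk")
val sc = new SparkContext(conf)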



On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com
wrote:

 Shuffle write will finally spill the data into file system as a bunch of
 files. If you want to avoid disk write, you can mount a ramdisk and
 configure spark.local.dir to this ram disk. So shuffle output will write
 to memory based FS, and will not introduce disk IO.

 Thanks
 Jerry

 2015-03-30 17:15 GMT+08:00 shahab shahab.mok...@gmail.com:

 Hi,

 I was looking at SparkUI, Executors, and I noticed that I have 597 MB for
  Shuffle while I am using cached temp-table and the Spark had 2 GB free
 memory (the number under Memory Used is 597 MB /2.6 GB) ?!!!

 Shouldn't be Shuffle Write be zero and everything (map/reduce) tasks be
 done in memory?

 best,

 /Shahab





Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-13 Thread shahab
Thanks, it makes sense.

On Thursday, March 12, 2015, Daniel Siegmann daniel.siegm...@teamaol.com
wrote:

 Join causes a shuffle (sending data across the network). I expect it will
 be better to filter before you join, so you reduce the amount of data which
 is sent across the network.

 Note this would be true for *any* transformation which causes a shuffle.
 It would not be true if you're combining RDDs with union, since that
 doesn't cause a shuffle.

 On Thu, Mar 12, 2015 at 11:04 AM, shahab shahab.mok...@gmail.com
 javascript:_e(%7B%7D,'cvml','shahab.mok...@gmail.com'); wrote:

 Hi,

 Probably this question is already answered sometime in the mailing list,
 but i couldn't find it. Sorry for posting this again.

 I need to to join and apply filtering on three different RDDs, I just
 wonder which of the following alternatives are more efficient:
 1- first joint all three RDDs and then do  filtering on resulting joint
 RDD   or
 2- Apply filtering on each individual RDD and then join the resulting RDDs


 Or probably there is no difference due to lazy evaluation and under
 beneath Spark optimisation?

 best,
 /Shahab





Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-10 Thread shahab
Hi,

I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs
can be registered as functions in HiveContext, I could not find any
documentation on how UDAFs can be registered in the HiveContext. So far,
what I have found is to make a JAR file out of the developed UDAF class and
then deploy the JAR file to SparkSQL.

But is there any way to avoid deploying the jar file and to register it
programmatically?
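One common approach is to register the Hive UDAF at runtime through
HiveContext itself; a rough sketch with a hypothetical jar path, class and
function name:

// Ship the UDAF jar and register it as a temporary Hive function at runtime.
hiveContext.sql("ADD JAR /path/to/my-udaf.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udaf AS 'com.example.MyUDAF'")
hiveContext.sql("SELECT category, my_udaf(price) FROM products GROUP BY category").show()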


best,
/Shahab


Does any one know how to deploy a custom UDAF jar file in SparkSQL?

2015-03-10 Thread shahab
Hi,

Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where
should I put the jar file so SparkSQL can pick it up and make it accessible
to SparkSQL applications?
I do not use spark-shell; instead I want to use it in a Spark application.

best,
/Shahab


Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-10 Thread shahab
Thanks Hao,
But my question concerns UDAFs (user-defined aggregation functions), not
UDTFs (user-defined table functions).
I would appreciate it if you could point me to some starting point on UDAF
development in Spark.

Thanks
Shahab

On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote:

  Currently, Spark SQL doesn’t provide interface for developing the custom
 UDTF, but it can work seamless with Hive UDTF.



 I am working on the UDTF refactoring for Spark SQL, hopefully will provide
 an Hive independent UDTF soon after that.



 *From:* shahab [mailto:shahab.mok...@gmail.com
 javascript:_e(%7B%7D,'cvml','shahab.mok...@gmail.com');]
 *Sent:* Tuesday, March 10, 2015 5:44 PM
 *To:* user@spark.apache.org
 javascript:_e(%7B%7D,'cvml','user@spark.apache.org');
 *Subject:* Registering custom UDAFs with HiveConetxt in SparkSQL, how?



 Hi,



 I need o develop couple of UDAFs and use them in the SparkSQL. While UDFs
 can be registered as a function in HiveContext, I could not find any
 documentation of how UDAFs can be registered in the HiveContext?? so far
 what I have found is to make a JAR file, out of developed UDAF class, and
 then deploy the JAR file to SparkSQL .



 But is there any way to avoid deploying the jar file and register it
 programmatically?





 best,

 /Shahab



Re: Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread shahab
Thanks Cheng, my problem was a misspelling, which I just fixed;
unfortunately the exception message sometimes does not pinpoint the exact
reason. Sorry, my bad.



On Wed, Mar 4, 2015 at 5:02 PM, Cheng, Hao hao.ch...@intel.com wrote:

  I’ve tried with latest code, seems it works, which version are you using
 Shahab?



 *From:* yana [mailto:yana.kadiy...@gmail.com]
 *Sent:* Wednesday, March 4, 2015 8:47 PM
 *To:* shahab; user@spark.apache.org
 *Subject:* RE: Does SparkSQL support . having count (fieldname) in
 SQL statement?



 I think the problem is that you are using an alias in the having clause. I
 am not able to try just now, but see if HAVING count(*) > 2 works (i.e. don't
 use cnt)





 Sent on the new Sprint Network from my Samsung Galaxy S®4.



  Original message 

 From: shahab

 Date:03/04/2015 7:22 AM (GMT-05:00)

 To: user@spark.apache.org

 Subject: Does SparkSQL support . having count (fieldname) in SQL
 statement?



 Hi,



 It seems that SparkSQL, even the HiveContext, does not support SQL
 statements like:   SELECT category, count(1) AS cnt FROM products GROUP BY
 category HAVING cnt > 10;



 I get this exception:



 Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
 Unresolved attributes: CAST(('cnt > 2), BooleanType), tree:





 I couldn't find anywhere in the documentation whether the HAVING keyword is
 supported or not.

 If it is not supported, what would be the workaround? Using two nested
 select statements?



 best,

 /Shahab



Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread shahab
Hi,

It seems that SparkSQL, even the HiveContext, does not support SQL
statements like: SELECT category, count(1) AS cnt FROM products GROUP BY
category HAVING cnt > 10;

I get this exception:

Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Unresolved attributes: CAST(('cnt > 2), BooleanType), tree:


I couldn't find anywhere in the documentation whether the HAVING keyword is
supported or not.
If this is the case, what would be the workaround? Using two nested SELECT
statements?

best,
/Shahab
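
For reference, a small sketch of the two workarounds discussed in this thread:
repeating the aggregate in HAVING instead of the alias, or filtering on the
alias from a nested select. It assumes a SQLContext/HiveContext named
sqlContext and a registered products table, as in the example query above.

  // Workaround 1: repeat the aggregate instead of referencing the alias.
  val havingCount = sqlContext.sql(
    "SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING count(1) > 10")

  // Workaround 2: nested select, filtering on the alias in the outer query.
  val nested = sqlContext.sql(
    "SELECT category, cnt FROM " +
      "(SELECT category, count(1) AS cnt FROM products GROUP BY category) t " +
      "WHERE cnt > 10")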


Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Cheng: My problem is that the connector I use to query Spark does not
support the latest Hive (0.12, 0.13), but I need to perform Hive queries on
data retrieved from Cassandra. I assumed that if I get the data out of
Cassandra in some way and register it as a temp table, I would be able to
query it using HiveContext, but it seems I cannot do this!

@Yin: Yes, it is added in Hive 0.12, but do you mean it is not supported by
HiveContext in Spark?

Thanks,
/Shahab

On Tue, Mar 3, 2015 at 5:23 PM, Yin Huai yh...@databricks.com wrote:

 Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1
 (versions that we support). Seems
 https://issues.apache.org/jira/browse/HIVE-5472 added it Hive recently.

 On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The temp table in the metastore cannot be shared across SQLContext
 instances. Since HiveContext is a subclass of SQLContext (it inherits all of
 its functionality), why not use a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContexts/HiveContexts?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



  You are right, CassandraAwareSQLContext is a subclass of SQLContext.



  But I did another experiment: I queried Cassandra
  using CassandraAwareSQLContext, then I registered the RDD as a temp table,
  and next I tried to query it using HiveContext, but it seems that the Hive
  context cannot see the table registered using the SQL context. Is this a
  normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hive UDF are only applicable for HiveContext and its subclass instance,
 is the CassandraAwareSQLContext a direct sub class of HiveContext or
 SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



    val sc: SparkContext = new SparkContext(conf)

    val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some Calliope Cassandra Spark connector

    val rdd: SchemaRDD = sqlCassContext.sql("select * from db.profile")

    rdd.cache

    rdd.registerTempTable("profile")

    rdd.first  // enforce caching

    val q = "select from_unixtime(floor(createdAt/1000)) from profile where sampling_bucket=0"

    val rdd2 = rdd.sqlContext.sql(q)

    println("Result: " + rdd2.first)



 And I get the following  errors:

 xception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling, profile,
 org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None, None,
 false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
Thanks Rohit,

I am already using Calliope and quite happy with it, well done! Except for the
fact that:
1- It seems that it does not support Hive 0.12 or higher, am I right? For
example, you cannot use the current_time() UDF, or those new UDFs added in
Hive 0.12. Are they supported? Any plan for supporting them?
2- It does not support Spark 1.1 and 1.2. Any plan for a new release?

best,
/Shahab

On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai ro...@tuplejump.com wrote:

 Hello Shahab,

  I think CassandraAwareHiveContext
  https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org/apache/spark/sql/hive/CassandraAwareHiveContext.scala
  in
  Calliope is what you are looking for. Create a CAHC instance and you should
  be able to run Hive functions against the SchemaRDD you create from there.

 Cheers,
 Rohit

 *Founder  CEO, **Tuplejump, Inc.*
 
 www.tuplejump.com
 *The Data Engineering Platform*

 On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The temp table in metastore can not be shared cross SQLContext
 instances, since HiveContext is a sub class of SQLContext (inherits all of
 its functionality), why not using a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContext/HiveContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



 You are right ,  CassandraAwareSQLContext is subclass of SQL context.



 But I did another experiment, I queried Cassandra
 using CassandraAwareSQLContext, then I registered the rdd as a temp table
 , next I tried to query it using HiveContext, but it seems that hive
 context can not see the registered table suing SQL context. Is this a
 normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hive UDF are only applicable for HiveContext and its subclass instance,
 is the CassandraAwareSQLContext a direct sub class of HiveContext or
 SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
 Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql(select * from db.profile  )

 rdd.cache

 rdd.registerTempTable(profile)

  rdd.first  //enforce caching

  val q = select  from_unixtime(floor(createdAt/1000)) from profile
 where sampling_bucket=0 

  val rdd2 = rdd.sqlContext.sql(q )

  println (Result:  + rdd2.first)



 And I get the following  errors:

 xception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling, profile,
 org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None, None,
 false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
You are right, CassandraAwareSQLContext is a subclass of SQLContext.

But I did another experiment: I queried Cassandra using
CassandraAwareSQLContext, then I registered the RDD as a temp table, and
next I tried to query it using HiveContext, but it seems that the Hive
context cannot see the table registered using the SQL context. Is this a
normal case?

best,
/Shahab


On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hive UDF are only applicable for HiveContext and its subclass instance,
 is the CassandraAwareSQLContext a direct sub class of HiveContext or
 SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
 Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql(select * from db.profile  )

 rdd.cache

 rdd.registerTempTable(profile)

  rdd.first  //enforce caching

  val q = select  from_unixtime(floor(createdAt/1000)) from profile
 where sampling_bucket=0 

  val rdd2 = rdd.sqlContext.sql(q )

  println (Result:  + rdd2.first)



 And I get the following  errors:

 xception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling, profile,
 org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None, None,
 false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)

 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)

 at
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)

 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:402)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:402)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:403)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:403)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala

[no subject]

2015-03-03 Thread shahab
 I did an experiment with the Hive and SQL contexts: I queried Cassandra
using CassandraAwareSQLContext
(a custom SQL context from Calliope), then I registered the RDD as a
temp table, and next I tried to query it using HiveContext, but it seems that
the Hive context cannot see the table registered using the SQL context. Is
this a normal case?

Stack trace:

 ERROR hive.ql.metadata.Hive -
NoSuchObjectException(message:default.MyTableName table not found)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(
HiveMetaStore.java:1373)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(
RetryingHMSHandler.java:103)

best,
/Shahab


Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Yin: Sorry for my mistake, you are right, it was added in 1.2, not 0.12.0.
My bad!

On Tue, Mar 3, 2015 at 6:47 PM, shahab shahab.mok...@gmail.com wrote:

 Thanks Rohit, yes, my mistake, it does work with 1.1 (I am actually
 running it on Spark 1.1).

 But do you mean that even the HiveContext of Spark (not the Calliope
 CassandraAwareHiveContext) does not support Hive 0.12?

 best,
 /Shahab

 On Tue, Mar 3, 2015 at 5:55 PM, Rohit Rai ro...@tuplejump.com wrote:

 The Hive dependency comes from spark-hive.

 It does work with Spark 1.1 we will have the 1.2 release later this month.
 On Mar 3, 2015 8:49 AM, shahab shahab.mok...@gmail.com wrote:


 Thanks Rohit,

 I am already using Calliope and quite happy with it, well done ! except
 the fact that :
 1- It seems that it does not support Hive 0.12 or higher, Am i right?
  for example you can not use : current_time() UDF, or those new UDFs added
 in hive 0.12 . Are they supported? Any plan for supporting them?
 2-It does not support Spark 1.1 and 1.2. Any plan for new release?

 best,
 /Shahab

 On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai ro...@tuplejump.com wrote:

 Hello Shahab,

 I think CassandraAwareHiveContext
 https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org/apache/spark/sql/hive/CassandraAwareHiveContext.scala
  in
 Calliopee is what you are looking for. Create CAHC instance and you should
 be able to run hive functions against the SchemaRDD you create from there.

 Cheers,
 Rohit

 *Founder  CEO, **Tuplejump, Inc.*
 
 www.tuplejump.com
 *The Data Engineering Platform*

 On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The temp table in metastore can not be shared cross SQLContext
 instances, since HiveContext is a sub class of SQLContext (inherits all of
 its functionality), why not using a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContext/HiveContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC
 server



 You are right ,  CassandraAwareSQLContext is subclass of SQL context.



 But I did another experiment, I queried Cassandra
 using CassandraAwareSQLContext, then I registered the rdd as a temp 
 table
 , next I tried to query it using HiveContext, but it seems that hive
 context can not see the registered table suing SQL context. Is this a
 normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com
 wrote:

  Hive UDF are only applicable for HiveContext and its subclass
 instance, is the CassandraAwareSQLContext a direct sub class of
 HiveContext or SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC
 server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used
 some Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql(select * from db.profile  )

 rdd.cache

 rdd.registerTempTable(profile)

  rdd.first  //enforce caching

  val q = select  from_unixtime(floor(createdAt/1000)) from
 profile where sampling_bucket=0 

  val rdd2 = rdd.sqlContext.sql(q )

  println (Result:  + rdd2.first)



 And I get the following  errors:

 xception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling,
 profile, org.apache.spark.sql.CassandraAwareSQLContext@778b692d,
 None, None, false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103

Re: Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread shahab
Thanks Michael. I understand now.

best,
/Shahab

On Tue, Mar 3, 2015 at 9:38 PM, Michael Armbrust mich...@databricks.com
wrote:

 As it says in the API docs
 https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD,
 tables created with registerTempTable are local to the context that creates
 them:

 ... The lifetime of this temporary table is tied to the SQLContext
 https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SQLContext.html
  that
 was used to create this SchemaRDD.


 On Tue, Mar 3, 2015 at 5:52 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I did an experiment with Hive and SQL context , I queried Cassandra using 
 CassandraAwareSQLContext
 (a custom SQL context from Calliope) , then I registered the rdd as a
 temp table , next I tried to query it using HiveContext, but it seems that
 hive context can not see the registered table suing SQL context. Is this a
 normal case?

 Stack trace:

  ERROR hive.ql.metadata.Hive -
 NoSuchObjectException(message:default.MyTableName table not found)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)

 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)

 best,
 /Shahab






Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
Thanks Rohit, yes, my mistake, it does work with 1.1 (I am actually running
it on Spark 1.1).

But do you mean that even the HiveContext of Spark (not the Calliope
CassandraAwareHiveContext) does not support Hive 0.12?

best,
/Shahab

On Tue, Mar 3, 2015 at 5:55 PM, Rohit Rai ro...@tuplejump.com wrote:

 The Hive dependency comes from spark-hive.

 It does work with Spark 1.1 we will have the 1.2 release later this month.
 On Mar 3, 2015 8:49 AM, shahab shahab.mok...@gmail.com wrote:


 Thanks Rohit,

 I am already using Calliope and quite happy with it, well done ! except
 the fact that :
 1- It seems that it does not support Hive 0.12 or higher, Am i right?
  for example you can not use : current_time() UDF, or those new UDFs added
 in hive 0.12 . Are they supported? Any plan for supporting them?
 2-It does not support Spark 1.1 and 1.2. Any plan for new release?

 best,
 /Shahab

 On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai ro...@tuplejump.com wrote:

 Hello Shahab,

 I think CassandraAwareHiveContext
 https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org/apache/spark/sql/hive/CassandraAwareHiveContext.scala
  in
 Calliopee is what you are looking for. Create CAHC instance and you should
 be able to run hive functions against the SchemaRDD you create from there.

 Cheers,
 Rohit

 *Founder  CEO, **Tuplejump, Inc.*
 
 www.tuplejump.com
 *The Data Engineering Platform*

 On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:

  The temp table in metastore can not be shared cross SQLContext
 instances, since HiveContext is a sub class of SQLContext (inherits all of
 its functionality), why not using a single HiveContext globally? Is there
 any specific requirement in your case that you need multiple
 SQLContext/HiveContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 9:46 PM

 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



 You are right ,  CassandraAwareSQLContext is subclass of SQL context.



 But I did another experiment, I queried Cassandra
 using CassandraAwareSQLContext, then I registered the rdd as a temp table
 , next I tried to query it using HiveContext, but it seems that hive
 context can not see the registered table suing SQL context. Is this a
 normal case?



 best,

 /Shahab





 On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao hao.ch...@intel.com wrote:

  Hive UDF are only applicable for HiveContext and its subclass
 instance, is the CassandraAwareSQLContext a direct sub class of
 HiveContext or SQLContext?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 5:10 PM
 *To:* Cheng, Hao
 *Cc:* user@spark.apache.org
 *Subject:* Re: Supporting Hive features in Spark SQL Thrift JDBC server



   val sc: SparkContext = new SparkContext(conf)

   val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some
 Calliope Cassandra Spark connector

 val rdd : SchemaRDD  = sqlCassContext.sql(select * from db.profile  )

 rdd.cache

 rdd.registerTempTable(profile)

  rdd.first  //enforce caching

  val q = select  from_unixtime(floor(createdAt/1000)) from profile
 where sampling_bucket=0 

  val rdd2 = rdd.sqlContext.sql(q )

  println (Result:  + rdd2.first)



 And I get the following  errors:

 xception in thread main
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

 Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

  Filter (sampling_bucket#10 = 0)

   Subquery profile

Project
 [company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

 CassandraRelation localhost, 9042, 9160, normaldb_sampling,
 profile, org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None,
 None, false, Some(Configuration: core-default.xml, core-site.xml,
 mapred-default.xml, mapred-site.xml)



 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)

 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

 at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala

Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread shahab
Hi,

I did an experiment with the Hive and SQL contexts: I queried Cassandra
using CassandraAwareSQLContext
(a custom SQL context from Calliope), then I registered the RDD as a
temp table, and next I tried to query it using HiveContext, but it seems that
the Hive context cannot see the table registered using the SQL context. Is
this a normal case?

Stack trace:

 ERROR hive.ql.metadata.Hive -
NoSuchObjectException(message:default.MyTableName table not found)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1373)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103)

best,
/Shahab
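
For reference, a minimal sketch of the single-context approach suggested in
the replies: build the SchemaRDD, register the temp table, and run the
Hive-flavoured query all through one HiveContext, so they share the same
catalog. The source table and column names below are only illustrative.

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  // Produce the SchemaRDD through the same context that will later query it.
  val profiles = hiveContext.sql("SELECT * FROM some_source_table")
  profiles.cache()
  profiles.registerTempTable("profile")

  // Hive UDFs such as from_unixtime are resolved by this same context.
  val result = hiveContext.sql(
    "SELECT from_unixtime(floor(createdAt / 1000)) FROM profile WHERE sampling_bucket = 0")
  result.collect().foreach(println)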


Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
  val sc: SparkContext = new SparkContext(conf)

  val sqlCassContext = new CassandraAwareSQLContext(sc)  // I used some Calliope Cassandra Spark connector

  val rdd: SchemaRDD = sqlCassContext.sql("select * from db.profile")

  rdd.cache

  rdd.registerTempTable("profile")

  rdd.first  // enforce caching

  val q = "select from_unixtime(floor(createdAt/1000)) from profile where sampling_bucket=0"

  val rdd2 = rdd.sqlContext.sql(q)

  println("Result: " + rdd2.first)


And I get the following  errors:

xception in thread main
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'from_unixtime('floor(('createdAt / 1000))) AS c0#7, tree:

Project ['from_unixtime('floor(('createdAt / 1000))) AS c0#7]

 Filter (sampling_bucket#10 = 0)

  Subquery profile

   Project
[company#8,bucket#9,sampling_bucket#10,profileid#11,createdat#12L,modifiedat#13L,version#14]

CassandraRelation localhost, 9042, 9160, normaldb_sampling, profile,
org.apache.spark.sql.CassandraAwareSQLContext@778b692d, None, None, false,
Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml)


at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(
Analyzer.scala:72)

at
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(
Analyzer.scala:70)

at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
TreeNode.scala:165)

at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(
TreeNode.scala:183)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

at scala.collection.AbstractIterator.to(Iterator.scala:1157)

at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265
)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(
TreeNode.scala:212)

at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(
TreeNode.scala:168)

at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156
)

at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(
Analyzer.scala:70)

at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(
Analyzer.scala:68)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(
RuleExecutor.scala:61)

at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(
RuleExecutor.scala:59)

at scala.collection.IndexedSeqOptimized$class.foldl(
IndexedSeqOptimized.scala:51)

at scala.collection.IndexedSeqOptimized$class.foldLeft(
IndexedSeqOptimized.scala:60)

at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)

at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(
RuleExecutor.scala:59)

at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(
RuleExecutor.scala:51)

at scala.collection.immutable.List.foreach(List.scala:318)

at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(
RuleExecutor.scala:51)

at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(
SQLContext.scala:402)

at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(
SQLContext.scala:402)

at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(
SQLContext.scala:403)

at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(
SQLContext.scala:403)

at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(
SQLContext.scala:407)

at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(
SQLContext.scala:405)

at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(
SQLContext.scala:411)

at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(
SQLContext.scala:411)

at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)

at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:440)

at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:103)

at org.apache.spark.rdd.RDD.first(RDD.scala:1091)

at boot.SQLDemo$.main(SQLDemo.scala:65)  //my code

 at boot.SQLDemo.main(SQLDemo.scala)  //my code

On Tue, Mar 3, 2015 at 8:57 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Can you provide the detailed failure call stack?



 *From:* shahab [mailto:shahab.mok...@gmail.com]
 *Sent:* Tuesday, March 3, 2015 3:52 PM
 *To:* user@spark.apache.org
 *Subject

Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-02 Thread shahab
Hi,

According to the Spark SQL documentation, Spark SQL supports the vast
majority of Hive features, such as user-defined functions (UDFs), and one
of these UDFs is the current_date() function, which should be supported.

However, I get an error when I use this UDF in my SQL query. There are a
couple of other UDFs which cause a similar error.

Am I missing something in my JDBC server?

/Shahab
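
For reference, since current_date() only appeared in later Hive versions, a
common workaround with the built-ins that Hive 0.12/0.13 do provide is to
derive today's date from unix_timestamp(). A hedged sketch, assuming a
HiveContext named hiveContext:

  // to_date, from_unixtime and unix_timestamp are available in Hive 0.12/0.13,
  // so this yields the current date without current_date().
  hiveContext.sql("SELECT to_date(from_unixtime(unix_timestamp()))")
    .collect().foreach(println)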


Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread shahab
Thanks Vijay, but the setup requirement for GML was not straightforward for
me at all, so I put it aside for a while.

best,
/Shahab

On Sun, Mar 1, 2015 at 9:34 AM, Vijay Saraswat vi...@saraswat.org wrote:

  GML is a fast, distributed, in-memory sparse (and dense) matrix
 library.

 It does not use RDDs for resilience. Instead we have examples that use
 Resilient X10 (which provides recovery of distributed control structures in
 case of node failure) and Hazelcast for stable storage.

 We are looking to benchmark with RDDs to compare overhead, and also
 looking to see how the same ideas could be realized on top of RDDs.



 On 2/28/15 7:25 PM, Joseph Bradley wrote:

 Hi Shahab,

  There are actually a few distributed Matrix types which support sparse
 representations: RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
 The documentation has a bit more info about the various uses:
 http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix

  The Spark 1.3 RC includes a new one: BlockMatrix.

  But since these are distributed, they are represented using RDDs, so
 they of course will not be as fast as computations on smaller, locally
 stored matrices.

  Joseph

 On Fri, Feb 27, 2015 at 4:39 AM, Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:

 try using breeze (scala linear algebra library)

 On Fri, Feb 27, 2015 at 5:56 PM, shahab shahab.mok...@gmail.com wrote:

 Thanks a lot Vijay, let me see how it performs.

  Best
 Shahab


 On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:

 Available in GML --


 http://x10-lang.org/x10-community/applications/global-matrix-library.html

 We are exploring how to make it available within Spark. Any ideas would
 be much appreciated.

 On 2/27/15 7:01 AM, shahab wrote:

 Hi,

 I just wonder if there is any Sparse Matrix implementation available
 in Spark, so it can be used in spark application?

 best,
 /Shahab



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread shahab
Thanks Joseph for the comments; I think I need to do some benchmarking.

best,
/Shahab

On Sun, Mar 1, 2015 at 1:25 AM, Joseph Bradley jos...@databricks.com
wrote:

 Hi Shahab,

 There are actually a few distributed Matrix types which support sparse
 representations: RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
 The documentation has a bit more info about the various uses:
 http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix

 The Spark 1.3 RC includes a new one: BlockMatrix.

 But since these are distributed, they are represented using RDDs, so they
 of course will not be as fast as computations on smaller, locally stored
 matrices.

 Joseph

 On Fri, Feb 27, 2015 at 4:39 AM, Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:

 try using breeze (scala linear algebra library)

 On Fri, Feb 27, 2015 at 5:56 PM, shahab shahab.mok...@gmail.com wrote:

 Thanks a lot Vijay, let me see how it performs.

 Best
 Shahab


 On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:

 Available in GML --

 http://x10-lang.org/x10-community/applications/global-
 matrix-library.html

 We are exploring how to make it available within Spark. Any ideas would
 be much appreciated.

 On 2/27/15 7:01 AM, shahab wrote:

 Hi,

 I just wonder if there is any Sparse Matrix implementation available
 in Spark, so it can be used in spark application?

 best,
 /Shahab



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Thanks a lot Vijay, let me see how it performs.

Best
Shahab

On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:

 Available in GML --

 http://x10-lang.org/x10-community/applications/global-matrix-library.html

 We are exploring how to make it available within Spark. Any ideas would be
 much appreciated.

 On 2/27/15 7:01 AM, shahab wrote:

 Hi,

 I just wonder if there is any Sparse Matrix implementation available  in
 Spark, so it can be used in spark application?

 best,
 /Shahab



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Thanks,

But do you know if access to CoordinateMatrix elements is almost as fast
as for a normal matrix, or does it have access times similar to an RDD (relatively slow)?
I am looking for a fast-access sparse matrix data structure.



On Friday, February 27, 2015, Peter Rudenko petro.rude...@gmail.com wrote:

  Yes, it's called CoordinateMatrix (
  http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix);
  you need to fill it with elements of type MatrixEntry((Long, Long,
  Double)).


 Thanks,
 Peter Rudenko
 On 2015-02-27 14:01, shahab wrote:

 Hi,

  I just wonder if there is any Sparse Matrix implementation available  in
 Spark, so it can be used in spark application?

  best,
 /Shahab
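
For reference, a minimal sketch of the CoordinateMatrix approach described
above; the entries are made up. Because the entries live in an RDD, random
access to a single element is still RDD-speed, so for fast local lookups a
driver-side structure (e.g. a Breeze sparse matrix or a plain map) is the
usual alternative, as discussed elsewhere in the thread.

  import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

  // Each non-zero cell of the sparse matrix becomes one MatrixEntry(row, col, value).
  val entries = sc.parallelize(Seq(
    MatrixEntry(0L, 1L, 2.5),
    MatrixEntry(3L, 0L, 1.0),
    MatrixEntry(7L, 4L, 4.2)))

  val matrix = new CoordinateMatrix(entries)
  println(matrix.numRows() + " x " + matrix.numCols())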





Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Hi,

I just wonder if there is any Sparse Matrix implementation available  in
Spark, so it can be used in spark application?

best,
/Shahab


Re: what does Submitting ... missing tasks from Stage mean?

2015-02-23 Thread shahab
Thanks Imran, but I would appreciate it if you could explain what this means
and what the reasons are that make it happen. I do need it.
If there is any documentation somewhere, you can simply direct me there so
I can try to understand it myself.

best,
/Shahab

On Sat, Feb 21, 2015 at 12:26 AM, Imran Rashid iras...@cloudera.com wrote:

 yeah, this is just the totally normal message when spark executes
 something.  The first time something is run, all of its tasks are
 missing.  I would not worry about cases when all tasks aren't missing
 if you're new to spark, its probably an advanced concept that you don't
 care about.  (and would take me some time to explain :)

 On Fri, Feb 20, 2015 at 8:20 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 Probably this is silly question, but I couldn't find any clear
 documentation explaining why  one  should submitting... missing tasks from
 Stage ... in the logs?

 Specially in my case when I do not have any failure in job execution, I
 wonder why this should happen?
 Does it have any relation to lazy evaluation?

 best,
 /Shahab





Access time to an element in cached RDD

2015-02-23 Thread shahab
Hi,

I just wonder what the access time would be to take one element from a
cached RDD. If I have understood correctly, access to RDD elements is not
as fast as accessing e.g. a HashMap, and it could take up to milliseconds
compared to nanoseconds in a HashMap, which is quite a significant difference
if you plan for near real-time responses from Spark.

best,

/Shahab


Re: Why is RDD lookup slow?

2015-02-20 Thread shahab
Thank you all. Just changing the RDD to a Map structure saved me approx. 1
second.

Yes, I will check out IndexedRDD to see if it has better performance.

best,
/Shahab

On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz brk...@gmail.com wrote:

 If your dataset is large, there is a Spark Package called IndexedRDD
 optimized for lookups. Feel free to check that out.

 Burak
 On Feb 19, 2015 7:37 AM, Ilya Ganelin ilgan...@gmail.com wrote:

 Hi Shahab - if your data structures are small enough a broadcasted Map is
 going to provide faster lookup. Lookup within an RDD is an O(m) operation
 where m is the size of the partition. For RDDs with multiple partitions,
 executors can operate on it in parallel so you get some improvement for
 larger RDDs.
 On Thu, Feb 19, 2015 at 7:31 AM shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am doing lookup on cached RDDs [(Int,String)], and I noticed that the
 lookup is relatively slow 30-100 ms ?? I even tried this on one machine
 with single partition, but no difference!

 The RDDs are not large at all, 3-30 MB.

 Is this expected behaviour? should I use other data structures, like
 HashMap to keep data and look up it there and use Broadcast to send a copy
 to all machines?

 best,
 /Shahab
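
For reference, a minimal sketch of the broadcast-map alternative suggested
above for small datasets (the toy pairs are made up): collect the pair RDD
once on the driver and broadcast it, then do hash lookups instead of
rdd.lookup(), which scans partitions.

  val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))   // toy data, a few MB at most
  val table = sc.broadcast(pairs.collectAsMap())

  // O(1) local lookup on every executor (and the driver), no job launched.
  println(table.value.get(2))   // Some(b)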





what does Submitting ... missing tasks from Stage mean?

2015-02-20 Thread shahab
Hi,

Probably this is a silly question, but I couldn't find any clear
documentation explaining why one should see "Submitting ... missing tasks from
Stage ..." in the logs.

Especially in my case, when I do not have any failure in job execution, I
wonder why this should happen.
Does it have any relation to lazy evaluation?

best,
/Shahab


Why is RDD lookup slow?

2015-02-19 Thread shahab
Hi,

I am doing lookups on cached RDDs [(Int, String)], and I noticed that the
lookup is relatively slow, 30-100 ms. I even tried this on one machine
with a single partition, but no difference!

The RDDs are not large at all, 3-30 MB.

Is this expected behaviour? Should I use another data structure, like a
HashMap, to keep the data and look it up there, and use a broadcast to send a
copy to all machines?

best,
/Shahab


Re: Why groupBy is slow?

2015-02-18 Thread shahab
Thanks  Francois for the comment and useful link. I understand the problem
better now.

best,
/Shahab

On Wed, Feb 18, 2015 at 10:36 AM, francois.garil...@typesafe.com wrote:

 In a nutshell : because it’s moving all of your data, compared to other
 operations (e.g. reduce) that summarize it in one form or another before
 moving it.

 For the longer answer:

 http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

 —
 FG


 On Wed, Feb 18, 2015 at 10:33 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 Based on what I could see in the Spark UI, I noticed that  groupBy
 transformation is quite slow (taking a lot of time) compared to other
 operations.

 Is there any reason that groupBy is slow?

 shahab
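
For reference, a small sketch of the point made in the linked article: when
the goal is an aggregate per key, reduceByKey combines values map-side before
the shuffle, while groupByKey ships every value across the network. The
word-count data is made up.

  val words = sc.parallelize(Seq("a", "b", "a", "c", "a"))
  val pairs = words.map(w => (w, 1))

  // Moves every (word, 1) pair across the network, then sums.
  val viaGroupBy = pairs.groupByKey().mapValues(_.sum)

  // Pre-aggregates within each partition, so far less data is shuffled.
  val viaReduce = pairs.reduceByKey(_ + _)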





Why groupBy is slow?

2015-02-18 Thread shahab
Hi,

Based on what I could see in the Spark UI, I noticed that the groupBy
transformation is quite slow (taking a lot of time) compared to other
operations.

Is there any reason that groupBy is slow?

shahab


Re: Why cached RDD is recomputed again?

2015-02-18 Thread shahab
Thanks Sean, but I don't think that fitting into memory is the issue,
because:
1- I can see in the UI that 100% of the RDD is cached (moreover, the RDD is
quite small, 100 MB, while the worker has 1.5 GB).
2- I also tried MEMORY_AND_DISK, but absolutely no difference!

Probably I have messed up somewhere else!
Do you have any other idea where I should look for the cause?

best,
/Shahab

On Wed, Feb 18, 2015 at 4:22 PM, Sean Owen so...@cloudera.com wrote:

  The most likely explanation is that you wanted to put all the
  partitions in memory and they don't all fit. Unless you asked to
  persist to memory and disk, some partitions will simply not be cached.

  Consider using MEMORY_AND_DISK persistence.

 This can also happen if blocks were lost due to node failure.

 On Wed, Feb 18, 2015 at 3:19 PM, shahab shahab.mok...@gmail.com wrote:
   Hi,

   I have a cached RDD (I can see in the UI that it is cached), but when I use
   this RDD, I can see that the RDD is partially recomputed again. It is
   partially because I can see in the UI that some tasks are skipped (have a
   look at the attached figure).

   Now the question is: what causes a cached RDD to be recomputed again, and
   why are some tasks skipped and some not?
 
  best,
  /Shahab
 
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
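
For reference, a minimal sketch of the persistence level suggested above,
which spills partitions that do not fit in memory to disk instead of dropping
them (the input path is only illustrative):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("/path/to/data")
  rdd.persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count()   // materialize the cache before relying on it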



How to unregister/re-register a TempTable in Spark?

2015-01-28 Thread shahab
Hi,

I just wonder if there is any way to unregister/re-register a TempTable in
Spark?

best,
/Shahab
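
For reference, a hedged sketch: calling registerTempTable again with the same
name in the same context simply replaces the earlier registration, and some
later releases (around Spark 1.3) also expose an explicit dropTempTable. The
source queries below are only illustrative.

  val v1 = sqlContext.sql("SELECT * FROM some_source WHERE day = 1")
  v1.registerTempTable("events")

  val v2 = sqlContext.sql("SELECT * FROM some_source WHERE day = 2")
  v2.registerTempTable("events")            // "events" now resolves to v2

  // If the Spark version provides it (roughly 1.3+), the table can also be dropped:
  // sqlContext.dropTempTable("events")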


Querying Temp table using JDBC

2014-12-19 Thread shahab
Hi,

According to the Spark documentation, data sharing between two different
Spark contexts is not possible.

So I just wonder if it is possible to first run a job that loads some data
from a DB into SchemaRDDs, then cache them and next register them as a temp
table (let's say Table_1); now I would like to open a JDBC connection
(assuming that I have set up the JDBC server on the same cluster, so it is
connected to the same master) and perform a SQL query on Table_1.

Is the above scenario feasible in Spark? Or do these two tasks simply belong
to two different Spark contexts and are therefore not runnable together?


best,
/Shahab


Querying registered RDD (AsTable) using JDBC

2014-12-19 Thread shahab
Hi,

Sorry for repeating the same question, I just wanted to clarify the issue:

Is it possible to expose an RDD (or SchemaRDD) to external components
(outside Spark) so it can be queried over JDBC? (My goal is not to place
the RDD back in a database, but to use this cached RDD to serve JDBC
queries.)

best,

/shahab
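
For reference on the two JDBC questions above, one commonly mentioned pattern
is to start the Thrift/JDBC server inside the same application, on top of the
HiveContext that holds the cached temp table, so external JDBC clients share
that context. This is a hedged sketch and assumes the Spark build includes the
Hive Thrift server module with HiveThriftServer2.startWithContext (available
in some 1.x releases); the source query is illustrative.

  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val hiveContext = new HiveContext(sc)

  val table = hiveContext.sql("SELECT * FROM some_source")
  table.cache()
  table.registerTempTable("table_1")

  // Expose this context over JDBC; clients connecting to the Thrift port can
  // now run "SELECT ... FROM table_1" against the cached data.
  HiveThriftServer2.startWithContext(hiveContext)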


HiveQL support in Cassandra-Spark connector

2014-12-15 Thread shahab
Hi,

I just wonder if Cassandra-Spark connector supports executing HiveQL on
Cassandra tables?

best,
/Shahab


Increasing the number of retry in case of job failure

2014-12-05 Thread shahab
Hello,

For some (unknown) reason, some of my tasks that fetch data from Cassandra
are failing quite often, and apparently the master removes a task which fails
more than 4 times (in my case).

Is there any way to increase the number of retries?

best,
/Shahab
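
For reference, the relevant setting is spark.task.maxFailures, which defaults
to 4; a minimal sketch of raising it when building the SparkConf:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("cassandra-job")
    .set("spark.task.maxFailures", "8")   // allow up to 8 attempts per task

  val sc = new SparkContext(conf)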


How to enforce RDD to be cached?

2014-12-03 Thread shahab
Hi,

I noticed that rdd.cache() does not happen immediately; rather, due to the
lazy nature of Spark, it happens only at the moment you perform some
map/reduce actions. Is this true?

If this is the case, how can I force Spark to cache immediately at the
cache() statement? I need this to perform some benchmarking, and I need to
separate RDD caching time from RDD transformation/action processing time.

best,
/Shahab


Re: How to enforce RDD to be cached?

2014-12-03 Thread shahab
Daniel and Paolo, thanks for the comments.

best,
/Shahab

On Wed, Dec 3, 2014 at 3:12 PM, Paolo Platter paolo.plat...@agilelab.it
wrote:

  Yes,

  otherwise you can try:

  rdd.cache().count()

  and then run your benchmark

  Paolo

   *Da:* Daniel Darabos daniel.dara...@lynxanalytics.com
 *Data invio:* ‎mercoledì‎ ‎3‎ ‎dicembre‎ ‎2014 ‎12‎:‎28
 *A:* shahab shahab.mok...@gmail.com
 *Cc:* user@spark.apache.org



 On Wed, Dec 3, 2014 at 10:52 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

  I noticed that rdd.cache() is not happening immediately rather due to
 lazy feature of Spark, it is happening just at the moment  you perform some
 map/reduce actions. Is this true?


  Yes, this is correct.

   If this is the case, how can I enforce Spark to cache immediately at
 its cache() statement? I need this to perform some benchmarking and I need
 to separate rdd caching and rdd transformation/action processing time.


   The typical solution I think is to run rdd.foreach(_ => ()) to trigger a
  calculation.
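
For reference, a small sketch combining the suggestions above to separate
caching time from processing time in a benchmark (the input path and the work
being timed are only illustrative):

  val rdd = sc.textFile("/path/to/data")

  val t0 = System.nanoTime()
  rdd.cache().count()                        // count() (or foreach(_ => ())) materializes the cache
  val cachingMs = (System.nanoTime() - t0) / 1e6

  val t1 = System.nanoTime()
  val total = rdd.map(_.length).reduce(_ + _)   // the transformation/action being benchmarked
  val processingMs = (System.nanoTime() - t1) / 1e6

  println(s"caching: $cachingMs ms, processing: $processingMs ms")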



Kryo exception for CassandraSQLRow

2014-12-01 Thread shahab
I am using the Cassandra-Spark connector to pull data from Cassandra, process
it, and write it back to Cassandra.

Now I am getting the following exception, and apparently it is a Kryo
serialization issue. Does anyone know what the reason is and how this can be
solved?

I also tried to register org.apache.spark.sql.cassandra.CassandraSQLRow
via kryo.register, but even this did not solve the problem and the exception
remains.

WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 7,
ip-X-Y-Z): com.esotericsoftware.kryo.KryoException: Unable to find class:
org.apache.spark.sql.cassandra.CassandraSQLRow
Serialization trace:
_2 (org.apache.spark.util.MutablePair)

com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)

com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)

com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:599)

com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1171)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1218)
org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:904)
org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:904)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)



I am using Spark 1.1.0 with cassandra-spark connector 1.1.0; here is the
build:

   "org.apache.spark" % "spark-mllib_2.10" % "1.1.0"
     exclude("com.google.guava", "guava"),

   "com.google.guava" % "guava" % "16.0" % "provided",

   "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
     exclude("com.google.guava", "guava") withSources() withJavadoc(),

   "org.apache.cassandra" % "cassandra-all" % "2.1.1"
     exclude("com.google.guava", "guava"),

   "org.apache.cassandra" % "cassandra-thrift" % "2.1.1"
     exclude("com.google.guava", "guava"),

   "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.2"
     exclude("com.google.guava", "guava"),

   "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
     exclude("com.google.guava", "guava") exclude("org.apache.hadoop", "hadoop-core"),

   "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided"
     exclude("com.google.guava", "guava"),

   "org.apache.spark" %% "spark-catalyst" % "1.1.0" % "provided"
     exclude("com.google.guava", "guava") exclude("org.apache.spark", "spark-core"),

   "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided"
     exclude("com.google.guava", "guava") exclude("org.apache.spark", "spark-core"),

   "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided"
     exclude("com.google.guava", "guava") exclude("org.apache.spark", "spark-core"),

   "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",

best,
/Shahab
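
For reference, a hedged sketch of registering the class through a custom
KryoRegistrator. Note, though, that "Unable to find class" during
deserialization usually means the executors' classloader cannot see the
connector classes at all, so making sure the assembly jar (or --jars) actually
ships the connector to the workers matters at least as much as the
registration itself.

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      // Register by name so this compiles without a compile-time dependency on the classes.
      kryo.register(Class.forName("org.apache.spark.sql.cassandra.CassandraSQLRow"))
      kryo.register(Class.forName("org.apache.spark.util.MutablePair"))
    }
  }

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "MyRegistrator")   // fully qualified name if it lives in a package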


Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread shahab
Hi,

I just wonder if there is any implementation for Item-based Collaborative
Filtering in Spark?

best,
/Shahab
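
For reference: MLlib's built-in recommender is ALS (model-based), but an
item-based flavour can be approximated by computing item-item cosine
similarities with RowMatrix.columnSimilarities, which appeared around Spark
1.2. A hedged sketch with made-up ratings (rows are users, columns are items):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val ratings = sc.parallelize(Seq(
    Vectors.dense(5.0, 0.0, 3.0),
    Vectors.dense(4.0, 1.0, 0.0),
    Vectors.dense(0.0, 2.0, 4.0)))

  val mat = new RowMatrix(ratings)

  // Upper-triangular item-item cosine similarities, returned as a CoordinateMatrix.
  val similarities = mat.columnSimilarities()
  similarities.entries.collect().foreach(println)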


Re: How to assign consecutive numeric id to each row based on its content?

2014-11-25 Thread shahab
Thanks a lot, both solutions work.

best,
/Shahab

On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
 increment them like so:

 val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)

 If the number of distinct keys is relatively small, you might consider
 collecting them into a map and broadcasting them rather than using a join,
 like so:

  val keyIndices = sc.broadcast(r2.collect.toMap)
  val r3 = r1.map { case (k, v) => (keyIndices.value(k), v) }

 On Tue, Nov 18, 2014 at 8:16 AM, Cheng Lian lian.cs@gmail.com wrote:

   A not so efficient way can be this:

  val r0: RDD[OriginalRow] = ...
  val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row))
  val r2 = r1.keys.distinct().zipWithIndex()
  val r3 = r2.join(r1).values

 On 11/18/14 8:54 PM, shahab wrote:

   Hi,

   In my Spark application, I am loading some rows from a database into
  Spark RDDs.
  Each row has several fields and a string key. Due to my requirements I
  need to work with consecutive numeric ids (starting from 1 to N, where N is
  the number of unique keys) instead of string keys. Also, several rows can
  have the same string key.

   In the Spark context, how can I map each row into (Numeric_Key,
  OriginalRow) with map/reduce tasks such that rows with the same original
  string key get the same consecutive numeric key?

  Any hints?

  best,
 /Shahab

   ​




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io
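
For reference, a self-contained sketch putting the two suggestions above
together: zipWithIndex shifted to start at 1, plus the broadcast-map variant
for a small key set. The row type and key extraction are only illustrative.

  case class InputRow(key: String, payload: String)

  val rows = sc.parallelize(Seq(InputRow("a", "x"), InputRow("b", "y"), InputRow("a", "z")))
  val keyed = rows.keyBy(_.key)

  // Consecutive ids 1..N, one per distinct string key.
  val ids = keyed.keys.distinct().zipWithIndex().mapValues(_ + 1)

  // Join-based mapping to (numericId, originalRow).
  val byJoin = ids.join(keyed).values

  // Broadcast-based mapping, when the number of distinct keys is small.
  val idMap = sc.broadcast(ids.collectAsMap())
  val byBroadcast = keyed.map { case (k, row) => (idMap.value(k), row) }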



Re: Spark Cassandra Guava version issues

2014-11-24 Thread shahab
I faced the same problem, and a workaround solution is here:
https://github.com/datastax/spark-cassandra-connector/issues/292

best,
/Shahab


On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote:

 I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
 sbt-assembly to create an uber jar to submit to the standalone master. I'm
 using the Hadoop 1 prebuilt binaries for Spark. As soon as I try to do
 sc.cassandraTable(...) I get an error that's likely to be a Guava
 versioning issue. I'm using the Spark Cassandra connector v 1.1.0-rc2 which
 just came out, though the issue was in rc1 as well. I can't see the
 Cassandra connector using Guava directly, so I guess Guava is a dependency of
 something else that the connector is using. Does anybody
 have a workaround for this?

 The sbt file and the exception are given below.

 Regards,
 Ashic.


 sbt file:

 import sbt._
 import Keys._
 import sbtassembly.Plugin._
 import AssemblyKeys._

 assemblySettings

 name := "foo"

 version := "0.1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq (
   "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
   "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-rc2" withSources() withJavadoc(),
   "org.specs2" %% "specs2" % "2.4" % "test" withSources()
 )

 //allow provided for run
 run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

 mergeStrategy in assembly := {
   case PathList("META-INF", xs @ _*) =>
     (xs map {_.toLowerCase}) match {
       case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
       case _ => MergeStrategy.discard
     }
   case _ => MergeStrategy.first
 }

 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

 test in assembly := {}


 Exception:
 14/11/24 14:20:11 INFO client.AppClient$ClientActor: Executor updated:
 app-20141124142008-0001/0 is now RUNNING
 Exception in thread main java.lang.NoSuchMethodError:
 com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
 at
 com.datastax.driver.core.Cluster$ConnectionReaper.init(Cluster.java:2065)
 at
 com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1163)
 at
 com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1110)
 at com.datastax.driver.core.Cluster.init(Cluster.java:118)
 at com.datastax.driver.core.Cluster.init(Cluster.java:105)
 at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:174)
 at
 com.datastax.driver.core.Cluster$Builder.build(Cluster.java:1075)
 at
 com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:81)
 at
 com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:165)
 at
 com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:160)
 at
 com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:160)
 at
 com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:36)
 at
 com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:61)
 at
 com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:71)
 at
 com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:97)
 at
 com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:108)
 at
 com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:134)
 at
 com.datastax.spark.connector.rdd.CassandraRDD.tableDef$lzycompute(CassandraRDD.scala:227)
 at
 com.datastax.spark.connector.rdd.CassandraRDD.tableDef(CassandraRDD.scala:226)
 at
 com.datastax.spark.connector.rdd.CassandraRDD.verify$lzycompute(CassandraRDD.scala:266)
 at
 com.datastax.spark.connector.rdd.CassandraRDD.verify(CassandraRDD.scala:263)
 at
 com.datastax.spark.connector.rdd.CassandraRDD.getPartitions(CassandraRDD.scala:292)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28
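For readers hitting the same NoSuchMethodError: the connector issue linked above (292) discusses the underlying Guava clash. One commonly used remedy, sketched below under the assumption of a newer sbt-assembly (0.14.x or later, which added shading support; the thread above uses an older plugin), is to shade Guava inside the assembled jar so the connector always sees the Guava it was built against:

// build.sbt -- a sketch, not the configuration from the thread above
// assumes project/plugins.sbt contains something like:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

assemblyShadeRules in assembly := Seq(
  // rename Guava's packages inside the fat jar so the older Guava bundled
  // with Spark/Hadoop on the cluster cannot shadow the connector's copy
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)

Alternatives discussed in the issue include forcing a single Guava version across all dependencies, which is what the excludes in the build file at the top of this section attempt to do.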

How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread shahab
Hi,

In my spark application, I am loading some rows from database into Spark
RDDs
Each row has several fields, and a string key. Due to my requirements I
need to work with consecutive numeric ids (starting from 1 to N, where N is
the number of unique keys) instead of string keys . Also several rows can
have same string key .

In spark context, how I can map each row into (Numeric_Key, OriginalRow) as
map/reduce  tasks such that rows with same original string key get same
numeric consecutive key?

Any hints?

best,
/Shahab


Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

2014-11-11 Thread shahab
Hi,

I have a Spark application which uses the Cassandra connector
(spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar) to load data from
Cassandra into Spark.

Everything works fine in local mode when I run it in my IDE. But when I
submit the application to be executed on a standalone Spark server, I get
the following exception (which is apparently related to Guava versions?).
Does anyone know how to solve this?

I create a jar file of my Spark application using assembly.bat, and the
following are the dependencies I used.

I put spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar in the lib/
folder of my Eclipse project; that's why it is not included in the
dependencies.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-catalyst" % "1.1.0" % "provided",
  "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),
  "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),
  "net.jpountz.lz4" % "lz4" % "1.2.0",
  "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-api") exclude("javax.servlet", "servlet-api"),
  "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" intransitive(),
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
  "com.github.nscala-time" %% "nscala-time" % "1.0.0",
  "org.scalatest" %% "scalatest" % "1.9.1" % "test",
  "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
  "org.json4s" %% "json4s-jackson" % "3.2.5",
  "junit" % "junit" % "4.8.1" % "test",
  "org.slf4j" % "slf4j-api" % "1.7.7",
  "org.slf4j" % "slf4j-simple" % "1.7.7",
  "org.clapper" %% "grizzled-slf4j" % "1.0.2",
  "log4j" % "log4j" % "1.2.17",
  "com.google.guava" % "guava" % "16.0"
)

best,

/Shahab
And this is the exception I get:

Exception in thread main
com.google.common.util.concurrent.ExecutionError:
java.lang.NoSuchMethodError:
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at
org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:39)
at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org
$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:60)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:123)
at
org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:65)


How does the number of partitions affect the performance?

2014-11-03 Thread shahab
Hi,

I just wonder how the number of partitions affects performance in Spark.

Is it just the parallelism (more partitions, more parallel sub-tasks) that
improves the performance, or are there other considerations?

In my case, I ran a couple of map/reduce jobs on the same dataset twice, with
two different partition numbers, 7 and 9. I used a standalone cluster with
two workers, where the master resides on the same machine as one of the
workers.

Surprisingly, the performance of the map/reduce jobs with 9 partitions is
almost 4X-5X better than that with 7 partitions!? Does this mean that
choosing the right number of partitions is the key factor in Spark
performance?

best,
/Shahab


Re: How does the number of partitions affect the performance?

2014-11-03 Thread shahab
Thanks Sean for the very useful comments. I now understand better what the
reasons could be that my evaluations are messed up.

best,
/Shahab

On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen so...@cloudera.com wrote:

 Yes partitions matter. Usually you can use the default, which will
 make a partition per input split, and that's usually good, to let one
 task process one block of data, which will all be on one machine.

 Reasons I could imagine why 9 partitions is faster than 7:

 Probably: Your cluster can execute at least 9 tasks concurrently. It
 will finish faster since each partition is smaller when split into 9
 partitions. This just means you weren't using your cluster's full
 parallelism at 7.

 9 partitions lets tasks execute entirely locally to the data, whereas
 7 is too few compared to how the data blocks are distributed on HDFS.
 That is, maybe 7 is inducing a shuffle whereas 9 is not for some
 reason in your code.

 Your executors are running near their memory limit and are thrashing
 in GC. With less data to process each, you may avoid thrashing and so
 go a lot faster.

 (Or there's some other factor that messed up your measurements :))


 There can be instances where more partitions is slower too.

 On Mon, Nov 3, 2014 at 9:57 AM, shahab shahab.mok...@gmail.com wrote:
  Hi,
 
  I just wonder how number of partitions effect the performance in Spark!
 
  Is it just the parallelism (more partitions, more parallel sub-tasks)
 that
  improves the performance? or there exist other considerations?
 
  In my case,I run couple of map/reduce jobs on same dataset two times with
  two different partition numbers, 7 and 9. I used a stand alone cluster,
 with
  two workers on each, where the master resides with the same machine as
 one
  of the workers.
 
  Surprisingly, the performance of map/reduce jobs in case of 9 partitions
 is
  almost  4X-5X better than that of 7 partitions !??  Does it mean that
  choosing right number of partitions is the key factor in the Spark
  performance ?
 
  best,
  /Shahab
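As a practical follow-up to Sean's points, the partition count is something you control directly. A minimal sketch of the usual knobs (the HDFS path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions on Spark 1.x

object PartitionCountDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionCountDemo").setMaster("local[4]"))

    // Explicit partition count when creating an RDD
    val data = sc.parallelize(1 to 1000000, numSlices = 9)

    // For file input, request a minimum number of partitions (placeholder path):
    // val lines = sc.textFile("hdfs://namenode/path/to/input", minPartitions = 9)

    println("partitions = " + data.partitions.length)

    // Change the partition count of an existing RDD
    val widened  = data.repartition(18)   // full shuffle
    val narrowed = widened.coalesce(9)    // narrow dependency, avoids a shuffle when shrinking

    println(narrowed.map(x => (x % 9, 1L)).reduceByKey(_ + _).count())
    sc.stop()
  }
}

A common starting point is two to four partitions per CPU core in the cluster, so that every core stays busy and each task stays small.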



Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Hi,

I am using the latest Cassandra-Spark connector to access Cassandra tables
from Spark. While I successfully managed to connect to Cassandra using
CassandraRDD, the similar SparkSQL approach does not work. Here is my code
for both methods:

import com.datastax.spark.connector._

import org.apache.spark.{SparkConf, SparkContext}

import org.apache.spark.sql._;

import org.apache.spark.SparkContext._

import org.apache.spark.sql.catalyst.expressions._

import com.datastax.spark.connector.cql.CassandraConnector

import org.apache.spark.sql.cassandra.CassandraSQLContext


  val conf = new SparkConf().setAppName("SomethingElse")
    .setMaster("local")
    .set("spark.cassandra.connection.host", "localhost")

  val sc: SparkContext = new SparkContext(conf)

  val rdd = sc.cassandraTable("mydb", "mytable")  // this works

But:

  val cc = new CassandraSQLContext(sc)

  cc.setKeyspace("mydb")

  val srdd: SchemaRDD = cc.sql("select * from mydb.mytable")

  println("count : " + srdd.count) // does not work

Exception is thrown:

Exception in thread main
com.google.common.util.concurrent.UncheckedExecutionException:
java.util.NoSuchElementException: key not found: mydb3.inverseeventtype

at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)

at com.google.common.cache.LocalCache.get(LocalCache.java:3934)

 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)




In fact, mydb3 is another keyspace which I did not even try to connect to!


Any idea?


best,

/Shahab


Here is what my SBT looks like:

libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" withSources() withJavadoc(),
  "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),
  "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),
  "net.jpountz.lz4" % "lz4" % "1.2.0",
  "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-api") exclude("javax.servlet", "servlet-api"),
  "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" intransitive(),
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
  "com.github.nscala-time" %% "nscala-time" % "1.0.0",
  "org.scalatest" %% "scalatest" % "1.9.1" % "test",
  "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
  "org.json4s" %% "json4s-jackson" % "3.2.5",
  "junit" % "junit" % "4.8.1" % "test",
  "org.slf4j" % "slf4j-api" % "1.7.7",
  "org.slf4j" % "slf4j-simple" % "1.7.7",
  "org.clapper" %% "grizzled-slf4j" % "1.0.2",
  "log4j" % "log4j" % "1.2.17")


Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Thanks Helena.
I tried setting the keyspace, but I got the same result. I also removed the
other Cassandra dependencies, but still the same exception!
I also tried to see if this setting appears in the CassandraSQLContext or
not, so I printed out the configuration:

val cc = new CassandraSQLContext(sc)

cc.setKeyspace("mydb")

cc.conf.getAll.foreach(f => println(f._1 + " : " + f._2))

printout:

spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2

spark.driver.host :192.168.1.111

spark.cassandra.connection.host : localhost

spark.cassandra.input.split.size : 1

spark.app.name : SomethingElse

spark.fileserver.uri :  http://192.168.1.111:51463

spark.driver.port : 51461

spark.master :  local

Does it have anything to do with the version of Apache Cassandra that I
use?? I use apache-cassandra-2.1.0


best,
/Shahab

The shortened SBT :

"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" withSources() withJavadoc(),
"net.jpountz.lz4" % "lz4" % "1.2.0",
"org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
"org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
"org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
"com.github.nscala-time" %% "nscala-time" % "1.0.0",
"org.scalatest" %% "scalatest" % "1.9.1" % "test",
"org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
"org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
"org.json4s" %% "json4s-jackson" % "3.2.5",
"junit" % "junit" % "4.8.1" % "test",
"org.slf4j" % "slf4j-api" % "1.7.7",
"org.slf4j" % "slf4j-simple" % "1.7.7",
"org.clapper" %% "grizzled-slf4j" % "1.0.2",
"log4j" % "log4j" % "1.2.17"

On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson helena.edel...@datastax.com
 wrote:

 Hi Shahab,

 I’m just curious, are you explicitly needing to use thrift? Just using the
 connector with spark does not require any thrift dependencies.
 Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"

 But to your question, you declare the keyspace but also unnecessarily
 repeat the keyspace.table in your select.
 Try this instead:

 val cc = new CassandraSQLContext(sc)
 cc.setKeyspace("keyspaceName")
 val result = cc.sql("SELECT * FROM tableName") // etc

 - Helena
 @helenaedelson

 On Oct 31, 2014, at 1:25 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am using the latest Cassandra-Spark Connector  to access Cassandra
 tables form Spark. While I successfully managed to connect Cassandra using
 CassandraRDD, the similar SparkSQL approach does not work. Here is my code
 for both methods:

 import com.datastax.spark.connector._

 import org.apache.spark.{SparkConf, SparkContext}

 import org.apache.spark.sql._;

 import org.apache.spark.SparkContext._

 import org.apache.spark.sql.catalyst.expressions._

 import com.datastax.spark.connector.cql.CassandraConnector

 import org.apache.spark.sql.cassandra.CassandraSQLContext


   val conf = new SparkConf().setAppName(SomethingElse)

.setMaster(local)

 .set(spark.cassandra.connection.host, localhost)

 val sc: SparkContext = new SparkContext(conf)

  val rdd = sc.cassandraTable(mydb, mytable)  // this works

 But:

 val cc = new CassandraSQLContext(sc)

  cc.setKeyspace(mydb)

  val srdd: SchemaRDD = cc.sql(select * from mydb.mytable )

 println (count :  +  srdd.count) // does not work

 Exception is thrown:

 Exception in thread main
 com.google.common.util.concurrent.UncheckedExecutionException:
 java.util.NoSuchElementException: key not found: mydb3.inverseeventtype

 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)

 at com.google.common.cache.LocalCache.get(LocalCache.java:3934)

 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)

 


 in fact mydb3 is anothery keyspace which I did not tried even to connect
 to it !


 Any idea?


 best,

 /Shahab


 Here is how my SBT looks like:

 libraryDependencies ++= Seq(

 com.datastax.spark %% spark-cassandra-connector % 1.1.0-beta1
 withSources() withJavadoc(),

 org.apache.cassandra % cassandra-all % 2.0.9 intransitive(),

 org.apache.cassandra % cassandra-thrift % 2.0.9 intransitive(),

 net.jpountz.lz4 % lz4 % 1.2.0,

 org.apache.thrift % libthrift % 0.9.1 exclude(org.slf4j,
 slf4j-api) exclude(javax.servlet, servlet-api),

 com.datastax.cassandra % cassandra-driver-core % 2.0.4
 intransitive(),

 org.apache.spark %% spark-core % 1.1.0 % provided
 exclude(org.apache.hadoop, hadoop-core),

 org.apache.spark %% spark-streaming % 1.1.0 % provided,

 org.apache.hadoop % hadoop-client % 1.0.4 % provided,

 com.github.nscala-time %% nscala-time % 1.0.0,

 org.scalatest %% scalatest % 1.9.1 % test,

 org.apache.spark %% spark-sql % 1.1.0 %  provided,

 org.apache.spark %% spark-hive % 1.1.0 % provided,

 org.json4s %% json4s-jackson % 3.2.5,

 junit % junit % 4.8.1 % test,

 org.slf4j % slf4j-api % 1.7.7,

 org.slf4j % slf4j-simple

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
OK, I created an issue. Hopefully it will be resolved soon.

Again thanks,

best,
/Shahab

On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson helena.edel...@datastax.com
 wrote:

 Hi Shahab,
 The apache cassandra version looks great.

 I think that doing
 cc.setKeyspace("mydb")
 cc.sql("SELECT * FROM mytable")

 versus
 cc.setKeyspace("mydb")
 cc.sql("select * from mydb.mytable")

 Is the problem? And if not, would you mind creating a ticket off-list for
 us to help further? You can create one here:
 https://github.com/datastax/spark-cassandra-connector/issues
 with tag: help wanted :)

 Cheers,

 - Helena
 @helenaedelson

 On Oct 31, 2014, at 1:59 PM, shahab shahab.mok...@gmail.com wrote:

 Thanks Helena.
 I tried setting the KeySpace, but I got same result. I also removed
 other Cassandra dependencies,  but still same exception!
 I also tried to see if this setting appears in the CassandraSQLContext or
 not, so I printed out the output of configustion

 val cc = new CassandraSQLContext(sc)

 cc.setKeyspace(mydb)

 cc.conf.getAll.foreach(f = println (f._1  +  :  +  f._2))

 printout:

 spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2

 spark.driver.host :192.168.1.111

 spark.cassandra.connection.host : localhost

 spark.cassandra.input.split.size : 1

 spark.app.name : SomethingElse

 spark.fileserver.uri :  http://192.168.1.111:51463

 spark.driver.port : 51461

 spark.master :  local

 Does it have anything to do with the version of Apache Cassandra that I
 use?? I use apache-cassandra-2.1.0


 best,
 /Shahab

 The shortened SBT :

 com.datastax.spark %% spark-cassandra-connector % 1.1.0-beta1
 withSources() withJavadoc(),

 net.jpountz.lz4 % lz4 % 1.2.0,

 org.apache.spark %% spark-core % 1.1.0 % provided
 exclude(org.apache.hadoop, hadoop-core),

 org.apache.spark %% spark-streaming % 1.1.0 % provided,

 org.apache.hadoop % hadoop-client % 1.0.4 % provided,

 com.github.nscala-time %% nscala-time % 1.0.0,

 org.scalatest %% scalatest % 1.9.1 % test,

 org.apache.spark %% spark-sql % 1.1.0 %  provided,

 org.apache.spark %% spark-hive % 1.1.0 % provided,

 org.json4s %% json4s-jackson % 3.2.5,

 junit % junit % 4.8.1 % test,

 org.slf4j % slf4j-api % 1.7.7,

 org.slf4j % slf4j-simple % 1.7.7,

 org.clapper %% grizzled-slf4j % 1.0.2,

 log4j % log4j % 1.2.17

 On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson 
 helena.edel...@datastax.com wrote:

 Hi Shahab,

 I’m just curious, are you explicitly needing to use thrift? Just using
 the connector with spark does not require any thrift dependencies.
 Simply: com.datastax.spark %% spark-cassandra-connector %
 1.1.0-beta1”

 But to your question, you declare the keyspace but also unnecessarily
 repeat the keyspace.table in your select.
 Try this instead:

 val cc = new CassandraSQLContext(sc)
 cc.setKeyspace(“keyspaceName)
 val result = cc.sql(SELECT * FROM tableName”) etc

 - Helena
 @helenaedelson

 On Oct 31, 2014, at 1:25 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am using the latest Cassandra-Spark Connector  to access Cassandra
 tables form Spark. While I successfully managed to connect Cassandra using
 CassandraRDD, the similar SparkSQL approach does not work. Here is my code
 for both methods:

 import com.datastax.spark.connector._

 import org.apache.spark.{SparkConf, SparkContext}

 import org.apache.spark.sql._;

 import org.apache.spark.SparkContext._

 import org.apache.spark.sql.catalyst.expressions._

 import com.datastax.spark.connector.cql.CassandraConnector

 import org.apache.spark.sql.cassandra.CassandraSQLContext


   val conf = new SparkConf().setAppName(SomethingElse)

.setMaster(local)

 .set(spark.cassandra.connection.host, localhost)

 val sc: SparkContext = new SparkContext(conf)

  val rdd = sc.cassandraTable(mydb, mytable)  // this works

 But:

 val cc = new CassandraSQLContext(sc)

  cc.setKeyspace(mydb)

  val srdd: SchemaRDD = cc.sql(select * from mydb.mytable )

 println (count :  +  srdd.count) // does not work

 Exception is thrown:

 Exception in thread main
 com.google.common.util.concurrent.UncheckedExecutionException:
 java.util.NoSuchElementException: key not found: mydb3.inverseeventtype

 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)

 at com.google.common.cache.LocalCache.get(LocalCache.java:3934)

 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)

 


 in fact mydb3 is anothery keyspace which I did not tried even to connect
 to it !


 Any idea?


 best,

 /Shahab


 Here is how my SBT looks like:

 libraryDependencies ++= Seq(

 com.datastax.spark %% spark-cassandra-connector % 1.1.0-beta1
 withSources() withJavadoc(),

 org.apache.cassandra % cassandra-all % 2.0.9 intransitive(),

 org.apache.cassandra % cassandra-thrift % 2.0.9 intransitive(),

 net.jpountz.lz4 % lz4 % 1.2.0,

 org.apache.thrift % libthrift % 0.9.1

Best way to partition RDD

2014-10-30 Thread shahab
Hi.

I am running an application in Spark which first loads data from
Cassandra and then performs some map/reduce jobs.

val srdd = sqlContext.sql("select * from mydb.mytable")

I noticed that the srdd has only one partition, no matter how big the
data loaded from Cassandra is.

So I repartition the RDD and then run the map/reduce functions.

But the main problem is that the repartition takes so much time (almost 2
min), which is not acceptable in my use case. Is there any better way to do
the repartitioning?

best,
/Shahab


Re: Best way to partition RDD

2014-10-30 Thread shahab
Hi Helena,

Well... I am just running a toy example. I have one Cassandra node
co-located with the Spark master and one of the Spark workers, all on one
machine. I have another node which runs the second Spark worker.

/Shahab,


On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson helena.edel...@datastax.com
 wrote:

 Hi Shahab,
 -How many spark/cassandra nodes are in your cluster?
 -What is your deploy topology for spark and cassandra clusters? Are they
 co-located?

 - Helena
 @helenaedelson

 On Oct 30, 2014, at 12:16 PM, shahab shahab.mok...@gmail.com wrote:

 Hi.

 I am running an application in the Spark which first loads data from
 Cassandra and then performs some map/reduce jobs.

 val srdd = sqlContext.sql(select * from mydb.mytable   )
 I noticed that the srdd only has one partition . no matter how big is
 the data loaded form Cassandra.

 So I perform repartition on the RDD , and then I did the map/reduce
 functions.

 But the main problem is that repartition takes so much time (almost 2
 min), which is not acceptable in my use-case. Is there any better way to do
 repartitioning?

 best,
 /Shahab





Doing RDD.count in parallel, or at least parallelizing it as much as possible?

2014-10-30 Thread shahab
Hi,

I noticed that the count (of an RDD) is, in many of my queries, the most
time-consuming step, as it seems to run in the driver process rather than
being done in parallel by the worker nodes.

Is there any way to perform the count in parallel, or at least parallelize
it as much as possible?

best,
/Shahab
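A clarification that may help here: count() is already a distributed action. Each task counts its own partition on the workers and only the per-partition counts are summed on the driver, so the expensive part is usually recomputing the RDD's lineage rather than the counting itself. A small sketch making that explicit, plus countApprox (an experimental RDD method) for an approximate answer within a time budget:

import org.apache.spark.{SparkConf, SparkContext}

object CountDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CountDemo").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000000, 8).cache()  // cache if you count more than once

    // count() runs one task per partition; only 8 partial counts reach the driver
    println(rdd.count())

    // the same computation spelled out explicitly:
    val manual = rdd.mapPartitions(iter => Iterator(iter.size.toLong)).reduce(_ + _)
    println(manual)

    // approximate count within a 1 second budget, 95% confidence
    println(rdd.countApprox(1000, 0.95))

    sc.stop()
  }
}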


Re: Best way to partition RDD

2014-10-30 Thread shahab
Thanks Helena, very useful comment.
But is 'spark.cassandra.input.split.size' only effective in a cluster, not on a
single node?

best,
/Shahab

On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson helena.edel...@datastax.com
 wrote:

 Shahab,

 Regardless, WRT cassandra and spark when using the spark cassandra
 connector,  ‘spark.cassandra.input.split.size’ passed into the SparkConf
 configures the approx number of Cassandra partitions in a Spark partition
 (default 10).
 No repartitioning should be necessary with what you have below, but I
 don’t know if you are running on one node or a cluster.

 This is a good initial guide:

 https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#configuration-options-for-adjusting-reads

 https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L26-L37

 Cheers,
 Helena
 @helenaedelson

 On Oct 30, 2014, at 1:12 PM, Helena Edelson helena.edel...@datastax.com
 wrote:

 Hi Shahab,
 -How many spark/cassandra nodes are in your cluster?
 -What is your deploy topology for spark and cassandra clusters? Are they
 co-located?

 - Helena
 @helenaedelson

 On Oct 30, 2014, at 12:16 PM, shahab shahab.mok...@gmail.com wrote:

 Hi.

 I am running an application in the Spark which first loads data from
 Cassandra and then performs some map/reduce jobs.

 val srdd = sqlContext.sql(select * from mydb.mytable   )
 I noticed that the srdd only has one partition . no matter how big is
 the data loaded form Cassandra.

 So I perform repartition on the RDD , and then I did the map/reduce
 functions.

 But the main problem is that repartition takes so much time (almost 2
 min), which is not acceptable in my use-case. Is there any better way to do
 repartitioning?

 best,
 /Shahab
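A minimal sketch of applying Helena's suggestion (the exact default of spark.cassandra.input.split.size varies between connector versions, so treat the number below as illustrative): the property has to be set on the SparkConf before the SparkContext is created, and lower values give more, smaller Spark partitions per Cassandra table scan.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds sc.cassandraTable

object CassandraSplitDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraSplitDemo")
      .set("spark.cassandra.connection.host", "localhost")
      // approx number of Cassandra partitions per Spark partition (illustrative value)
      .set("spark.cassandra.input.split.size", "1000")
    val sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("mydb", "mytable")
    println("partitions = " + rdd.partitions.length)

    sc.stop()
  }
}

With enough input splits coming straight from the connector, the expensive repartition() call in the original question should not be needed at all.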






Why RDD is not cached?

2014-10-28 Thread shahab
Hi,

I have a standalone Spark cluster where each executor is set to have 6.3 GB of
memory; as I am using two workers, in total there is 12.6 GB of memory and 4 cores.

I am trying to cache an RDD of approximately 3.2 GB, but apparently it is not
cached: I neither see "BlockManagerMasterActor: Added rdd_XX in memory" nor any
improvement in task performance.

But why is it not cached when there is enough memory for storage?
I tried with smaller RDDs of 1 or 2 GB and it works; at least I could see
"BlockManagerMasterActor: Added rdd_0_1 in memory" and an improvement in results.

Any idea what I am missing in my settings, or... ?

thanks,
/Shahab


Re: Why RDD is not cached?

2014-10-28 Thread shahab
I used cache() followed by a count on the RDD to ensure that caching is
performed.

val rdd = srdd.flatMap(mapProfile_To_Sessions).cache

val count = rdd.count

// so at this point the RDD should be cached, right?

On Tue, Oct 28, 2014 at 8:35 AM, Sean Owen so...@cloudera.com wrote:

 Did you just call cache()? By itself it does nothing but once an action
 requires it to be computed it should become cached.
 On Oct 28, 2014 8:19 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I have a standalone spark , where the executor is set to have 6.3 G
 memory , as I am using two workers so in total there 12.6 G memory and 4
 cores.

 I am trying to cache a RDD with approximate size of 3.2 G, but apparently
 it is not cached as neither I can seeBlockManagerMasterActor: Added
 rdd_XX in memory  nor  the performance of running the tasks is improved

 But, why it is not cached when there is enough memory storage?
 I tried with smaller RDDs. 1 or 2 G and it works, at least I could see 
 BlockManagerMasterActor:
 Added rdd_0_1 in memory and improvement in results.

 Any idea what I am missing in my settings, or... ?

 thanks,
 /Shahab
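One way to see what actually happened, sketched below with stand-in data: if the RDD does not fit in the memory reserved for storage (spark.storage.memoryFraction, 0.6 of the executor heap by default in Spark 1.x), MEMORY_ONLY silently drops blocks, whereas MEMORY_AND_DISK_SER spills them; getStorageLevel and getRDDStorageInfo report what was really cached.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheCheck").setMaster("local[2]"))

    // stand-in for the flatMap(mapProfile_To_Sessions) RDD from the thread
    val rdd = sc.parallelize(1 to 1000000, 8)
      .map(i => (i, "session-" + i))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)  // spill instead of dropping blocks
    rdd.setName("sessions")

    rdd.count()  // first action materializes and caches the RDD

    println(rdd.getStorageLevel)
    sc.getRDDStorageInfo.foreach { info =>
      println(info.name + ": " + info.numCachedPartitions + "/" + info.numPartitions +
        " partitions cached, " + info.memSize + " bytes in memory, " +
        info.diskSize + " bytes on disk")
    }
    sc.stop()
  }
}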




How many executor processes does an application receive?

2014-10-28 Thread shahab
Hi,

I am running a standalone Spark cluster, 2 workers, each with 2 cores.
I submit one Spark application to the cluster, and I monitor the execution
via the UI (both worker-ip:8081 and master-ip:4040).
There I can see that the application is handled by many executors; in my
case one worker has 10 executors and the other one only one!

My question is: what is the cardinality relation between executor processes
and the submitted Spark application? I assumed that for each application
there would be one executor process handling all the Spark-related tasks
(maps, filtering, reduces, ...).

best,
/Shahab
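A hedged note plus a sketch: in standalone mode on Spark 1.x an application normally gets at most one executor process per worker it is granted cores on, so ten executor entries for one worker in the UI usually indicate executors that exited and were relaunched rather than ten live processes; the worker UI's list of finished executors (with exit reasons) is one way to confirm that. What a single application receives is bounded by settings like the following (values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ExecutorSizingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ExecutorSizingDemo")
      .set("spark.cores.max", "4")         // total cores this application may take
      .set("spark.executor.memory", "2g")  // heap for each executor process
    val sc = new SparkContext(conf)

    println("cores.max = " + sc.getConf.get("spark.cores.max"))
    sc.stop()
  }
}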


Re: How can number of partitions be set in spark-env.sh?

2014-10-28 Thread shahab
Thanks for the useful comment. But I guess this setting applies only when I
use SparkSQL, right? Is there any similar setting for plain Spark?

best,
/Shahab

On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wanda_haw...@yahoo.com wrote:

 Is this what are you looking for ?

 In Shark, default reducer number is 1 and is controlled by the property
 mapred.reduce.tasks. Spark SQL deprecates this property in favor of
 spark.sql.shuffle.partitions, whose default value is 200. Users may
 customize this property via SET:

 SET spark.sql.shuffle.partitions=10;
 SELECT page, count(*) c
 FROM logs_last_month_cached
 GROUP BY page ORDER BY c DESC LIMIT 10;


 Spark SQL Programming Guide - Spark 1.1.0 Documentation
 http://spark.apache.org/docs/latest/sql-programming-guide.html








   --
  *From:* shahab shahab.mok...@gmail.com
 *To:* user@spark.apache.org
 *Sent:* Tuesday, October 28, 2014 3:20 PM
 *Subject:* How can number of partitions be set in spark-env.sh?

 I am running a standalone Spark cluster, 2 workers, each with 2 cores.
 Apparently, I am loading and processing a relatively large chunk of data, so
 I get task failures. As I read from some posts and discussions on the
 mailing list, the failures could be related to the large amount of data
 processed per partition, and if I have understood correctly I should have
 smaller partitions (but more of them).

 Is there any way that I can set the number of partitions dynamically in
 spark-env.sh or in the submitted Spark application?


 best,
 /Shahab
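For core Spark (as opposed to Spark SQL's spark.sql.shuffle.partitions), the analogous knob is spark.default.parallelism, and most shuffle transformations also accept an explicit numPartitions argument. A minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions on Spark 1.x

object DefaultParallelismDemo {
  def main(args: Array[String]): Unit = {
    // could equally be passed at submit time:
    //   spark-submit --conf spark.default.parallelism=16 ...
    val conf = new SparkConf()
      .setAppName("DefaultParallelismDemo")
      .setMaster("local[4]")
      .set("spark.default.parallelism", "16")
    val sc = new SparkContext(conf)

    println("default parallelism: " + sc.defaultParallelism)

    val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, 1))
    val byDefault  = pairs.reduceByKey(_ + _)       // picks up spark.default.parallelism
    val byExplicit = pairs.reduceByKey(_ + _, 32)   // explicit partition count wins

    println(byDefault.partitions.length + " vs " + byExplicit.partitions.length)
    sc.stop()
  }
}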





What does this exception mean? ConnectionManager: key already cancelled?

2014-10-27 Thread shahab
Hi,

I have a standalone Spark cluster, where the worker and master reside on the
same machine. I submit a job to the cluster; the job executes for a
while and suddenly I get this exception, with no additional trace.

ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@2490dce9
java.nio.channels.CancelledKeyException at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)


Any idea where I should look for the cause?

best,
/shahab

The following is part of the printout from the driver application logs:

14/10/27 15:21:15 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
ip-10-89-32-179.eu-west-1.compute.internal:40479 in memory (size: 3.4 KB,
free: 1565.6 MB)
14/10/27 15:21:15 INFO ContextCleaner: Cleaned broadcast 1
14/10/27 15:21:15 INFO ShuffleBlockManager: Could not find files for
shuffle 1 for deleting
14/10/27 15:21:15 INFO ContextCleaner: Cleaned shuffle 1
14/10/27 15:21:15 INFO ShuffleBlockManager: Could not find files for
shuffle 0 for deleting
14/10/27 15:21:15 INFO ContextCleaner: Cleaned shuffle 0
14/10/27 15:21:15 INFO BlockManagerInfo: Removed taskresult_9 on
ip-10-zz.xx-yy:40479 in memory (size: 24.1 MB, free: 1589.8 MB)
14/10/27 15:21:16 INFO DAGScheduler: Stage 7 (collect at
TimeBenchmarking_SimpleModel.scala:55) finished in 3.209 s
14/10/27 15:21:16 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID
9) in 2640 ms onip-10-zz.xx-yy (1/1)
14/10/27 15:21:16 INFO SparkContext: Job finished: collect at
TimeBenchmarking_SimpleModel.scala:55, took 102.661420511 s
14/10/27 15:21:16 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks
have all completed, from pool
14/10/27 15:21:16 INFO SparkUI: Stopped Spark web UI at
http://ip-10-zz.xx-yy:4040
14/10/27 15:21:16 INFO DAGScheduler: Stopping DAGScheduler
14/10/27 15:21:16 INFO SparkDeploySchedulerBackend: Shutting down all
executors
14/10/27 15:21:16 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down
14/10/27 15:21:16 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(ip-10-zz.xx-yy, 40479)
14/10/27 15:21:16 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(ip-10-zz.xx-yy,40479)
14/10/27 15:21:16 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(ip-10-zz.xx-yy,40479)
14/10/27 15:21:16 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@2490dce9
14/10/27 15:21:16 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@2490dce9
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/27 15:21:17 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor
stopped!
14/10/27 15:21:17 INFO ConnectionManager: Selector thread was interrupted!


Measuring execution time

2014-10-24 Thread shahab
Hi,

I just wonder if there is any built-in function to get the execution time
for each of the jobs/tasks. In simple words, how can I find out how much
time is spent on the loading/mapping/filtering/reducing parts of a job? I can
see printouts in the logs, but since there is no clear presentation of the
underlying DAG and its associated tasks, it is hard to find what I am looking
for.

best,
/Shahab
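There is no single built-in "time this job" call, but the web UI on port 4040 already breaks jobs into stages and tasks with their durations, and the same numbers are available programmatically through a SparkListener. A sketch using the Spark 1.x listener API:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

object StageTimingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageTimingDemo").setMaster("local[2]"))

    // print how long each stage (a run of map/filter/... between shuffles) took
    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        val info = stage.stageInfo
        val ms = for (s <- info.submissionTime; c <- info.completionTime) yield c - s
        println("stage '" + info.name + "' took " + ms.getOrElse(-1L) + " ms")
      }
    })

    // crude wall-clock timing around one action
    val t0 = System.nanoTime()
    sc.parallelize(1 to 1000000, 8).map(_ * 2).filter(_ % 3 == 0).count()
    println("count took " + (System.nanoTime() - t0) / 1e6 + " ms")

    sc.stop()
  }
}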


Does SparkSQL support Hive built-in functions?

2014-10-22 Thread shahab
Hi,

I just wonder if SparkSQL supports Hive built-in functions (e.g.
from_unixtime) or any of the functions pointed out here:
https://cwiki.apache.org/confluence/display/Hive/Tutorial

best,

/Shahab
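For Spark 1.1, the plain SQLContext only understands a small SQL dialect, while HiveContext parses HiveQL, which includes Hive built-in UDFs such as from_unixtime. A minimal sketch, assuming the spark-hive module is on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveUdfDemo {
  case class Event(id: Int, ts: Long)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveUdfDemo").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD in Spark 1.1

    val events = sc.parallelize(Seq(Event(1, 1413208800L), Event(2, 1413295200L)))
    events.registerTempTable("events")

    // from_unixtime is a Hive built-in; HiveContext's HiveQL parser accepts it
    hiveContext.sql("SELECT id, from_unixtime(ts, 'yyyy-MM-dd') AS day FROM events")
      .collect().foreach(println)

    sc.stop()
  }
}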


What's wrong with my spark filter? I get org.apache.spark.SparkException: Task not serializable

2014-10-17 Thread shahab
Hi,

Probably I am missing a very simple principle, but something is wrong with
my filter:
I get an org.apache.spark.SparkException: Task not serializable exception.

Here is my filter function:

object OBJ {
  def f1(): Boolean = {
    var i = 1
    for (j <- 1 to 10) i = i + 1
    true
  }
}

rdd.filter(row => OBJ.f1())


And when I run, I get the following exception:

org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.filter(RDD.scala:282)
...
Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
...



best,
/Shahab
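OBJ.f1() itself is harmless; the NotSerializableException: SparkConf in the trace usually means the filter closure was defined inside a class or method whose enclosing instance (holding, among other things, a SparkConf) gets pulled into the serialized closure. A sketch of the pattern and the usual fix; the names here are illustrative, not from the original application:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Predicates {
  def f1(row: String): Boolean = row.nonEmpty  // top-level object: nothing to serialize
}

class Job(val conf: SparkConf) {      // the instance holds a SparkConf
  val threshold = 3

  // Referencing the field drags `this` (and its SparkConf) into the closure;
  // calling this method would fail with "Task not serializable".
  def bad(rdd: RDD[String]): RDD[String] =
    rdd.filter(row => row.length > threshold)

  // Copy what you need into a local val so only that value is captured.
  def good(rdd: RDD[String]): RDD[String] = {
    val t = threshold
    rdd.filter(row => row.length > t && Predicates.f1(row))
  }
}

object TaskNotSerializableDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TaskNotSerializableDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Seq("spark", "", "cassandra"))
    println(new Job(conf).good(rdd).count())
    sc.stop()
  }
}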


Is Array Of Struct supported in json RDDs? is it possible to query this?

2014-10-13 Thread shahab
Hello,

Given the following structure, is it possible to query, e.g., sessions[0].id?

In general, is it possible to query an Array of Struct in JSON RDDs?

root
 |-- createdAt: long (nullable = true)
 |-- id: string (nullable = true)
 |-- sessions: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- id: string (nullable = true)
 |    |    |-- data: string (nullable = true)


best,
/Shahab
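A sketch of the HiveContext route that the next thread also lands on (field names taken from the schema above, data values invented): HiveQL accepts ordinal access into an array followed by struct field access, so sessions[0].id can be queried directly once the JSON SchemaRDD is registered.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ArrayOfStructDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ArrayOfStructDemo").setMaster("local[2]"))
    val hive = new HiveContext(sc)

    val json = sc.parallelize(Seq(
      """{"createdAt": 1409064792512, "id": "u1", "sessions": [{"id": "s1", "data": "a"}, {"id": "s2", "data": "b"}]}"""))
    hive.jsonRDD(json).registerTempTable("users")

    // first session's id for every record
    hive.sql("SELECT id, sessions[0].id FROM users").collect().foreach(println)

    sc.stop()
  }
}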


Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread shahab
Thanks Yin. I tried HiveQL and it solved that problem. But now I have a
second query requirement:
since you are the main developer behind the JSON-Spark integration (I saw your
presentation on YouTube, "Easy JSON Data Manipulation in Spark"), is it
possible to perform aggregation-style queries,
for example counting the number of attributes (considering that attributes is
presented as an array in the schema), or any other type of aggregation?

best,
/Shahab

On Mon, Oct 13, 2014 at 4:01 PM, Yin Huai huaiyin@gmail.com wrote:

 Hi Shahab,

 Can you try to use HiveContext? Its should work in 1.1. For SQLContext,
 this issues was not fixed in 1.1 and you need to use master branch at the
 moment.

 Thanks,

 Yin

 On Sun, Oct 12, 2014 at 5:20 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

  Apparently is it is possible to query nested json using spark SQL, but ,
 mainly due to lack of proper documentation/examples, I did not manage to
 make it working. I do appreciate if you could point me to any example or
 help with this issue,

 Here is my code:

   val anotherPeopleRDD = sc.parallelize(

{

 attributes: [

 {

 data: {

 gender: woman

 },

 section: Economy,

 collectApp: web,

 id: 1409064792512

 }

 ]

 } :: Nil)

   val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)

   anotherPeople.registerTempTable(people)

val query_people = sqlContext.sql(select attributes[0].collectApp
 from people)

query_people.foreach(println)

 But instead of getting Web as print out, I am getting the following:

 [[web,[woman],1409064792512, Economy]]



 thanks,

 /shahab
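On the follow-up aggregation question, a sketch under the assumption that HiveContext's HiveQL support covers the Hive built-ins size() and LATERAL VIEW explode(): size() counts the array elements per record, and explode() flattens the array so ordinary GROUP BY aggregation can run over the struct fields.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object JsonArrayAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonArrayAggregation").setMaster("local[2]"))
    val hive = new HiveContext(sc)

    val json = sc.parallelize(Seq(
      """{"attributes": [{"data": {"gender": "woman"}, "section": "Economy", "collectApp": "web", "id": "1409064792512"}]}"""))
    hive.jsonRDD(json).registerTempTable("people")

    // number of array elements per record
    hive.sql("SELECT size(attributes) AS num_attributes FROM people")
      .collect().foreach(println)

    // flatten the array, then aggregate over the struct fields
    hive.sql(
      "SELECT attr.collectApp, COUNT(*) AS cnt " +
      "FROM people LATERAL VIEW explode(attributes) t AS attr " +
      "GROUP BY attr.collectApp")
      .collect().foreach(println)

    sc.stop()
  }
}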






Nested Query using SparkSQL 1.1.0

2014-10-12 Thread shahab
Hi,

Apparently it is possible to query nested JSON using Spark SQL, but,
mainly due to a lack of proper documentation/examples, I did not manage to
make it work. I would appreciate it if you could point me to any example or
help with this issue.

Here is my code:

  val anotherPeopleRDD = sc.parallelize(
    """{
      "attributes": [
        {
          "data": {
            "gender": "woman"
          },
          "section": "Economy",
          "collectApp": "web",
          "id": 1409064792512
        }
      ]
    }""" :: Nil)

  val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)

  anotherPeople.registerTempTable("people")

  val query_people = sqlContext.sql("select attributes[0].collectApp from people")

  query_people.foreach(println)

But instead of getting "web" as the printout, I am getting the following:

[[web,[woman],1409064792512, Economy]]



thanks,

/shahab


Spark Usecase

2014-06-04 Thread Shahab Yunus
Hello All.

I have a newbie question.

We have a use case where a huge amount of data will be coming in as streams or
micro-batches of streams, and we want to process these streams according to
some business logic. We don't have to provide extremely low latency
guarantees, but batch M/R would still be too slow.

Now the business logic is such that at the time of emitting the data, we
might have to hold on to some tuples until we get more information. This
'more' information will essentially be arriving in future streams.

You can say that this is a kind of *word count* use case where we have to
*aggregate and maintain state across batches of streams*. One thing different
here is that we might have to *maintain the state or data for a day or two*
until the rest of the data comes in, and then we can complete our output.

1- Is such a use case supported in Spark and/or Spark Streaming?
2- Will we be able to persist partially aggregated data until the rest of
the information comes in later? I mention *persistence* here because, given
that the delay can span a day or two, we won't want to keep the partial data
in memory for so long.

I know this can be done in Storm, but I am really interested in Spark
because of its close integration with Hadoop. We might not even want to use
Spark Streaming (which is more of a direct comparison with Storm/Trident),
given that our application does not have to be real-time to the split second.

Feel free to direct me to any document or resource.

Thanks a lot.

Regards,
Shahab
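On question 1, Spark Streaming does support this pattern: updateStateByKey keeps per-key state across micro-batches, and enabling checkpointing persists that state rather than keeping it purely in memory. A minimal word-count-style sketch follows (the socket source and checkpoint path are placeholders); whether holding state for a day or two this way is practical depends on key cardinality, and an external store such as HBase or Cassandra is the usual alternative for very long-lived partial aggregates.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair-DStream functions on Spark 1.x

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")  // placeholder path; required by updateStateByKey

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1L))

    // merge this batch's counts into the running per-key state
    val totals = pairs.updateStateByKey[Long] { (newCounts: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newCounts.sum)
    }

    totals.print()
    ssc.start()
    ssc.awaitTermination()
  }
}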