Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
Hi Koert,

As far as I can see, you are using Derby:

 Using direct SQL, underlying DB is DERBY

not MySQL, which is what your metastore uses. That means Spark couldn't find
hive-site.xml on your classpath. Can you check that, please?
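
A quick way to check, as a sketch (it assumes an existing SparkContext sc, e.g. in
spark-shell): list what the HiveContext actually sees. If hive-site.xml is on the
classpath you should see your real warehouse databases and tables; an almost empty
listing usually means Spark started a fresh local Derby metastore instead.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)      // sc: the existing SparkContext
hiveContext.sql("SHOW DATABASES").show()   // should list your real databases, not just "default"
hiveContext.sql("SHOW TABLES").show()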

Thanks, Alex.

On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers  wrote:

> has anyone successfully connected to hive metastore using spark 1.6.0? i
> am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
> launching spark with yarn.
>
> this is what i see in logs:
> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore with
> URI thrift://metastore.mycompany.com:9083
> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>
> and then a little later:
>
> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
> version 1.2.1
> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
> 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
> called
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
> datanucleus.cache.level2 unknown - will be ignored
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
> present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
> present in CLASSPATH (or one of dependencies)
> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
> hive.server2.enable.impersonation does not exist
> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object pin
> classes with
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
> underlying DB is DERBY
> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
>   at
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
>   at
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
>   at
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
>   at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:97)
>   at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.spark.repl.Main$.createSQLContext(Main.scala:89)
>   ... 47 elided
> Caused by: java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at
> 

spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Alexandr Dzhagriev
Hello all,

I looked through the Cassandra Spark integration (
https://github.com/datastax/spark-cassandra-connector) and couldn't find
any usages of the BulkOutputWriter (
http://www.datastax.com/dev/blog/bulk-loading) - an awesome tool for
creating local SSTables, which can later be uploaded to a Cassandra
cluster. It seems (sorry if I'm wrong) that it just uses bulk insert
statements. So my question is: does anybody know if there are any plans to
support bulk loading?
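
For reference, a rough sketch (outside Spark; keyspace, table and path are purely
illustrative) of how local SSTables can be produced with Cassandra's CQLSSTableWriter
and then streamed to the cluster with sstableloader:

import org.apache.cassandra.io.sstable.CQLSSTableWriter

val writer = CQLSSTableWriter.builder()
  .inDirectory("/tmp/sstables/ks/events")   // local output directory (illustrative)
  .forTable("CREATE TABLE ks.events (id int PRIMARY KEY, payload text)")
  .using("INSERT INTO ks.events (id, payload) VALUES (?, ?)")
  .build()

writer.addRow(Int.box(1), "hello")          // addRow takes java.lang.Object varargs
writer.close()
// afterwards the generated sstables can be uploaded with sstableloader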

Cheers, Alex.


Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
I'm using Spark 1.6.0 and Hive 1.2.1, and there is just one property in my
hive-site.xml: hive.metastore.uris. That works for me. Can you check in the
logs that, when the HiveContext is created, it connects to the correct URI
and doesn't use Derby?
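
For illustration only, the same single property can also be supplied programmatically;
this is a sketch, not a replacement for hive-site.xml, and it assumes the property
takes effect before the first metastore access. The URI is just the one from the logs
in this thread.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://metastore.mycompany.com:9083")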

Cheers, Alex.

On Tue, Feb 9, 2016 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:

> hey thanks. hive-site is on classpath in conf directory
>
> i currently got it to work by changing this hive setting in hive-site.xml:
> hive.metastore.schema.verification=true
> to
> hive.metastore.schema.verification=false
>
> this feels like a hack, because schema verification is a good thing i
> would assume?
>
> On Tue, Feb 9, 2016 at 3:25 PM, Alexandr Dzhagriev <dzh...@gmail.com>
> wrote:
>
>> Hi Koert,
>>
>> As far as I can see you are using derby:
>>
>>  Using direct SQL, underlying DB is DERBY
>>
>> not mysql, which is used for the metastore. That means, spark couldn't
>> find hive-site.xml on your classpath. Can you check that, please?
>>
>> Thanks, Alex.
>>
>> On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> has anyone successfully connected to hive metastore using spark 1.6.0? i
>>> am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and
>>> launching spark with yarn.
>>>
>>> this is what i see in logs:
>>> 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore
>>> with URI thrift://metastore.mycompany.com:9083
>>> 16/02/09 14:49:12 INFO hive.metastore: Connected to metastore.
>>>
>>> and then a little later:
>>>
>>> 16/02/09 14:49:34 INFO hive.HiveContext: Initializing execution hive,
>>> version 1.2.1
>>> 16/02/09 14:49:34 INFO client.ClientWrapper: Inspected Hadoop version:
>>> 2.6.0-cdh5.4.4
>>> 16/02/09 14:49:34 INFO client.ClientWrapper: Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.4.4
>>> 16/02/09 14:49:34 WARN conf.HiveConf: HiveConf of name
>>> hive.server2.enable.impersonation does not exist
>>> 16/02/09 14:49:35 INFO metastore.HiveMetaStore: 0: Opening raw store
>>> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>>> 16/02/09 14:49:35 INFO metastore.ObjectStore: ObjectStore, initialize
>>> called
>>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>>> 16/02/09 14:49:35 INFO DataNucleus.Persistence: Property
>>> datanucleus.cache.level2 unknown - will be ignored
>>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 16/02/09 14:49:35 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 16/02/09 14:49:37 WARN conf.HiveConf: HiveConf of name
>>> hive.server2.enable.impersonation does not exist
>>> 16/02/09 14:49:37 INFO metastore.ObjectStore: Setting MetaStore object
>>> pin classes with
>>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:38 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 16/02/09 14:49:40 INFO metastore.MetaStoreDirectSql: Using direct SQL,
>>> underlying DB is DERBY
>>> 16/02/09 14:49:40 INFO metastore.ObjectStore: Initialized ObjectStore
>>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>>> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>>>   at
>>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>>>   at
>>> org.apache.spark.sq

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-02-03 Thread Alexandr Dzhagriev
Hi Sebastian,

Do you have any updates on the issue? I ran into pretty much the same problem,
and disabling Kryo plus raising spark.network.timeout to 600s helped.
For my job it takes about 5 minutes to broadcast the variable (~5 GB in
my case), but after that it's fast, much faster than shuffling with a
regular join anyway. Hope it helps.


Thanks, Alex.


Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Hi,

That's another thing: the RecordExample case class should be defined outside,
at the top level. I ran it with spark-submit.
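
What "outside" means here, as a sketch mirroring the example from this thread: the
case class sits at the top level, not nested inside the object that builds the
Dataset, so an encoder can be generated for it.

case class RecordExample(a: Int, b: String)   // top level, not nested

object ArrayExample {
  def main(args: Array[String]): Unit = {
    // ... create the contexts, import sqlContext.implicits._ and build the Dataset as in the original example ...
  }
}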

Thanks, Alex.

On Mon, Feb 1, 2016 at 6:41 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Running your sample in spark-shell built in master branch, I got:
>
> scala> val dataset = sc.parallelize(Seq(RecordExample(1, "apple"),
> RecordExample(2, "orange"))).toDS()
> org.apache.spark.sql.AnalysisException: Unable to generate an encoder for
> inner class `RecordExample` without access to the scope that this class was
> defined in. Try moving this class out of its parent class.;
>   at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$3.applyOrElse(ExpressionEncoder.scala:316)
>   at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$3.applyOrElse(ExpressionEncoder.scala:312)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>   at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:251)
>   at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:312)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:80)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:91)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:488)
>   at
> org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:71)
>   ... 53 elided
>
> On Mon, Feb 1, 2016 at 9:09 AM, Alexandr Dzhagriev <dzh...@gmail.com>
> wrote:
>
>> Hello again,
>>
>> Also I've tried the following snippet with concat_ws:
>>
>> val dataset = sc.parallelize(Seq(
>>   RecordExample(1, "apple"),
>>   RecordExample(1, "banana"),
>>   RecordExample(2, "orange"))
>> ).toDS().groupBy($"a").agg(concat_ws(",", $"b").as[String])
>>
>> dataset.take(10).foreach(println)
>>
>>
>> which also fails
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> expression 'b' is neither present in the group by, nor is it an aggregate
>> function. Add to group by or wrap in first() (or first_value) if you don't
>> care which value you get.;
>> at
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>> at
>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$
>> 1.org
>> $apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:130)
>>
>> Thanks, Alex.
>>
>> On Mon, Feb 1, 2016 at 6:03 PM, Alexandr Dzhagriev <dzh...@gmail.com>
>> wrote:
>>
>>> Hi Ted,
>>>
>>> That doesn't help neither as one method delegates to another as far as I
>>> can see:
>>>
>>> def collect_list(columnName: String): Column = 
>>> collect_list(Column(columnName))
>>>
>>>
>>> Thanks, Alex
>>>
>>> On Mon, Feb 1, 2016 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> bq. agg(collect_list("b")
>>>>
>>>> Have you tried:
>>>>
>>>> agg(collect_list($"b")
>>>>
>>>> On Mon, Feb 1, 2016 at 8:50 AM, Alexandr Dzhagriev <dzh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm trying to run the following example code:
>>>>>
>>>>> import org.apache.spark.sql.hive.HiveContext
>>>>> import org.apache.spark.{SparkContext, SparkConf}
>>>>> import org.apache.spark.sql.functions._
>>>>>
>>>>>
>>>>> case class RecordExample(a: Int, b: String)
>>>>>
>>>>> object ArrayExample {
>>>>>   def main(args: Array[String]) {
>>>>> val conf = new SparkConf()
>>>>>
>>>>> val sc = new SparkContext(conf)
>>>>> val sqlContext = new HiveContext(sc)
>>>>>
>>>>> import sqlContext.implicits._
>>>>>
>>>>> val dataset = sc.parallelize(Seq(RecordExample(1, &qu

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Hi Ted,

That doesn't help either, as one method just delegates to the other, as far as I
can see:

def collect_list(columnName: String): Column = collect_list(Column(columnName))


Thanks, Alex

On Mon, Feb 1, 2016 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. agg(collect_list("b")
>
> Have you tried:
>
> agg(collect_list($"b")
>
> On Mon, Feb 1, 2016 at 8:50 AM, Alexandr Dzhagriev <dzh...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I'm trying to run the following example code:
>>
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.{SparkContext, SparkConf}
>> import org.apache.spark.sql.functions._
>>
>>
>> case class RecordExample(a: Int, b: String)
>>
>> object ArrayExample {
>>   def main(args: Array[String]) {
>> val conf = new SparkConf()
>>
>> val sc = new SparkContext(conf)
>> val sqlContext = new HiveContext(sc)
>>
>> import sqlContext.implicits._
>>
>> val dataset = sc.parallelize(Seq(RecordExample(1, "apple"), 
>> RecordExample(2, "orange"))).toDS()
>>
>> dataset.groupBy($"a").agg(collect_list("b").as[List[String]])
>>
>> dataset.collect()
>>
>>   }
>>
>> }
>>
>>
>> and it fails with the following (please see the whole stack trace below):
>>
>>  Exception in thread "main" java.lang.ClassCastException:
>> org.apache.spark.sql.types.ArrayType cannot be cast to
>> org.apache.spark.sql.types.StructType
>>
>>
>> Could please someone point me to the proper way to do that or confirm
>> it's a bug?
>>
>> Thank you and here is the whole stacktrace:
>>
>> Exception in thread "main" java.lang.ClassCastException:
>> org.apache.spark.sql.types.ArrayType cannot be cast to
>> org.apache.spark.sql.types.StructType
>> at org.apache.spark.sql.catalyst.expressions.GetStructField.org
>> $apache$spark$sql$catalyst$expressions$GetStructField$$field$lzycompute(complexTypeExtractors.scala:107)
>> at org.apache.spark.sql.catalyst.expressions.GetStructField.org
>> $apache$spark$sql$catalyst$expressions$GetStructField$$field(complexTypeExtractors.scala:107)
>> at
>> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:109)
>> at
>> org.apache.spark.sql.catalyst.analysis.ResolveUpCast$$anonfun$apply$24.applyOrElse(Analyzer.scala:1214)
>> at
>> org.apache.spark.sql.catalyst.analysis.ResolveUpCast$$anonfun$apply$24.applyOrElse(Analyzer.scala:1211)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
>> at
>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>> at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:248)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(T

Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Hello,

I'm trying to run the following example code:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions._


case class RecordExample(a: Int, b: String)

object ArrayExample {
  def main(args: Array[String]) {
    val conf = new SparkConf()

    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    import sqlContext.implicits._

    val dataset = sc.parallelize(Seq(RecordExample(1, "apple"),
      RecordExample(2, "orange"))).toDS()

    dataset.groupBy($"a").agg(collect_list("b").as[List[String]])

    dataset.collect()
  }
}


and it fails with the following (please see the whole stack trace below):

 Exception in thread "main" java.lang.ClassCastException:
org.apache.spark.sql.types.ArrayType cannot be cast to
org.apache.spark.sql.types.StructType


Could someone please point me to the proper way to do that, or confirm it's
a bug?
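
For reference, one possible workaround (a sketch, not a confirmed fix) is to keep
the result as a DataFrame and read the collected array from each Row instead of
casting the aggregate column:

val grouped = dataset.toDF()
  .groupBy("a")
  .agg(collect_list("b").as("bs"))

grouped.collect().foreach { row =>
  val a  = row.getInt(0)
  val bs = row.getSeq[String](1)   // collect_list yields an array<string> column
  println(s"$a -> ${bs.mkString(",")}")
}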

Thank you and here is the whole stacktrace:

Exception in thread "main" java.lang.ClassCastException:
org.apache.spark.sql.types.ArrayType cannot be cast to
org.apache.spark.sql.types.StructType
at org.apache.spark.sql.catalyst.expressions.GetStructField.org
$apache$spark$sql$catalyst$expressions$GetStructField$$field$lzycompute(complexTypeExtractors.scala:107)
at org.apache.spark.sql.catalyst.expressions.GetStructField.org
$apache$spark$sql$catalyst$expressions$GetStructField$$field(complexTypeExtractors.scala:107)
at
org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:109)
at
org.apache.spark.sql.catalyst.analysis.ResolveUpCast$$anonfun$apply$24.applyOrElse(Analyzer.scala:1214)
at
org.apache.spark.sql.catalyst.analysis.ResolveUpCast$$anonfun$apply$24.applyOrElse(Analyzer.scala:1211)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
at

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Hello again,

Also I've tried the following snippet with concat_ws:

val dataset = sc.parallelize(Seq(
  RecordExample(1, "apple"),
  RecordExample(1, "banana"),
  RecordExample(2, "orange"))
).toDS().groupBy($"a").agg(concat_ws(",", $"b").as[String])

dataset.take(10).foreach(println)


which also fails

Exception in thread "main" org.apache.spark.sql.AnalysisException:
expression 'b' is neither present in the group by, nor is it an aggregate
function. Add to group by or wrap in first() (or first_value) if you don't
care which value you get.;
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$
1.org
$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:130)
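
One way to satisfy the analyzer (a sketch, untested on 1.6; it assumes concat_ws
accepts the array<string> column produced by collect_list) is to wrap b in an
aggregate first:

val joined = sc.parallelize(Seq(
    RecordExample(1, "apple"),
    RecordExample(1, "banana"),
    RecordExample(2, "orange")))
  .toDF()
  .groupBy($"a")
  .agg(concat_ws(",", collect_list($"b")).as("joined_b"))

joined.show()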

Thanks, Alex.

On Mon, Feb 1, 2016 at 6:03 PM, Alexandr Dzhagriev <dzh...@gmail.com> wrote:

> Hi Ted,
>
> That doesn't help neither as one method delegates to another as far as I
> can see:
>
> def collect_list(columnName: String): Column = 
> collect_list(Column(columnName))
>
>
> Thanks, Alex
>
> On Mon, Feb 1, 2016 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. agg(collect_list("b")
>>
>> Have you tried:
>>
>> agg(collect_list($"b")
>>
>> On Mon, Feb 1, 2016 at 8:50 AM, Alexandr Dzhagriev <dzh...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to run the following example code:
>>>
>>> import org.apache.spark.sql.hive.HiveContext
>>> import org.apache.spark.{SparkContext, SparkConf}
>>> import org.apache.spark.sql.functions._
>>>
>>>
>>> case class RecordExample(a: Int, b: String)
>>>
>>> object ArrayExample {
>>>   def main(args: Array[String]) {
>>> val conf = new SparkConf()
>>>
>>> val sc = new SparkContext(conf)
>>> val sqlContext = new HiveContext(sc)
>>>
>>> import sqlContext.implicits._
>>>
>>> val dataset = sc.parallelize(Seq(RecordExample(1, "apple"), 
>>> RecordExample(2, "orange"))).toDS()
>>>
>>> dataset.groupBy($"a").agg(collect_list("b").as[List[String]])
>>>
>>> dataset.collect()
>>>
>>>   }
>>>
>>> }
>>>
>>>
>>> and it fails with the following (please see the whole stack trace below):
>>>
>>>  Exception in thread "main" java.lang.ClassCastException:
>>> org.apache.spark.sql.types.ArrayType cannot be cast to
>>> org.apache.spark.sql.types.StructType
>>>
>>>
>>> Could please someone point me to the proper way to do that or confirm
>>> it's a bug?
>>>
>>> Thank you and here is the whole stacktrace:
>>>
>>> Exception in thread "main" java.lang.ClassCastException:
>>> org.apache.spark.sql.types.ArrayType cannot be cast to
>>> org.apache.spark.sql.types.StructType
>>> at org.apache.spark.sql.catalyst.expressions.GetStructField.org
>>> $apache$spark$sql$catalyst$expressions$GetStructField$$field$lzycompute(complexTypeExtractors.scala:107)
>>> at org.apache.spark.sql.catalyst.expressions.GetStructField.org
>>> $apache$spark$sql$catalyst$expressions$GetStructField$$field(complexTypeExtractors.scala:107)
>>> at
>>> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:109)
>>> at
>>> org.apache.spark.sql.catalyst.analysis.ResolveUpCast$$anonfun$apply$24.applyOrElse(Analyzer.scala:1214)
>>> at
>>> org.apache.spark.sql.catalyst.analysis.ResolveUpCast$$anonfun$apply$24.applyOrElse(Analyzer.scala:1211)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
>>> at
>>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
>>> at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:248)
>>> at
>>> org.apache.spark.sql

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Alexandr Dzhagriev
Good to know, thanks.

On Mon, Feb 1, 2016 at 6:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Got around the previous error by adding:
>
> scala> implicit val kryoEncoder = Encoders.kryo[RecordExample]
> kryoEncoder: org.apache.spark.sql.Encoder[RecordExample] = class[value[0]:
> binary]
>
> On Mon, Feb 1, 2016 at 9:55 AM, Alexandr Dzhagriev <dzh...@gmail.com>
> wrote:
>
>> Hi,
>>
>> That's another thing: that the Record case class should be outside. I ran
>> it as spark-submit.
>>
>> Thanks, Alex.
>>
>> On Mon, Feb 1, 2016 at 6:41 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Running your sample in spark-shell built in master branch, I got:
>>>
>>> scala> val dataset = sc.parallelize(Seq(RecordExample(1, "apple"),
>>> RecordExample(2, "orange"))).toDS()
>>> org.apache.spark.sql.AnalysisException: Unable to generate an encoder
>>> for inner class `RecordExample` without access to the scope that this class
>>> was defined in. Try moving this class out of its parent class.;
>>>   at
>>> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$3.applyOrElse(ExpressionEncoder.scala:316)
>>>   at
>>> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$3.applyOrElse(ExpressionEncoder.scala:312)
>>>   at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>>>   at
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262)
>>>   at
>>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>>>   at
>>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>>>   at
>>> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:251)
>>>   at
>>> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:312)
>>>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:80)
>>>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:91)
>>>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:488)
>>>   at
>>> org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:71)
>>>   ... 53 elided
>>>
>>> On Mon, Feb 1, 2016 at 9:09 AM, Alexandr Dzhagriev <dzh...@gmail.com>
>>> wrote:
>>>
>>>> Hello again,
>>>>
>>>> Also I've tried the following snippet with concat_ws:
>>>>
>>>> val dataset = sc.parallelize(Seq(
>>>>   RecordExample(1, "apple"),
>>>>   RecordExample(1, "banana"),
>>>>   RecordExample(2, "orange"))
>>>> ).toDS().groupBy($"a").agg(concat_ws(",", $"b").as[String])
>>>>
>>>> dataset.take(10).foreach(println)
>>>>
>>>>
>>>> which also fails
>>>>
>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>> expression 'b' is neither present in the group by, nor is it an aggregate
>>>> function. Add to group by or wrap in first() (or first_value) if you don't
>>>> care which value you get.;
>>>> at
>>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>>>> at
>>>> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>>>> at
>>>> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$
>>>> 1.org
>>>> $apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:130)
>>>>
>>>> Thanks, Alex.
>>>>
>>>> On Mon, Feb 1, 2016 at 6:03 PM, Alexandr Dzhagriev <dzh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> That doesn't help neither as one method delegates to another as far as
>>>>> I can see:
>>>>>
>>>>> def collect_list(columnName: String): Column = 
>>>>> collect_list(Column(columnName))
>>>>>
>>>>>
>>>>> Thanks, Alex
>>>>>
>>>>> On Mon, Feb 1, 2016 at 5:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> bq. agg(collect_list("b")
>>>>>>
>>>>>> Have you tried:
>>>>>

Re: Stream S3 server to Cassandra

2016-01-28 Thread Alexandr Dzhagriev
Hello Sateesh,

I think you can use a file stream, e.g.

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

to create a stream and then process the RDDs as you are doing now.
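
A rough end-to-end sketch of that idea (the bucket, keyspace, table, schema and XML
parsing below are all placeholders, not a drop-in solution):

import com.datastax.spark.connector._
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Event(id: String, payload: String)   // placeholder schema, must match the target table's columns

// placeholder parser; real XML handling (whole-file XML) may need a different InputFormat
def parseXmlRecord(line: String): Event = Event(line.hashCode.toString, line)

val sc  = new SparkContext(new SparkConf())      // assumes spark.cassandra.connection.host is set
val ssc = new StreamingContext(sc, Seconds(900)) // e.g. a 15-minute batch interval

val files = ssc.fileStream[LongWritable, Text, TextInputFormat]("s3n://my-bucket/incoming/")

files.map(_._2.toString)   // file contents, line by line
  .map(parseXmlRecord)
  .foreachRDD(_.saveToCassandra("my_keyspace", "my_table"))

ssc.start()
ssc.awaitTermination()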

Thanks, Alex.


On Thu, Jan 28, 2016 at 10:56 AM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Hello, anyone... please help me with how to stream XML files from an S3
> server to a Cassandra DB using Spark Streaming in Java. Presently I am using
> Spark core to do that job, but the problem is that I have to run it every 15
> minutes; that's why I am looking at Spark Streaming.
>
>