[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296964#comment-15296964
 ] 

Michael Armbrust commented on SPARK-15489:
--

Also, does this problem exist in the 2.0 preview?

> Dataset kryo encoder fails on Collections$UnmodifiableCollection
> 
>
> Key: SPARK-15489
> URL: https://issues.apache.org/jira/browse/SPARK-15489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Amit Sela
>
> When using Encoders with kryo to encode generically typed Objects in the 
> following manner:
> public static <T> Encoder<T> encoder() {
>   return Encoders.kryo((Class<T>) Object.class);
> }
> I get a decoding exception when trying to decode 
> `java.util.Collections$UnmodifiableCollection`, which probably comes from 
> Guava's `ImmutableList`.
> This happens when running with master = local[1]. Same code had no problems 
> with RDD api.






[jira] [Commented] (SPARK-15489) Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296963#comment-15296963
 ] 

Michael Armbrust commented on SPARK-15489:
--

Is your registration making it into the instance of Kryo that we use for encoders?
It's possible we aren't propagating it correctly.

> Dataset kryo encoder fails on Collections$UnmodifiableCollection
> 
>
> Key: SPARK-15489
> URL: https://issues.apache.org/jira/browse/SPARK-15489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Amit Sela
>
> When using Encoders with kryo to encode generically typed Objects in the 
> following manner:
> public static <T> Encoder<T> encoder() {
>   return Encoders.kryo((Class<T>) Object.class);
> }
> I get a decoding exception when trying to decode 
> `java.util.Collections$UnmodifiableCollection`, which probably comes from 
> Guava's `ImmutableList`.
> This happens when running with master = local[1]. Same code had no problems 
> with RDD api.






Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust
Can you open a JIRA?

On Sun, May 22, 2016 at 2:50 PM, Amit Sela  wrote:

> I've been using Encoders with Kryo to support encoding of generically
> typed Java classes, mostly with success, in the following manner:
>
> public static <T> Encoder<T> encoder() {
>   return Encoders.kryo((Class<T>) Object.class);
> }
>
> But at some point I got a decoding exception "Caused by:
> java.lang.UnsupportedOperationException
> at java.util.Collections$UnmodifiableCollection.add..."
>
> This seems to be because of Guava's `ImmutableList`.
>
> I tried registering `UnmodifiableCollectionsSerializer` and `
> ImmutableListSerializer` from: https://github.com/magro/kryo-serializers
> but it didn't help.
>
> Ideas ?
>
> Thanks,
> Amit
>
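For reference, a minimal sketch of the kind of registration being attempted here, assuming the de.javakaffee:kryo-serializers artifact from the linked repository is on the classpath; whether a registration like this actually reaches the Kryo instance used inside the encoders is exactly the open question above.

import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.UnmodifiableCollectionsSerializer
import de.javakaffee.kryoserializers.guava.ImmutableListSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: registers the serializers mentioned above with Kryo.
class CollectionsRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    UnmodifiableCollectionsSerializer.registerSerializers(kryo)
    ImmutableListSerializer.registerSerializers(kryo)
  }
}

// Wire it up through the Spark configuration.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[CollectionsRegistrator].getName)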


Re: Dataset API and avro type

2016-05-23 Thread Michael Armbrust
If you are using the kryo encoder, you can only use it to map to/from
kryo-encoded binary data.  This is because Spark does not understand kryo's
encoding; it's just using it as an opaque blob of bytes.
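For illustration, a minimal sketch of what this means in practice, using a hypothetical Auction case class as a stand-in for the Avro type: with Encoders.kryo the Dataset's schema is a single binary column, so you can map into and out of the objects but you cannot select their fields by name.

import org.apache.spark.sql.{Encoders, SparkSession}

case class Auction(auctionId: String, ts: Long)  // stand-in for the Avro class

val spark = SparkSession.builder().master("local[2]").appName("kryo-enc").getOrCreate()
implicit val enc = Encoders.kryo[Auction]

val ds = spark.createDataset(Seq(Auction("a", 1L), Auction("b", 2L)))
ds.printSchema()                                 // a single "value: binary" column
val ids = ds.map(_.auctionId)(Encoders.STRING)   // fine: operates on the decoded objects
// ds.select("auctionId") would fail: Spark only sees the opaque kryo bytes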

On Mon, May 23, 2016 at 1:28 AM, Han JU  wrote:

> Just one more question: does Dataset suppose to be able to cast data to an
> avro type? For a very simple format (a string and a long), I can cast it to
> a tuple or case class, but not an avro type (also contains only a string
> and a long).
>
> The error is like this for this very simple type:
>
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0,
> string])) null else input[0, string].toString, if (isnull(input[1,
> bigint])) null else input[1, bigint],
> StructField(auctionId,StringType,true), StructField(ts,LongType,true)),
> auctionId#0, ts#1L) AS #2]   Project [createexternalrow(if
> (isnull(auctionId#0)) null else auctionId#0.toString, if (isnull(ts#1L))
> null else ts#1L, StructField(auctionId,StringType,true),
> StructField(ts,LongType,true)) AS #2]
>  +- LocalRelation [auctionId#0,ts#1L]
>
> +- LocalRelation [auctionId#0,ts#1L]
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to
> map struct to Tuple1, but failed as the number
> of fields does not line up.
>  - Input schema: struct
>  - Target schema: struct;
> at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org
> $apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:267)
> at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:281)
> at org.apache.spark.sql.Dataset.(Dataset.scala:201)
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
> at org.apache.spark.sql.Dataset.as(Dataset.scala:366)
> at Datasets$.delayedEndpoint$Datasets$1(Datasets.scala:35)
> at Datasets$delayedInit$body.apply(Datasets.scala:23)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.App$class.main(App.scala:76)
> at Datasets$.main(Datasets.scala:23)
> at Datasets.main(Datasets.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>
> 2016-05-22 22:02 GMT+02:00 Michael Armbrust :
>
>> That's definitely a bug.  If you can come up with a small reproduction it
>> would be great if you could open a JIRA.
>> On May 22, 2016 12:21 PM, "Han JU"  wrote:
>>
>>> Hi Michael,
>>>
>>> The error is like this under 2.0.0-preview. In 1.6.1 the error is very
>>> similar if not exactly the same.
>>> The file is a parquet file containing avro objects.
>>>
>>> Thanks!
>>>
>>> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception:
>>> failed to compile: org.codehaus.commons.compiler.CompileException: File
>>> 'generated.java', Line 25, Column 160: No applicable constructor/method
>>> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
>>> candidates are: "public static java.nio.ByteBuffer
>>> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
>>> java.nio.ByteBuffer.wrap(byte[], int, int)"
>>> /* 001 */
>>> /* 002 */ public java.lang.Object generate(Object[] references) {
>>> /* 003 */   return new SpecificSafeProjection(references);
>>> /* 004 */ }
>>> /* 005 */
>>> /* 006 */ class SpecificSafeProjection extends
>>> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>>> /* 007 */
>>> /* 008 */   private Object[] references;
>>> /* 009 */   private MutableRow mutableRow;
>>> /* 010 */   private org.apache.spark.serializer.KryoSerializerInstance
>>> serializer;
>>> /* 011 */
>>> /* 012 */
>>> /* 013 */   public SpecificSafeProjection(Object[] references) {
>>> /* 014 */ this.

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Michael Armbrust
We did turn on Travis a few years ago, but ended up turning it off because
it was failing (I believe because of insufficient resources), which was
confusing for developers.  I wouldn't be opposed to turning it on if it
provides more/faster signal, but it's not obvious to me that it would.  In
particular, do we know whether, given the rate at which PRs are created, we
will hit rate limits?

Really, my main feedback is: if the Java linter is important, we should
probably have it as part of the canonical build process.  I worry about
having more than one set of CI infrastructure to maintain.

On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun  wrote:

> Thank you, Steve and Hyukjin.
>
> And, don't worry, Ted.
>
> Travis launches new VMs for every PR.
>
> Apache Spark repository uses the following setting.
>
> VM: Google Compute Engine
> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
> CPU: ~2 CORE
> RAM: 7.5GB
>
> FYI, you can find more information about this here.
>
> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>
> Dongjoon.
>
>
>
> On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:
>
>> Do you know if more than one PR would be verified on the same machine ?
>>
>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>> have conflict.
>>
>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the feedback. Sure, correct, that's the reason why the
>>> current SparkPullRequestBuilder does not run `lint-java`. :-)
>>>
>>> In addition, that's the same reason why contributors are reluctant to
>>> run `lint-java`, which causes breakage on JDK7 builds.
>>>
>>> Such a tedious and time-consuming job should be done by CI without human
>>> intervention.
>>>
>>> By the way, why do you think we need to wait for that? We should not
>>> wait for any CI; we should continue our own work.
>>>
>>> My proposal isn't about making you wait to watch the result. There are two
>>> use cases I want us to focus on here.
>>>
>>> Case 1: When you make a PR to Spark PR queue.
>>>
>>> Travis CI will finish before SparkPullRequestBuilder.
>>> We will run the followings in parallel mode.
>>>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
>>> Java Linter)
>>>  2. Travis: JDK7 + mvn build + Java Linter
>>>  3. Travis: JDK8 + mvn build + Java Linter
>>>  As we know, 1 is the most time-consuming one, which does lots of work
>>> (except Maven building or lint-java). You don't need to wait longer in
>>> many cases. Yes, in many cases, not all cases.
>>>
>>>
>>> Case 2: When you prepare a PR on your branch.
>>>
>>> If you are at the final commit (maybe already-squashed), just go to
>>> case 1.
>>>
>>> However, usually we make lots of commits locally while preparing our PR.
>>> And finally, we squash them into one and send a PR to Spark.
>>> I mean you can use Travis CI while preparing your PRs.
>>> Again, don't wait for Travis CI. Just push it sometime or at every
>>> commit, and continue your work.
>>>
>>> At the final stage, when you finish your coding, squash your commits
>>> into one,
>>> amend your commit title or message, and check the Travis CI result.
>>> Or, you can monitor Travis CI result on status menu bar.
>>> If it shows green icon, you have nothing to do.
>>>
>>>https://docs.travis-ci.com/user/apps/
>>>
>>> To sum up, I think we don't need to wait for any CIs. It's like an
>>> email. `Send and back to work.`
>>>
>>> Dongjoon.
>>>
>>>
>>> On Sun, May 22, 2016 at 8:32 PM, Ted Yu  wrote:
>>>
 Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.

 Maybe not everyone is willing to wait that long.

 On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun 
 wrote:

> Oh, Sure. My bad!
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>
> Thank you, Ted.
>
> Dongjoon.
>
> On Sun, May 22, 2016 at 1:29 PM, Ted Yu  wrote:
>
>> The following line was repeated twice:
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>
>> Did you intend to cover JDK 8 ?
>>
>> Cheers
>>
>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> I want to propose the following.
>>>
>>> - Turn on Travis CI for Apache Spark PR queue.
>>> - Recommend this for contributors, too
>>>
>>> Currently, Spark provides a Travis CI configuration file to help
>>> contributors easily check Scala/Java style conformance and JDK7/8
>>> compilation while preparing their pull requests. Please note that it's only
>>> about static analysis.
>>>
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>> Scalastyle is included in the ste

[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null

2016-05-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296805#comment-15296805
 ] 

Michael Armbrust commented on SPARK-15140:
--

I don't think you should ever get a null row back.  So I think all the fields 
being null makes sense.

> ensure input object of encoder is not null
> --
>
> Key: SPARK-15140
> URL: https://issues.apache.org/jira/browse/SPARK-15140
> Project: Spark
>  Issue Type: Improvement
>Reporter: Wenchen Fan
>
> Currently we assume the input object for the encoder won't be null, but we don't 
> check it. For example, in 1.6 `Seq("a", null).toDS.collect` will throw an NPE, 
> while in 2.0 this will return Array("a", null).
> We should define this behaviour more clearly.






[jira] [Resolved] (SPARK-15471) ScalaReflection cleanup

2016-05-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15471.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13250
[https://github.com/apache/spark/pull/13250]

> ScalaReflection cleanup
> ---
>
> Key: SPARK-15471
> URL: https://issues.apache.org/jira/browse/SPARK-15471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







Re: Dataset API and avro type

2016-05-20 Thread Michael Armbrust
What is the error?  I would definitely expect it to work with kryo at least.

On Fri, May 20, 2016 at 2:37 AM, Han JU  wrote:

> Hello,
>
> I'm looking at the Dataset API in 1.6.1 and also in upcoming 2.0. However
> it does not seems to work with Avro data types:
>
>
> object Datasets extends App {
>   val conf = new SparkConf()
>   conf.setAppName("Dataset")
>   conf.setMaster("local[2]")
>   conf.setIfMissing("spark.serializer", classOf[KryoSerializer].getName)
>   conf.setIfMissing("spark.kryo.registrator",
> classOf[DatasetKryoRegistrator].getName)
>
>   val sc = new SparkContext(conf)
>   val sql = new SQLContext(sc)
>   import sql.implicits._
>
>   implicit val encoder = Encoders.kryo[MyAvroType]
>   val data = sql.read.parquet("path/to/data").as[MyAvroType]
>
>   var c = 0
>   // BUG here
>   val sizes = data.mapPartitions { iter =>
> List(iter.size).iterator
>   }.collect().toList
>
>   println(c)
> }
>
>
> class DatasetKryoRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo) {
> kryo.register(
>   classOf[MyAvroType],
> AvroSerializer.SpecificRecordBinarySerializer[MyAvroType])
>   }
> }
>
>
> I'm using chill-avro's Kryo serializer for Avro types and I've tried
> `Encoders.kryo` as well as `bean` or `javaSerialization`, but none of them
> works. The error seems to be that the generated code does not compile with
> Janino.
>
> Tested in 1.6.1 and the 2.0.0-preview. Any idea?
>
> --
> *JU Han*
>
> Software Engineer @ Teads.tv
>
> +33 061960
>


Re: Wide Datasets (v1.6.1)

2016-05-20 Thread Michael Armbrust
>
> I can provide an example/open a Jira if there is a chance this will be
> fixed.
>

Please do!  Ping me on it.

Michael


[jira] [Resolved] (SPARK-15190) Support using SQLUserDefinedType for case classes

2016-05-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15190.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12965
[https://github.com/apache/spark/pull/12965]

> Support using SQLUserDefinedType for case classes
> -
>
> Key: SPARK-15190
> URL: https://issues.apache.org/jira/browse/SPARK-15190
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now inferring the schema for case classes happens before searching the 
> SQLUserDefinedType annotation, so the SQLUserDefinedType annotation for case 
> classes doesn't work.






[jira] [Updated] (SPARK-15308) RowEncoder should preserve nested column name.

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15308:
-
Target Version/s: 2.0.0

> RowEncoder should preserve nested column name.
> --
>
> Key: SPARK-15308
> URL: https://issues.apache.org/jira/browse/SPARK-15308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> The following code generates wrong schema:
> {code}
> val schema = new StructType().add(
>   "struct",
>   new StructType()
> .add("i", IntegerType, nullable = false)
> .add(
>   "s",
>   new StructType().add("int", IntegerType, nullable = false),
>   nullable = false),
>   nullable = false)
> val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema))
> ds.printSchema()
> {code}
> This should print as follows:
> {code}
> root
>  |-- struct: struct (nullable = false)
>  ||-- i: integer (nullable = false)
>  ||-- s: struct (nullable = false)
>  |||-- int: integer (nullable = false)
> {code}
> but the result is:
> {code}
> root
>  |-- struct: struct (nullable = false)
>  ||-- col1: integer (nullable = false)
>  ||-- col2: struct (nullable = false)
>  |||-- col1: integer (nullable = false)
> {code}






[jira] [Updated] (SPARK-15313) EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15313:
-
Target Version/s: 2.0.0

> EmbedSerializerInFilter rule should keep exprIds of output of surrounded 
> SerializeFromObject.
> -
>
> Key: SPARK-15313
> URL: https://issues.apache.org/jira/browse/SPARK-15313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> The following code:
> {code}
> val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
> ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
> {code}
> throws an Exception:
> {noformat}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _1#420
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:254)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>  at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>  at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
>  at 
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>  at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>  at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>  at 
> org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:79)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)

[jira] [Updated] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15416:
-
Assignee: Shixiong Zhu

> Display a better message for not finding classes removed in Spark 2.0
> -
>
> Key: SPARK-15416
> URL: https://issues.apache.org/jira/browse/SPARK-15416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> We removed some classes in Spark 2.0. If the user uses an incompatible 
> library, they may see a ClassNotFoundException. It's better to give an 
> instruction asking people to use a correct version.






[jira] [Resolved] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15416.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13201
[https://github.com/apache/spark/pull/13201]

> Display a better message for not finding classes removed in Spark 2.0
> -
>
> Key: SPARK-15416
> URL: https://issues.apache.org/jira/browse/SPARK-15416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
> Fix For: 2.0.0
>
>
> We removed some classes in Spark 2.0. If the user uses an incompatible 
> library, they may see a ClassNotFoundException. It's better to give an 
> instruction asking people to use a correct version.






Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Michael Armbrust
>
> 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)”
> doesn’t work because “HiveContext not a member of
> org.apache.spark.sql.hive”  I checked the documentation, and it looks like
> it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz
>

HiveContext has been deprecated and moved to a 1.x compatibility package,
which you'll need to include explicitly.  Docs have not been updated yet.
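As a hedged sketch of the 2.0-style alternative, assuming you don't need the 1.x compatibility package: build a SparkSession with Hive support enabled and use it where the old HiveContext was used (the spark-hive module still needs to be on the classpath).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-example")
  .enableHiveSupport()          // requires the spark-hive dependency
  .getOrCreate()

// Replaces hiveContext.table(...) / hiveContext.sql(...)
val df = spark.table("db.table")
spark.sql("SELECT count(*) FROM db.table").show()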


> 2. I also tried the new spark session, ‘spark.table(“db.table”)’, it fails
> with a HDFS permission denied can’t write to “/user/hive/warehouse”
>

Where are the HDFS configurations located?  We might not be propagating
them correctly any more.


[jira] [Commented] (SPARK-12141) Use Jackson to serialize all events when writing event log

2016-05-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291971#comment-15291971
 ] 

Michael Armbrust commented on SPARK-12141:
--

My issue with the catch-all case that was added is that it's not obvious to people 
creating new messages that reflection is going to be used on them and that 
there are compatibility issues in play.  As a result, the serializer was 
actually calling methods on our events that were causing side effects 
({{source.getOffset}}), which was really surprising.  One way to avoid 
surprises is to require manual serialization, but there are other things we can 
do.

I'm not strongly against automatic serialization, but I think we need some 
guidelines for its use. Straw man:
 - use separate case classes instead of internal objects
 - a limited set of types that we support (I've seen jackson do weird things 
with collections / options)

Perhaps there needs to be a trait or something that you mix in that states, "I 
expect this to be JSON serialized and I understand the compatibility rules".
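A purely hypothetical sketch of that marker-trait idea (all names invented for illustration): events opt in to reflective JSON serialization explicitly, so their authors know the compatibility rules apply.

// Marker only: implementers promise stable, side-effect-free case classes
// with Jackson-friendly field types.
trait JsonSerializableEvent

// An event that has opted in to automatic serialization.
case class QueryProgressEvent(queryName: String, batchId: Long, offset: String)
  extends JsonSerializableEvent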

> Use Jackson to serialize all events when writing event log
> --
>
> Key: SPARK-12141
> URL: https://issues.apache.org/jira/browse/SPARK-12141
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure to serialize events using Jackson, so that 
> manual serialization code is not needed anymore.
> We should write all events using that support, and remove all the manual 
> serialization code in {{JsonProtocol}}.
> Since the event log format is a semi-public API, I'm targeting this at 2.0. 
> Also, we can't remove the manual deserialization code, since we need to be 
> able to read old event logs.






Re: [Spark 2.0 state store] Streaming wordcount using spark state store

2016-05-18 Thread Michael Armbrust
The state store for structured streaming is an internal concept, and isn't
designed to be consumed by end users.  I'm hoping to write some
documentation on how to do aggregation, but support for reading from Kafka
and other sources will likely come in Spark 2.1+
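In the meantime, a hedged sketch of the intended user-facing way to express a streaming aggregation in 2.0, using the file source since Kafka is not yet supported (the input path is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-wordcount").getOrCreate()
import spark.implicits._

// Each new file dropped into the directory is treated as streaming input.
val lines = spark.readStream.text("/tmp/wordcount-input")
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // emit the full updated counts on each trigger
  .format("console")
  .start()
query.awaitTermination()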

On Wed, May 18, 2016 at 3:50 AM, Shekhar Bansal <
shekhar0...@yahoo.com.invalid> wrote:

> Hi
>
> What is the right way of using the Spark 2.0 state store feature in Spark
> Streaming?
> I referred test cases in this(
> https://github.com/apache/spark/pull/11645/files) pull request and
> implemented word count using state store.
> My source is kafka(1 topic, 10 partitions). My data pump is pushing
> numbers into random partition.
> I understand that state store maintains state per partition, so I am
> applying partitionBy before calling mapPartitionsWithStateStore.
>
> The problem I am facing is that after some time, I start getting a wrong
> running count.
> My data pump is pushing the number 1 every 5 seconds, which is the same as the
> microbatch duration. The first 20 microbatches ran fine, but in the 21st
> microbatch the state of 1 somehow got reset and I got count=1; please see the
> console output.
>
> Code of my streaming app
> val keySchema = StructType(Seq(StructField("key", StringType, true)))
> val valueSchema = StructType(Seq(StructField("value", IntegerType,
> true)))
>
> val stateStoreCheckpointPath = "/data/spark/stateStoreCheckpoints/"
> var stateStoreVersion:Long = 0
>
> val stateStoreWordCount = (store: StateStore, iter: Iterator[String])
> => {
>   val out = new ListBuffer[(String, Int)]
>   iter.foreach { s =>
> val current = store.get(stringToRow(s)).map(rowToInt).getOrElse(0)
> + 1
> store.put(stringToRow(s), intToRow(current))
> out.append((s,current))
>   }
>
>   store.commit
>   out.iterator
> }
>
> val opId = 100
> KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topicsSet)
>   .flatMap(r=>{r._2.split(" ")})
>   .foreachRDD((rdd, time) =>{
> rdd
> .map(r=>(r,null))
> .partitionBy(new HashPartitioner(20))
> .map(r=>r._1)
> .mapPartitionsWithStateStore(sqlContet, stateStoreCheckpointPath,
> opId, storeVersion = stateStoreVersion, keySchema,
> valueSchema)(stateStoreWordCount)
> .collect foreach(r=> println(time  + " - " + r))
> stateStoreVersion+=1
> println(time + " batch finished")
> }
>   )
>
> Code of my Data pump
> val list =
> List(1,2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97)
> while (true) {
>   list.foreach(r=>{
> if(count%r==0){
>   val productRecord = new ProducerRecord(Configs.topic,new
> Random().nextInt(10), "" , r.toString)
>   producer.send(productRecord)
> }
>   })
>   count+=1
>   Thread.sleep(5000);
> }
>
>
> Complete code is available here(
> https://github.com/zuxqoj/HelloWorld/tree/master/SparkStreamingStateStore/src/main/scala/spark/streaming/statestore/test
> )
>
>
> I am using spark on yarn in client mode.
> spark - spark-2.0.0-snapshot (git sha -
> 6c5768594fe8b910125f06e1308a8154a199447e)  - May 13, 2016
> scala - 2.10.2
> java - 1.8
> hadoop - 2.7.1
> kafka - 0.8.2.1
>
> Spark config:
> spark.executor.cores=12
> spark.executor.instances=6
>
>
> Console Output
> 146356664 ms - (1,1)
> 146356664 ms batch finished
> 1463566645000 ms - (1,2)
> 1463566645000 ms - (2,1)
> 1463566645000 ms batch finished
> 146356665 ms - (1,3)
> 146356665 ms - (3,1)
> 146356665 ms batch finished
> 1463566655000 ms - (1,4)
> 1463566655000 ms - (2,2)
> 1463566655000 ms batch finished
> 14635 ms - (1,5)
> 14635 ms - (5,1)
> 14635 ms batch finished
> 146355000 ms - (1,6)
> 146355000 ms - (2,3)
> 146355000 ms - (3,2)
> 146355000 ms batch finished
> 146356667 ms - (1,7)
> 146356667 ms - (7,1)
> 146356667 ms batch finished
> 1463566675000 ms - (1,8)
> 1463566675000 ms - (2,4)
> 1463566675000 ms batch finished
> 146356668 ms - (1,9)
> 146356668 ms - (3,3)
> 146356668 ms batch finished
> 1463566685000 ms - (1,10)
> 1463566685000 ms - (2,5)
> 1463566685000 ms - (5,2)
> 1463566685000 ms batch finished
> 146356669 ms - (11,1)
> 146356669 ms - (1,11)
> 146356669 ms batch finished
> 1463566695000 ms - (1,12)
> 1463566695000 ms - (2,6)
> 1463566695000 ms - (3,4)
> 1463566695000 ms batch finished
> 146356670 ms - (1,13)
> 146356670 ms - (13,1)
> 146356670 ms batch finished
> 1463566705000 ms - (1,14)
> 1463566705000 ms - (2,7)
> 1463566705000 ms - (7,2)
> 1463566705000 ms batch finished
> 146356671 ms - (1,15)
> 146356671 ms - (3,5)
> 146356671 ms - (5,3)
> 146356671 ms batch finished
> 1463566715000 ms - (1,16)
> 1463566715000 ms - (2,8)
> 1463566715000 ms batch finished
> 146356672 ms - (1,17)
> 146356672 ms - (17,1)
> 146356672 ms batch finished
>

[jira] [Updated] (SPARK-15384) Codegen CompileException "mapelements_isNull" is not an rvalue

2016-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15384:
-
Assignee: Wenchen Fan
Target Version/s: 2.0.0

> Codegen CompileException "mapelements_isNull" is not an rvalue
> --
>
> Key: SPARK-15384
> URL: https://issues.apache.org/jira/browse/SPARK-15384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Critical
>
> this runs fine:
> {noformat}
> val df = sc.parallelize(List(("1", "2"), ("3", "4"))).toDF("a", "b")
> df
>   .map(row => row)(RowEncoder(df.schema))
>   .select("a", "b")
>   .show
> {noformat}
> however this fails:
> {noformat}
> val df = sc.parallelize(List(("1", "2"), ("3", "4"))).toDF("a", "b")
> df
>   .map(row => row)(RowEncoder(df.schema))
>   .select("b", "a")
>   .show
> {noformat}
> the error is:
> {noformat}
> java.lang.Exception: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 94, Column 57: Expression "mapelements_isNull" is not an rvalue
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /** Codegened pipeline for:
> /* 006 */ * Project [b#11,a#10]
> /* 007 */ +- SerializeFromObject [if (input[0, 
> org.apache.spark.sql.Row].isNullAt) null else staticinvoke(class org.ap...
> /* 008 */   */
> /* 009 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 010 */   private Object[] references;
> /* 011 */   private scala.collection.Iterator inputadapter_input;
> /* 012 */   private UnsafeRow project_result;
> /* 013 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 014 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 015 */   private Object[] deserializetoobject_values;
> /* 016 */   private org.apache.spark.sql.types.StructType 
> deserializetoobject_schema;
> /* 017 */   private UnsafeRow deserializetoobject_result;
> /* 018 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
> deserializetoobject_holder;
> /* 019 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> deserializetoobject_rowWriter;
> /* 020 */   private UnsafeRow mapelements_result;
> /* 021 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
> mapelements_holder;
> /* 022 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> mapelements_rowWriter;
> /* 023 */   private UnsafeRow serializefromobject_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
> serializefromobject_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> serializefromobject_rowWriter;
> /* 026 */   private UnsafeRow project_result1;
> /* 027 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
> project_holder1;
> /* 028 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter1;
> /* 029 */  
> /* 030 */   public GeneratedIterator(Object[] references) {
> /* 031 */ this.references = references;
> /* 032 */   }
> /* 033 */  
> /* 034 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 035 */ partitionIndex = index;
> /* 036 */ inputadapter_input = inputs[0];
> /* 037 */ project_result = new UnsafeRow(2);
> /* 038 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  64);
> /* 039 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  2);
> /* 040 */
> /* 041 */ this.deserializetoobject_schema = 
> (org.apache.spark.sql.types.StructType) references[0];
> /* 042 */ deserializetoobject_result = new UnsafeRow(1);
> /* 043 */ this.deserializetoobject_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result,
>  

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Michael Armbrust
+1, excited for 2.0!

On Wed, May 18, 2016 at 10:06 AM, Krishna Sankar 
wrote:

> +1. Looks Good.
> The mllib results are in line with 1.6.1. Deprecation messages. I will
> convert to ml and test later in the day.
> Also will try GraphX exercises for our Strata London Tutorial
>
> Quick Notes:
>
>1. pyspark env variables need to be changed
>- IPYTHON and IPYTHON_OPTS are removed
>   - This works
>  - PYSPARK_DRIVER_PYTHON=ipython
>  PYSPARK_DRIVER_PYTHON_OPTS="notebook"
>  ~/Downloads/spark-2.0.0-preview/bin/pyspark --packages
>  com.databricks:spark-csv_2.10:1.4.0
>   2.  maven 3.3.9 is required. (I was running 3.3.3)
>3.  Tons of interesting warnings and deprecations.
>   - The messages look descriptive and very helpful (Thanks. This will
>   help migration to 2.0, mllib -> ml et al). Will dig deeper.
>   4. Compiled OSX 10.10 (Yosemite) OK Total time: 31:28 min
> mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>- Spark version is 2.0.0-preview
>   - Tested pyspark, mllib (iPython 4.2.0)
>
> Cheers & Good work folks
> 
>
> On Wed, May 18, 2016 at 7:28 AM, Sean Owen  wrote:
>
>> I think it's a good idea. Although releases have been preceded before
>> by release candidates for developers, it would be good to get a formal
>> preview/beta release ratified for public consumption ahead of a new
>> major release. Better to have a little more testing in the wild to
>> identify problems before 2.0.0 is finalized.
>>
>> +1 to the release. License, sigs, etc check out. On Ubuntu 16 + Java
>> 8, compilation and tests succeed for "-Pyarn -Phive
>> -Phive-thriftserver -Phadoop-2.6".
>>
>> On Wed, May 18, 2016 at 6:40 AM, Reynold Xin  wrote:
>> > Hi,
>> >
>> > In the past the Apache Spark community have created preview packages
>> (not
>> > official releases) and used those as opportunities to ask community
>> members
>> > to test the upcoming versions of Apache Spark. Several people in the
>> Apache
>> > community have suggested we conduct votes for these preview packages and
>> > turn them into formal releases by the Apache foundation's standard.
>> Preview
>> > releases are not meant to be functional, i.e. they can and highly likely
>> > will contain critical bugs or documentation errors, but we will be able
>> to
>> > post them to the project's website to get wider feedback. They should
>> > satisfy the legal requirements of Apache's release policy
>> > (http://www.apache.org/dev/release.html) such as having proper
>> licenses.
>> >
>> >
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.0.0-preview. The vote is open until Friday, May 20, 2015 at 11:00 PM
>> PDT
>> > and passes if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.0.0-preview
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is 2.0.0-preview
>> > (8f5a04b6299e3a47aca13cbb40e72344c0114860)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> >
>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The documentation corresponding to this release can be found at:
>> >
>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
>> >
>> > The list of resolved issues are:
>> >
>> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
>> >
>> >
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Apache Spark workload and running on this candidate, then
>> reporting
>> > any regressions.
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-17 Thread Michael Armbrust
Yeah, can you open a JIRA with that reproduction please?  You can ping me
on it.

On Tue, May 17, 2016 at 4:55 PM, Reynold Xin  wrote:

> It seems like the problem here is that we are not using unique names
> for mapelements_isNull?
>
>
>
> On Tue, May 17, 2016 at 3:29 PM, Koert Kuipers  wrote:
>
>> hello all, we are slowly expanding our test coverage for spark
>> 2.0.0-SNAPSHOT to more in-house projects. today i ran into this issue...
>>
>> this runs fine:
>> val df = sc.parallelize(List(("1", "2"), ("3", "4"))).toDF("a", "b")
>> df
>>   .map(row => row)(RowEncoder(df.schema))
>>   .select("a", "b")
>>   .show
>>
>> however this fails:
>> val df = sc.parallelize(List(("1", "2"), ("3", "4"))).toDF("a", "b")
>> df
>>   .map(row => row)(RowEncoder(df.schema))
>>   .select("b", "a")
>>   .show
>>
>> the error is:
>> java.lang.Exception: failed to compile:
>> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line
>> 94, Column 57: Expression "mapelements_isNull" is not an rvalue
>> /* 001 */ public Object generate(Object[] references) {
>> /* 002 */   return new GeneratedIterator(references);
>> /* 003 */ }
>> /* 004 */
>> /* 005 */ /** Codegened pipeline for:
>> /* 006 */ * Project [b#11,a#10]
>> /* 007 */ +- SerializeFromObject [if (input[0,
>> org.apache.spark.sql.Row].isNullAt) null else staticinvoke(class org.ap...
>> /* 008 */   */
>> /* 009 */ final class GeneratedIterator extends
>> org.apache.spark.sql.execution.BufferedRowIterator {
>> /* 010 */   private Object[] references;
>> /* 011 */   private scala.collection.Iterator inputadapter_input;
>> /* 012 */   private UnsafeRow project_result;
>> /* 013 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
>> project_holder;
>> /* 014 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
>> project_rowWriter;
>> /* 015 */   private Object[] deserializetoobject_values;
>> /* 016 */   private org.apache.spark.sql.types.StructType
>> deserializetoobject_schema;
>> /* 017 */   private UnsafeRow deserializetoobject_result;
>> /* 018 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
>> deserializetoobject_holder;
>> /* 019 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
>> deserializetoobject_rowWriter;
>> /* 020 */   private UnsafeRow mapelements_result;
>> /* 021 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
>> mapelements_holder;
>> /* 022 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
>> mapelements_rowWriter;
>> /* 023 */   private UnsafeRow serializefromobject_result;
>> /* 024 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
>> serializefromobject_holder;
>> /* 025 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
>> serializefromobject_rowWriter;
>> /* 026 */   private UnsafeRow project_result1;
>> /* 027 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
>> project_holder1;
>> /* 028 */   private
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
>> project_rowWriter1;
>> /* 029 */
>> /* 030 */   public GeneratedIterator(Object[] references) {
>> /* 031 */ this.references = references;
>> /* 032 */   }
>> /* 033 */
>> /* 034 */   public void init(int index, scala.collection.Iterator
>> inputs[]) {
>> /* 035 */ partitionIndex = index;
>> /* 036 */ inputadapter_input = inputs[0];
>> /* 037 */ project_result = new UnsafeRow(2);
>> /* 038 */ this.project_holder = new
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>> 64);
>> /* 039 */ this.project_rowWriter = new
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>> 2);
>> /* 040 */
>> /* 041 */ this.deserializetoobject_schema =
>> (org.apache.spark.sql.types.StructType) references[0];
>> /* 042 */ deserializetoobject_result = new UnsafeRow(1);
>> /* 043 */ this.deserializetoobject_holder = new
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result,
>> 32);
>> /* 044 */ this.deserializetoobject_rowWriter = new
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(deserializetoobject_holder,
>> 1);
>> /* 045 */ mapelements_result = new UnsafeRow(1);
>> /* 046 */ this.mapelements_holder = new
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result,
>> 32);
>> /* 047 */ this.mapelements_rowWriter = new
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder,
>> 1);
>> /* 048 */ serializefromobject_result = new UnsafeRow(2);
>> /* 049 */ this.serializefromobject_holder = new
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result,
>> 64);
>> /* 050 */ this.serializefromobject_rowWriter = new
>> org.apache.spark.sql.catalys

[jira] [Updated] (SPARK-15367) Add refreshTable back

2016-05-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15367:
-
Priority: Critical  (was: Major)

> Add refreshTable back 
> --
>
> Key: SPARK-15367
> URL: https://issues.apache.org/jira/browse/SPARK-15367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> refreshTable was a method in HiveContext. It was deleted accidentally while 
> we were migrating the APIs. Let's add it back to HiveContext. For 
> SparkSession, let's put it under the catalog namespace 
> (SparkSession.catalog.refreshTable).
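A short sketch of the intended call sites once this is restored, per the description above (table name illustrative):

// SparkSession path (catalog namespace):
spark.catalog.refreshTable("db.tbl")

// Restored HiveContext method (1.x-compatible path):
hiveContext.refreshTable("db.tbl")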






Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Michael Armbrust
In 2.0 you won't be able to do this.  The long term vision would be to make
this possible, but a window will be required (like the 24 hours you
suggest).

On Tue, May 17, 2016 at 1:36 AM, Todd  wrote:

> Hi,
> We have a requirement to do count(distinct) in a processing batch against
> all the streaming data (e.g., the last 24 hours' data); that is, when we do
> count(distinct), we actually want to compute the distinct count against the
> last 24 hours' data.
> Does Structured Streaming support this scenario? Thanks!
>


Re: Inferring schema from GenericRowWithSchema

2016-05-17 Thread Michael Armbrust
I don't think that you will be able to do that.  ScalaReflection is based
on the TypeTag of the object, and thus the schema of any particular object
won't be available to it.

Instead I think you want to use the register functions in UDFRegistration
that take a schema. Does that make sense?
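A minimal sketch of that direction, using the UDFRegistration overloads that take an explicit return DataType (here via the Java UDF2 interface); this avoids ScalaReflection entirely because the schema is supplied as addressType:

import org.apache.spark.sql.Row
import org.apache.spark.sql.api.java.UDF2
import org.apache.spark.sql.types._

val addressType = StructType(List(
  StructField("state", DataTypes.StringType),
  StructField("zipcode", DataTypes.IntegerType)))

// The explicit addressType tells Catalyst the struct schema of the result.
sqlContext.udf.register("Address",
  new UDF2[String, Integer, Row] {
    override def call(state: String, zipcode: Integer): Row = Row(state, zipcode)
  },
  addressType)

sqlContext.sql("SELECT Address('NY', 12345)").show(10)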

On Tue, May 17, 2016 at 11:48 AM, Andy Grove 
wrote:

>
> Hi,
>
> I have a requirement to create types dynamically in Spark and then
> instantiate those types from Spark SQL via a UDF.
>
> I tried doing the following:
>
> val addressType = StructType(List(
>   new StructField("state", DataTypes.StringType),
>   new StructField("zipcode", DataTypes.IntegerType)
> ))
>
> sqlContext.udf.register("Address", (args: Seq[Any]) => new
> GenericRowWithSchema(args.toArray, addressType))
>
> sqlContext.sql("SELECT Address('NY', 12345)").show(10)
>
> This seems reasonable to me but this fails with:
>
> Exception in thread "main" java.lang.UnsupportedOperationException: Schema
> for type org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is
> not supported
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:685)
> at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:130)
>
> It looks like it would be simple to update ScalaReflection to be able to
> infer the schema from a GenericRowWithSchema, but before I file a JIRA and
> submit a patch I wanted to see if there is already a way of achieving this.
>
> Thanks,
>
> Andy.
>
>
>


[jira] [Resolved] (SPARK-10216) Avoid creating empty files during overwrite into Hive table with group by query

2016-05-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10216.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12855
[https://github.com/apache/spark/pull/12855]

> Avoid creating empty files during overwrite into Hive table with group by 
> query
> ---
>
> Key: SPARK-10216
> URL: https://issues.apache.org/jira/browse/SPARK-10216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Keuntae Park
>Priority: Minor
> Fix For: 2.0.0
>
>
> Exchange from a GROUP BY query results in at least the number of partitions 
> specified in 'spark.sql.shuffle.partitions'.
> Hence, even when the number of distinct group-by keys is small, 
> INSERT INTO with a GROUP BY query tries to make at least 200 files (the default 
> value of 'spark.sql.shuffle.partitions'), 
> which results in lots of empty files.
> I think this is undesirable because upcoming queries on the resulting table 
> will also see zero-size partitions, and unnecessary tasks do nothing while 
> handling the queries.
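A hedged sketch reproducing the situation described (table names illustrative, Hive support assumed):

// With the default spark.sql.shuffle.partitions=200 and only a few distinct
// keys, most shuffle partitions are empty, and each empty partition used to
// produce an empty output file in the target table.
sqlContext.sql("SET spark.sql.shuffle.partitions=200")
sqlContext.sql(
  "INSERT OVERWRITE TABLE target_tbl SELECT key, count(*) AS cnt FROM src_tbl GROUP BY key")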






Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-13 Thread Michael Armbrust
+1 to the general structure of Reynold's proposal.  I've found what we do
currently a little confusing.  In particular, it doesn't make much sense
that @DeveloperApi things are always labeled as possibly changing.  For
example, the Data Source API should arguably be one of the most stable
interfaces, since it's very difficult for users to recompile libraries that
might break when there are changes.

For a similar reason, I don't really see the point of LimitedPrivate.  The
goal here should be communication of promises of stability or future
stability.

Regarding Developer vs. Public: I don't care too much about the naming, but
it does seem useful to differentiate APIs that we expect end users to
consume from those that are used to augment Spark. "Library" and
"Application" also seem reasonable.

On Fri, May 13, 2016 at 11:15 AM, Marcelo Vanzin 
wrote:

> On Fri, May 13, 2016 at 10:18 AM, Sean Busbey  wrote:
> > I think LimitedPrivate gets a bad rap due to the way it is misused in
> > Hadoop. The use case here -- "we offer this to developers of
> > intermediate layers; those willing to update their software as we
> > update ours"
>
> I think "LimitedPrivate" is a rather confusing name for that. I think
> Reynold's first e-mail better matches that use case: this would be
> "InterfaceAudience(Developer)" and "InterfaceStability(Experimental)".
>
> But I don't really like "Developer" as a name here, because it's
> ambiguous. Developer of what? Theoretically everybody writing Spark or
> on top of its APIs is a developer. In that sense, I prefer using
> something like "Library" and "Application" instead of "Developer" and
> "Public".
>
> Personally, in fact, I don't see a lot of gain in differentiating
> between the target users of an interface... knowing whether it's a
> stable interface or not is a lot more useful. If you're equating a
> "developer API" with "it's not really stable", then you don't really
> need two annotations for that - just say it's not stable.
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Michael Armbrust
>
>
> logical plan after optimizer execution:
>
> Project [id#0L,id#1L]
> !+- Filter (id#0L = cast(1 as bigint))
> !   +- Join Inner, Some((id#0L = id#1L))
> !  :- Subquery t
> !  :  +- Relation[id#0L] JSONRelation
> !  +- Subquery u
> !  +- Relation[id#1L] JSONRelation
>

I think you are mistaken.  If this were the optimized plan, there would be no
subqueries.


Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Armbrust
That is a forward-looking design doc, and not all of it has been implemented
yet.  With Spark 2.0 the main sources and sinks will be file-based, though
we hope to quickly expand that now that a lot of infrastructure is in place.

On Fri, May 6, 2016 at 2:11 PM, Ted Yu  wrote:

> I was
> reading 
> StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf
> attached to SPARK-8360
>
> On page 12, there was mentioning of .format(“kafka”) but I searched the
> codebase and didn't find any occurrence.
>
> FYI
>
> On Fri, May 6, 2016 at 1:06 PM, Michael Malak <
> michaelma...@yahoo.com.invalid> wrote:
>
>> At first glance, it looks like the only streaming data sources available
>> out of the box from the github master branch are
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
>>  and
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala
>>  .
>> Out of the Jira epic for Structured Streaming
>> https://issues.apache.org/jira/browse/SPARK-8360 it would seem the
>> still-open https://issues.apache.org/jira/browse/SPARK-10815 "API
>> design: data sources and sinks" is relevant here.
>>
>> In short, it would seem the code is not there yet to create a Kafka-fed
>> Dataframe/Dataset that can be queried with Structured Streaming; or if it
>> is, it's not obvious how to write such code.
>>
>>
>> --
>> *From:* Anthony May 
>> *To:* Deepak Sharma ; Sunita Arvind <
>> sunitarv...@gmail.com>
>> *Cc:* "user@spark.apache.org" 
>> *Sent:* Friday, May 6, 2016 11:50 AM
>> *Subject:* Re: Adhoc queries on Spark 2.0 with Structured Streaming
>>
>> Yeah, there isn't even a RC yet and no documentation but you can work off
>> the code base and test suites:
>> https://github.com/apache/spark
>> And this might help:
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala
>>
>> On Fri, 6 May 2016 at 11:07 Deepak Sharma  wrote:
>>
>> Spark 2.0 is yet to come out for public release.
>> I am waiting to get hands on it as well.
>> Please do let me know if i can download source and build spark2.0 from
>> github.
>>
>> Thanks
>> Deepak
>>
>> On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind 
>> wrote:
>>
>> Hi All,
>>
>> We are evaluating a few real-time streaming query engines and Spark is my
>> personal choice. The addition of ad hoc queries is what is getting me even
>> more excited about it; however, the talks I have heard so far only
>> mention it but do not provide details. I need to build a prototype to
>> ensure it works for our use cases.
>>
>> Can someone point me to relevant material for this.
>>
>> regards
>> Sunita
>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>>
>>
>>
>


[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null

2016-05-05 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273100#comment-15273100
 ] 

Michael Armbrust commented on SPARK-15140:
--

The 2.0 behavior seems correct.  Ideally .toDS().collect() will always 
round-trip the data without change.

> ensure input object of encoder is not null
> --
>
> Key: SPARK-15140
> URL: https://issues.apache.org/jira/browse/SPARK-15140
> Project: Spark
>  Issue Type: Improvement
>Reporter: Wenchen Fan
>
> Currently we assume the input object for the encoder won't be null, but we don't 
> check it. For example, in 1.6 `Seq("a", null).toDS.collect` will throw NPE, 
> in 2.0 this will return Array("a", null).
> We should define this behaviour more clearly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2016-05-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14959:
-
Priority: Critical  (was: Major)

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Critical
>
> Hello,
> I have noticed that in the past few days there is an issue when trying to read 
> partitioned files from HDFS.
> I am running on Spark master branch #c544356
> The write actually works but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372)
>   at 
> org.apache.spark.sql.executio

[jira] [Updated] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2016-05-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14959:
-
Target Version/s: 2.0.0
 Component/s: (was: Input/Output)
  SQL

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>
> Hello,
> I have noticed that in the past few days there is an issue when trying to read 
> partitioned files from HDFS.
> I am running on Spark master branch #c544356
> The write actually works but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala

Re: Accessing JSON array in Spark SQL

2016-05-05 Thread Michael Armbrust
Use df.selectExpr to evaluate complex expressions (instead of just column
names).
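
For example, against the schema in your message (a sketch; if the expression
parser in your 1.6 build rejects the chained form, the equivalent Column API
call below does the same thing):

    // selectExpr accepts full SQL expressions, so array indexing and struct
    // field access both work:
    df1.selectExpr("f1a[0] AS first").show()
    df1.selectExpr("f1a[0].f2 AS f2").show()

    // Column API equivalent; this nests to arbitrary depth the same way:
    df1.select(df1("f1a").getItem(0).getField("f2")).show()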

On Thu, May 5, 2016 at 11:53 AM, Xinh Huynh  wrote:

> Hi,
>
> I am having trouble accessing an array element in JSON data with a
> dataframe. Here is the schema:
>
> val json1 = """{"f1":"1", "f1a":[{"f2":"2"}] } }"""
> val rdd1 = sc.parallelize(List(json1))
> val df1 = sqlContext.read.json(rdd1)
> df1.printSchema()
>
> root
>  |-- f1: string (nullable = true)
>  |-- f1a: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- f2: string (nullable = true)
>
> I would expect to be able to select the first element of "f1a" this way:
> df1.select("f1a[0]").show()
>
> org.apache.spark.sql.AnalysisException: cannot resolve 'f1a[0]' given
> input columns f1, f1a;
>
> This is with Spark 1.6.0.
>
> Please help. A follow-up question is: can I access arbitrary levels of
> nested JSON array of struct of array of struct?
>
> Thanks,
> Xinh
>


[jira] [Resolved] (SPARK-15077) StreamExecution.awaitOffset may take too long because of thread starvation

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15077.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12852
[https://github.com/apache/spark/pull/12852]

> StreamExecution.awaitOffset may take too long because of thread starvation
> --
>
> Key: SPARK-15077
> URL: https://issues.apache.org/jira/browse/SPARK-15077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now StreamExecution.awaitBatchLock uses an unfair lock. This may cause 
> StreamExecution.awaitOffset to run too long and fail some tests because of a 
> timeout. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/865/testReport/junit/org.apache.spark.sql.streaming/FileStreamSourceStressTestSuite/file_source_stress_test/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15062) Show on DataFrame causes OutOfMemoryError, NegativeArraySizeException or segfault

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15062:
-
Assignee: Bo Meng

> Show on DataFrame causes OutOfMemoryError, NegativeArraySizeException or 
> segfault 
> --
>
> Key: SPARK-15062
> URL: https://issues.apache.org/jira/browse/SPARK-15062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark-2.0.0-SNAPSHOT using commit hash 
> 90787de864b58a1079c23e6581381ca8ffe7685f and Java 1.7.0_67
>Reporter: koert kuipers
>Assignee: Bo Meng
>Priority: Blocker
> Fix For: 2.0.0
>
>
> {noformat}
> scala> val dfComplicated = sc.parallelize(List((Map("1" -> "a"), List("b", 
> "c")), (Map("2" -> "b"), List("d", "e")))).toDF
> ...
> dfComplicated: org.apache.spark.sql.DataFrame = [_1: map<string,string>, _2: 
> array<string>]
> scala> dfComplicated.show
> java.lang.OutOfMemoryError: Java heap space
>   at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2121)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2408)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2120)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2127)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1861)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1860)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2438)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1860)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2077)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:238)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:529)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:489)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:498)
>   ... 6 elided
> scala>
> {noformat}
> By increasing memory to 8G one will instead get a NegativeArraySizeException 
> or a segfault.
> See here for original discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-2-segfault-td17381.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15062) Show on DataFrame causes OutOfMemoryError, NegativeArraySizeException or segfault

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15062.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12849
[https://github.com/apache/spark/pull/12849]

> Show on DataFrame causes OutOfMemoryError, NegativeArraySizeException or 
> segfault 
> --
>
> Key: SPARK-15062
> URL: https://issues.apache.org/jira/browse/SPARK-15062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark-2.0.0-SNAPSHOT using commit hash 
> 90787de864b58a1079c23e6581381ca8ffe7685f and Java 1.7.0_67
>Reporter: koert kuipers
>Priority: Blocker
> Fix For: 2.0.0
>
>
> {noformat}
> scala> val dfComplicated = sc.parallelize(List((Map("1" -> "a"), List("b", 
> "c")), (Map("2" -> "b"), List("d", "e")))).toDF
> ...
> dfComplicated: org.apache.spark.sql.DataFrame = [_1: map<string,string>, _2: 
> array<string>]
> scala> dfComplicated.show
> java.lang.OutOfMemoryError: Java heap space
>   at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:821)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2121)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2408)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2120)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2127)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1861)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1860)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2438)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1860)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2077)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:238)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:529)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:489)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:498)
>   ... 6 elided
> scala>
> {noformat}
> By increasing memory to 8G one will instead get a NegativeArraySizeException 
> or a segfault.
> See here for original discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-2-segfault-td17381.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14747) Add assertStreaming/assertNoneStreaming checks in DataFrameWriter

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14747.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12521
[https://github.com/apache/spark/pull/12521]

> Add assertStreaming/assertNoneStreaming checks in DataFrameWriter
> -
>
> Key: SPARK-14747
> URL: https://issues.apache.org/jira/browse/SPARK-14747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
>
> If an end user happens to write code that mixes continuous-query-oriented 
> methods with non-continuous-query-oriented methods:
> {code}
> ctx.read
>.format("text")
>.stream("...")  // continuous query
>.write
>.text("...")// non-continuous query
> {code}
> He/she would get a rather confusing exception:
> {quote}
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> plan for FileSource\[./continuous_query_test_input\]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at ...
> {quote}
> This JIRA proposes to add checks for continuous-query-oriented methods and 
> non-continuous-query-oriented methods in `DataFrameWriter`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14830) Add RemoveRepetitionFromGroupExpressions optimizer

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14830.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12590
[https://github.com/apache/spark/pull/12590]

> Add RemoveRepetitionFromGroupExpressions optimizer
> --
>
> Key: SPARK-14830
> URL: https://issues.apache.org/jira/browse/SPARK-14830
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Reporter: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to optimize `GroupExpressions` by removing repeating 
> expressions. **RemoveRepetitionFromGroupExpressions** is added.
> **Before**
> {code}
> scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 
> 1+A").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + 
> A#0)#9], functions=[], output=[(a + 1)#5])
> : +- INPUT
> +- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + 
> A#0)#9, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + 
> a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], 
> output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
>   : +- INPUT
>   +- LocalTableScan [a#0], [[1],[2]]
> {code}
> **After**
> {code}
> scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 
> 1+A").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
> : +- INPUT
> +- Exchange hashpartitioning((a#0 + 1)#6, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], 
> output=[(a#0 + 1)#6])
>   : +- INPUT
>   +- LocalTableScan [a#0], [[1],[2]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14579) Fix a race condition in StreamExecution.processAllAvailable

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14579.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12582
[https://github.com/apache/spark/pull/12582]

> Fix a race condition in StreamExecution.processAllAvailable
> ---
>
> Key: SPARK-14579
> URL: https://issues.apache.org/jira/browse/SPARK-14579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> See the PR description for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14637) object expressions cleanup

2016-05-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14637.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12399
[https://github.com/apache/spark/pull/12399]

> object expressions cleanup
> --
>
> Key: SPARK-14637
> URL: https://issues.apache.org/jira/browse/SPARK-14637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14970) DataSource enumerates all files in FileCatalog to infer schema even if there is user specified schema

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14970.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> DataSource enumerates all files in FileCatalog to infer schema even if there 
> is user specified schema
> -
>
> Key: SPARK-14970
> URL: https://issues.apache.org/jira/browse/SPARK-14970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14997) Files in subdirectories are incorrectly considered in sqlContext.read.json()

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14997:
-
Labels: regresion  (was: )

> Files in subdirectories are incorrectly considered in sqlContext.read.json()
> 
>
> Key: SPARK-14997
> URL: https://issues.apache.org/jira/browse/SPARK-14997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Priority: Critical
>  Labels: regresion
>
> Let's say there are JSON files in the following directory structure:
> xyz/file0.json
> xyz/subdir1/file1.json
> xyz/subdir2/file2.json
> xyz/subdir1/subsubdir1/file3.json
> sqlContext.read.json("xyz") should read only file0.json according to the behavior 
> in Spark 1.6.1. However, in the current master, all 4 files are read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15011) org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelations fails when hadoop 2.3 or hadoop 2.4 is used

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15011:
-
Priority: Critical  (was: Major)

> org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelations fails 
> when hadoop 2.3 or hadoop 2.4 is used
> 
>
> Key: SPARK-15011
> URL: https://issues.apache.org/jira/browse/SPARK-15011
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Priority: Critical
>  Labels: flaky-test
>
> Let's disable it first.
> https://spark-tests.appspot.com/tests/org.apache.spark.sql.hive.StatisticsSuite/analyze%20MetastoreRelations



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14993) Inconsistent behavior of partitioning discovery

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14993:
-
Priority: Critical  (was: Major)

> Inconsistent behavior of partitioning discovery
> ---
>
> Key: SPARK-14993
> URL: https://issues.apache.org/jira/browse/SPARK-14993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we load a dataset, if we set the path to {{/path/a=1}}, we will not take 
> a as the partitioning column. However, if we set the path to 
> {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows 
> up in the schema. We should make the behaviors of these two cases consistent 
> by not putting a into the schema for the second case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14337) Push down casts beneath CaseWhen and If expressions

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14337.
--
  Resolution: Later
Target Version/s:   (was: 2.0.0)

Closing as later since the PR was closed.

> Push down casts beneath CaseWhen and If expressions
> ---
>
> Key: SPARK-14337
> URL: https://issues.apache.org/jira/browse/SPARK-14337
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> If we have a CAST on top of a CASE WHEN or IF expression, then we should push 
> the cast to each conditional branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14981) CatalogTable should contain sorting directions of sorting columns

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14981.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12759
[https://github.com/apache/spark/pull/12759]

> CatalogTable should contain sorting directions of sorting columns
> -
>
> Key: SPARK-14981
> URL: https://issues.apache.org/jira/browse/SPARK-14981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
> Fix For: 2.0.0
>
>
> For a bucketed table with sorting columns, {{CatalogTable}} only records 
> sorting column names, while sorting directions (ASC/DESC) are missing.
> Our SQL parser supports the syntax, but sorting directions are silently 
> dropped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264836#comment-15264836
 ] 

Michael Armbrust commented on SPARK-6817:
-

[~shivaram] Sill trying to get any of this in Spark 2.0?  Or should we retarget?

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14997) Files in subdirectories are incorrectly considered in sqlContext.read.json()

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14997:
-
Priority: Critical  (was: Major)

> Files in subdirectories are incorrectly considered in sqlContext.read.json()
> 
>
> Key: SPARK-14997
> URL: https://issues.apache.org/jira/browse/SPARK-14997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Priority: Critical
>  Labels: regresion
>
> Let's say there are JSON files in the following directory structure:
> xyz/file0.json
> xyz/subdir1/file1.json
> xyz/subdir2/file2.json
> xyz/subdir1/subsubdir1/file3.json
> sqlContext.read.json("xyz") should read only file0.json according to the behavior 
> in Spark 1.6.1. However, in the current master, all 4 files are read. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12854) Vectorize Parquet reader

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12854:
-
Assignee: Nong Li

> Vectorize Parquet reader
> 
>
> Key: SPARK-12854
> URL: https://issues.apache.org/jira/browse/SPARK-12854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
>
> The parquet encodings are largely designed to decode faster in batches, 
> column by column. This can speed up the decoding considerably.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12854) Vectorize Parquet reader

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12854.
--
Resolution: Fixed

Closing since all subtasks are done

> Vectorize Parquet reader
> 
>
> Key: SPARK-12854
> URL: https://issues.apache.org/jira/browse/SPARK-12854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
>
> The parquet encodings are largely designed to decode faster in batches, 
> column by column. This can speed up the decoding considerably.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13421) Make output of a SparkPlan configurable

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13421:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Make output of a SparkPlan configurable
> ---
>
> Key: SPARK-13421
> URL: https://issues.apache.org/jira/browse/SPARK-13421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> A SparkPlan currently outputs an iterator of {{InternalRow}}s. This works 
> fine for many purposes. However an iterator is not a natural fit when we need 
> a (broadcasted) index or a {{ColumnarBatch}}.
> A SparkPlan should be able to define the shape/form/organization of the 
> output of its children.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12852) Support create table DDL with bucketing

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12852.
--
Resolution: Later

> Support create table DDL with bucketing
> ---
>
> Key: SPARK-12852
> URL: https://issues.apache.org/jira/browse/SPARK-12852
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12849) Bucketing improvements follow-up

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12849.
--
  Resolution: Later
Target Version/s:   (was: 2.0.0)

> Bucketing improvements follow-up
> 
>
> Key: SPARK-12849
> URL: https://issues.apache.org/jira/browse/SPARK-12849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a follow-up ticket for SPARK-12538 to improve bucketing support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12851) Add the ability to understand tables bucketed by Hive

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12851.
--
Resolution: Later

> Add the ability to understand tables bucketed by Hive
> -
>
> Key: SPARK-12851
> URL: https://issues.apache.org/jira/browse/SPARK-12851
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We added bucketing functionality, but we currently do not understand the 
> bucketing properties if a table was generated by Hive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13571) Track current database in SQL/HiveContext

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13571.
--
Resolution: Fixed

I think this was done as part of another PR.

> Track current database in SQL/HiveContext
> -
>
> Key: SPARK-13571
> URL: https://issues.apache.org/jira/browse/SPARK-13571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We already have internal APIs for Hive to do this. We should do it for 
> SQLContext too so we can merge these code paths one day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13424) Improve test coverage of EnsureRequirements

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13424.
--
Resolution: Later

Closing along with the PR, reopen when you have time.

> Improve test coverage of EnsureRequirements
> ---
>
> Key: SPARK-13424
> URL: https://issues.apache.org/jira/browse/SPARK-13424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> The testing of EnsureRequirements is currently quite limited. We should bring 
> this up to par.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14273:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Add FileFormat.isSplittable to indicate whether a format is splittable
> --
>
> Key: SPARK-14273
> URL: https://issues.apache.org/jira/browse/SPARK-14273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{FileSourceStrategy}} assumes that all data source formats are splittable 
> and always splits data files by a fixed partition size. However, not all HDFS-
> based formats are splittable. We need a flag to indicate that and ensure that 
> non-splittable files won't be split into multiple Spark partitions.
> (PS: Is it "splitable" or "splittable"? Probably the latter one? Hadoop uses 
> the former one though...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14237) De-duplicate partition value appending logic in various buildReader() implementations

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14237:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> De-duplicate partition value appending logic in various buildReader() 
> implementations
> -
>
> Key: SPARK-14237
> URL: https://issues.apache.org/jira/browse/SPARK-14237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Various data sources share approximately the same code for partition value 
> appending. Would be nice to make it a utility method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14237) De-duplicate partition value appending logic in various buildReader() implementations

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14237:
-
Parent Issue: SPARK-13682  (was: SPARK-13664)

> De-duplicate partition value appending logic in various buildReader() 
> implementations
> -
>
> Key: SPARK-14237
> URL: https://issues.apache.org/jira/browse/SPARK-14237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Various data sources share approximately the same code for partition value 
> appending. Would be nice to make it a utility method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13683) Finalize the public interface for OutputWriter[Factory]

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13683:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Finalize the public interface for OutputWriter[Factory]
> ---
>
> Key: SPARK-13683
> URL: https://issues.apache.org/jira/browse/SPARK-13683
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>
> We need to at least remove bucketing.  I would also like to remove {{Job}} 
> and the configuration stuff as well if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14273:
-
Parent Issue: SPARK-13682  (was: SPARK-13664)

> Add FileFormat.isSplittable to indicate whether a format is splittable
> --
>
> Key: SPARK-14273
> URL: https://issues.apache.org/jira/browse/SPARK-14273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{FileSourceStrategy}} assumes that all data source formats are splittable 
> and always splits data files by a fixed partition size. However, not all HDFS-
> based formats are splittable. We need a flag to indicate that and ensure that 
> non-splittable files won't be split into multiple Spark partitions.
> (PS: Is it "splitable" or "splittable"? Probably the latter one? Hadoop uses 
> the former one though...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13683) Finalize the public interface for OutputWriter[Factory]

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13683:
-
Parent Issue: SPARK-13682  (was: SPARK-13664)

> Finalize the public interface for OutputWriter[Factory]
> ---
>
> Key: SPARK-13683
> URL: https://issues.apache.org/jira/browse/SPARK-13683
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>
> We need to at least remove bucketing.  I would also like to remove {{Job}} 
> and the configuration stuff as well if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15016) Simplify and Speedup HadoopFSRelation (follow-up)

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15016.
--
Resolution: Duplicate

> Simplify and Speedup HadoopFSRelation (follow-up)
> -
>
> Key: SPARK-15016
> URL: https://issues.apache.org/jira/browse/SPARK-15016
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a follow-up to SPARK-13664.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13682:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>    Reporter: Michael Armbrust
>
> The current file format interface needs to be cleaned up before it's 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2016-04-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13682:
-
Issue Type: New Feature  (was: Sub-task)
Parent: (was: SPARK-13664)

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>    Reporter: Michael Armbrust
>
> The current file format interface needs to be cleaned up before it's 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: How can I bucketize / group a DataFrame from parquet files?

2016-04-27 Thread Michael Armbrust
Unfortunately, I don't think there is an easy way to do this in 1.6.  In
Spark 2.0 we will make DataFrame = Dataset[Row], so this should work out of
the box.
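
As a rough sketch of what that looks like once DataFrame = Dataset[Row] (the
path and grouping column are placeholders, and whatever you return per group
still needs an encoder, e.g. a tuple of primitives):

    import org.apache.spark.sql.Row

    val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/path/to/files")

    // Typed grouping directly on rows, no case class required.
    val buckets = df.groupByKey((r: Row) => r.getAs[String]("columnName"))

    // Process each bucket and return something Spark knows how to encode.
    val bucketSizes = buckets.mapGroups { (key, rows) => (key, rows.size) }
    bucketSizes.show()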

On Mon, Apr 25, 2016 at 11:08 PM, Brandon White 
wrote:

> I am creating a dataFrame from parquet files. The schema is based on the
> parquet files, I do not know it before hand. What I want to do is group the
> entire DF into buckets based on a column.
>
> val df = sqlContext.read.parquet("/path/to/files")
> val groupedBuckets: DataFrame[String, Array[Rows]] =
> df.groupBy($"columnName")
>
> I know this does not work because the DataFrame's groupBy is only used for
> aggregate functions. I cannot convert my DataFrame to a DataSet because I
> do not have a case class for the DataSet schema. The only thing I can do is
> convert the df to an RDD[Rows] and try to deal with the types. This is ugly
> and difficult.
>
> Is there any better way? Can I convert a DataFrame to a DataSet without a
> predefined case class?
>
> Brandon
>


[jira] [Resolved] (SPARK-14874) Remove the obsolete Batch representation

2016-04-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14874.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12638
[https://github.com/apache/spark/pull/12638]

> Remove the obsolete Batch representation
> 
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - -renames getBatch(...) to getData(...) for Source- (update: as discussed in 
> the PR, this is not necessary)
> - -renames addBatch(...) to addData(...) for Sink- (update: as discussed in 
> the PR, this is not necessary)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: XML Data Source for Spark

2016-04-25 Thread Michael Armbrust
You are using a version of the library that was compiled for a different
version of Scala than the version of Spark that you are using.  Make sure
that they match up.
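
In this case the NoSuchMethodError on scala.Predef$.$conforms usually means a
_2.11 build of the library is running against a Scala 2.10 build of Spark (the
default 1.6.1 downloads are built with 2.10).  For example (the version number
is illustrative):

    # spark-submit / spark-shell: pick the artifact whose suffix matches Spark's Scala version
    spark-submit --packages com.databricks:spark-xml_2.10:0.3.3 ...

    // sbt: let %% append the suffix that matches your scalaVersion
    libraryDependencies += "com.databricks" %% "spark-xml" % "0.3.3"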

On Mon, Apr 25, 2016 at 5:19 PM, Mohamed ismail  wrote:

> here is an example with code.
> http://stackoverflow.com/questions/33078221/xml-processing-in-spark
>
> I haven't tried.
>
>
> On Monday, April 25, 2016 1:06 PM, Jinan Alhajjaj <
> j.r.alhaj...@hotmail.com> wrote:
>
>
> Hi All,
> I am trying to use XML data source that is used for parsing and querying
> XML data with Apache Spark, for Spark SQL and data frames.I am using Apache
> spark version 1.6.1 and I am using Java as a programming language.
> I wrote this sample code :
> SparkConf conf = new SparkConf().setAppName("parsing").setMaster("local");
>
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> SQLContext sqlContext = new SQLContext(sc);
>
> DataFrame df =
> sqlContext.read().format("com.databricks.spark.xml").option("rowtag",
> "page").load("file.xml");
>
> When I run this code I faced a problem which is
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at com.databricks.spark.xml.XmlOptions.(XmlOptions.scala:26)
> at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:48)
> at
> com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:58)
> at
> com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at datbricxml.parsing.main(parsing.java:16).
> Please, I need to solve this error for my senior project ASAP.
>
>
>
>


Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-25 Thread Michael Armbrust
Spark SQL's query planner has always delayed building the RDD, so has never
needed to eagerly calculate the range boundaries (since Spark 1.0).
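
A small way to see the contrast in the shell (a sketch; the numbers are
arbitrary):

    // RDD API: sortByKey builds a RangePartitioner eagerly, which samples the
    // data and therefore shows up as a job before any action is called.
    val pairs  = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    val sorted = pairs.sortByKey()          // a sampling job runs here

    // DataFrame API: orderBy stays purely logical until an action runs.
    val df      = sqlContext.range(0, 1000000)
    val ordered = df.orderBy("id")          // no job yet
    ordered.count()                         // the sort is planned and executed here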

On Mon, Apr 25, 2016 at 2:04 AM, Praveen Devarao 
wrote:

> Thanks Reynold for the reason as to why sortBykey invokes a Job
>
> When you say "DataFrame/Dataset does not have this issue" is it right to
> assume you are referring to Spark 2.0 or Spark 1.6 DF already has built-in
> it?
>
> Thanking You
>
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:Reynold Xin 
> To:Praveen Devarao/India/IBM@IBMIN
> Cc:"dev@spark.apache.org" , user <
> u...@spark.apache.org>
> Date:25/04/2016 11:26 am
> Subject:Re: Do transformation functions on RDD invoke a Job
> [sc.runJob]?
> --
>
>
>
> Usually no - but sortByKey does because it needs the range boundary to be
> built in order to have the RDD. It is a long standing problem that's
> unfortunately very difficult to solve without breaking the RDD API.
>
> In DataFrame/Dataset we don't have this issue though.
>
>
> On Sun, Apr 24, 2016 at 10:54 PM, Praveen Devarao <*praveen...@in.ibm.com*
> > wrote:
> Hi,
>
> I have a streaming program with the block as below [ref:
> *https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala*
> 
> ]
>
> *1 val **lines *= *messages*.map(_._2)
> *2 val **hashTags *= *lines*.flatMap(status => status.split(*" "*
> ).filter(_.startsWith(*"#"*)))
>
> *3 val **topCounts60 *= *hashTags*.map((_, 1)).reduceByKey( _ + _ )
> *3a* .map { *case *(topic, count) => (count, topic) }
> *3b* .transform(_.sortByKey(*false*))
>
> *4a**topCounts60*.foreachRDD( rdd => {
> *4b* *val *topList = rdd.take( 10 )
> })
>
> This batch is triggering 2 jobs: one at line 3b (sortByKey) and
> the other at 4b (rdd.take). I agree that there is a Job triggered on
> line 4b, as take() is an action on an RDD, whereas on line 3b sortByKey is just
> a transformation function which, as per the docs, is lazily evaluated...but I see
> that this line uses a RangePartitioner, and RangePartitioner on
> initialization invokes a method called sketch() that invokes collect(),
> triggering a Job.
>
> My question: Is it expected that sortByKey will invoke a Job? If
> yes, why is sortByKey listed as a transformation and not an action? Are there
> any other functions like this that invoke a Job, though they are
> transformations and not actions?
>
> I am on Spark 1.6
>
> Thanking You
>
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
>


Re: Defining case class within main method throws "No TypeTag available for Accounts"

2016-04-25 Thread Michael Armbrust
When you define a class inside of a method, it implicitly has a pointer to
the outer scope of the method.  Spark doesn't have access to this scope, so
this makes it hard (impossible?) for us to construct new instances of that
class.

So, define your classes that you plan to use with Spark at the top level.
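
For example, a minimal sketch (the class and object names here are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Top level: Spark's reflection can find a TypeTag and construct instances.
case class Account(accountNumber: String, balance: Double)

object TopLevelCaseClassDemo {
  def main(args: Array[String]): Unit = {
    // Defining the case class here instead would tie it to the enclosing
    // method scope and fail with "No TypeTag available for Account".
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Account("10124772", 100.0))).toDF()
    df.show()
  }
}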

On Mon, Apr 25, 2016 at 9:36 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> I notice buiding with sbt if I define my case class *outside of main
> method* like below it works
>
>
> case class Accounts( TransactionDate: String, TransactionType: String,
> Description: String, Value: Double, Balance: Double, AccountName: String,
> AccountNumber : String)
>
> object Import_nw_10124772 {
>   def main(args: Array[String]) {
>   val conf = new SparkConf().
>setAppName("Import_nw_10124772").
>setMaster("local[12]").
>set("spark.driver.allowMultipleContexts", "true").
>set("spark.hadoop.validateOutputSpecs", "false")
>   val sc = new SparkContext(conf)
>   // Create sqlContext based on HiveContext
>   val sqlContext = new HiveContext(sc)
>   import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>   //
>   // Get a DF first based on Databricks CSV libraries ignore column
> heading because of column called "Type"
>   //
>   val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
>   //df.printSchema
>   //
>val a = df.filter(col("Date") > "").map(p =>
> Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))
>
>
> However, if I put that case class with the main method, it throws "No
> TypeTag available for Accounts" error
>
> Apparently when the case class is defined inside the method where it is
> used, it is not fully defined at that point.
>
> Is this a bug within Spark?
>
> Thanks
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Dataset aggregateByKey equivalent

2016-04-23 Thread Michael Armbrust
Have you looked at aggregators?

https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
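
For the list-building use case above, a rough sketch of such an Aggregator against the 1.6 API (where the buffer and output encoders are resolved implicitly when toColumn is called; the interface on master has since changed) could look like this, assuming `import sqlContext.implicits._` is in scope:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class KeyVal(k: Int, v: Int)

// Collects every KeyVal of a group into a List, mirroring the RDD version's
// aggregateByKey(List[KeyVal]())(addToList, addLists).
object CollectKeyVals extends Aggregator[KeyVal, List[KeyVal], List[KeyVal]] {
  def zero: List[KeyVal] = Nil
  def reduce(b: List[KeyVal], kv: KeyVal): List[KeyVal] = kv :: b
  def merge(b1: List[KeyVal], b2: List[KeyVal]): List[KeyVal] = b1 ::: b2
  def finish(b: List[KeyVal]): List[KeyVal] = b
}

// kryo sidesteps the serialization issues hit with groupBy + mapGroups
implicit val listEncoder: Encoder[List[KeyVal]] = Encoders.kryo[List[KeyVal]]

val keyVals = Seq(KeyVal(1, 4), KeyVal(1, 5), KeyVal(2, 6)).toDS()
val nested = keyVals.groupBy(_.k).agg(CollectKeyVals.toColumn)
nested.collect().foreach(println)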

On Fri, Apr 22, 2016 at 6:45 PM, Lee Becker  wrote:

> Is there a way to do aggregateByKey on Datasets the way one can on an RDD?
>
> Consider the following RDD code to build a set of KeyVals into a DataFrame
> containing a column with the KeyVals' keys and a column containing lists of
> KeyVals.  The end goal is to join it with collections which which will be
> similarly transformed.
>
> case class KeyVal(k: Int, v: Int)
>
>
> val keyVals = sc.parallelize(for (i <- 1 to 3; j <- 4 to 6) yield KeyVal(i,j))
>
> // function for appending to list
> val addToList = (s: List[KeyVal], v: KeyVal) => s :+ v
>
> // function for merging two lists
> val addLists = (s: List[KeyVal], t: List[KeyVal]) => s++t
>
> val keyAndKeyVals = keyVals.map(kv=> (kv.k, kv))
> val keyAndNestedKeyVals = keyAndKeyVals.
>   aggregateByKey(List[KeyVal]())(addToList, addLists).
>   toDF("key", "keyvals")
> keyAndNestedKeyVals.show
>
>
> which produces:
>
> +---++
> |key| keyvals|
> +---++
> |  1|[[1,4], [1,5], [1...|
> |  2|[[2,4], [2,5], [2...|
> |  3|[[3,4], [3,5], [3...|
> +---++
>
> For a Dataset approach I tried the following to no avail:
>
> // Initialize as Dataset
> val keyVals = sc.parallelize(for (i <- 1 to 3; j <- 4 to 6) yield 
> KeyVal(i,j)).
>   toDS
>
> // Build key, keyVal mappings
> val keyValsByKey = keyVals.groupBy(kv=>(kv.k))
>
> case class NestedKeyVal(key: Int, keyvals: List[KeyVal])
>
> val convertToNested = (key: Int, keyValsIter: Iterator[KeyVal]) => 
> NestedKeyVal(key=key, keyvals=keyValsIter.toList)
>
> val keyValsNestedByKey = keyValsByKey.mapGroups((key,keyvals) => 
> convertToNested(key,keyvals))
> keyValsNestedByKey.show
>
>
> This and several other incantations using groupBy + mapGroups consistently
> gives me serialization problems.  Is this because the iterator can not be
> guaranteed across boundaries?
> Or is there some issue with what a Dataset can encode in the interim.
> What other ways might I approach this problem?
>
> Thanks,
> Lee
>
>


[jira] [Resolved] (SPARK-14678) Add a file sink log to support versioning and compaction

2016-04-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14678.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12435
[https://github.com/apache/spark/pull/12435]

> Add a file sink log to support versioning and compaction
> 
>
> Key: SPARK-14678
> URL: https://issues.apache.org/jira/browse/SPARK-14678
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> To use FileStreamSink in production, there are two requirements for 
> FileStreamSink's log:
> 1.Versioning. A future Spark version should be able to read the metadata of 
> an old FileStreamSink.
> 2. Compaction. As reading from many small files is usually pretty slow, we 
> should compact small metadata files into big files.
> See the PR description for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14767) Codegen "no constructor found" errors with Maps inside case classes in Datasets

2016-04-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14767:
-
Priority: Critical  (was: Major)

> Codegen "no constructor found" errors with Maps inside case classes in 
> Datasets
> ---
>
> Key: SPARK-14767
> URL: https://issues.apache.org/jira/browse/SPARK-14767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Priority: Critical
>
> When I have a `Map` inside a case class and am trying to use Datasets,
> the simplest operation throws an exception, because the generated code is 
> looking for a constructor with `scala.collection.Map` whereas the constructor 
> takes `scala.collection.immutable.Map`.
> To reproduce:
> {code}
> case class Bug(bug: Map[String, String])
> val ds = Seq(Bug(Map("name" -> "dummy"))).toDS()
> ds.map { b =>
>   b.bug.getOrElse("name", null)
> }.count()
> {code}
> Stacktrace:
> {code}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 163, Column 150: No applicable constructor/method 
> found for actual parameters "scala.collection.Map"; candidates are: 
> Bug(scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns

2016-04-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14741.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12517
[https://github.com/apache/spark/pull/12517]

> Streaming from partitioned directory structure captures unintended partition 
> columns
> 
>
> Key: SPARK-14741
> URL: https://issues.apache.org/jira/browse/SPARK-14741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text format streaming dataframe on {dir/col=X/}, then it 
> should not treat col as a partitioning column. Even though the streaming 
> dataframe does not do so, the generated batch dataframes pick up col as a 
> partitioning column, causing a mismatch between the streaming source schema 
> and the generated df schema. This leads to a runtime failure: 
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: 
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14555) Python API for methods introduced for Structured Streaming

2016-04-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14555.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12320
[https://github.com/apache/spark/pull/12320]

> Python API for methods introduced for Structured Streaming
> --
>
> Key: SPARK-14555
> URL: https://issues.apache.org/jira/browse/SPARK-14555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Streaming
>Reporter: Burak Yavuz
> Fix For: 2.0.0
>
>
> Methods added for Structured Streaming don't have a Python API yet.
> We need to provide APIs for the new methods in:
>  - DataFrameReader
>  - DataFrameWriter
>  - ContinuousQuery
>  - Trigger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13929) Use Scala reflection for UDFs

2016-04-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13929.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12149
[https://github.com/apache/spark/pull/12149]

> Use Scala reflection for UDFs
> -
>
> Key: SPARK-13929
> URL: https://issues.apache.org/jira/browse/SPARK-13929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jakob Odersky
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{ScalaReflection}} uses native Java reflection for User Defined Types which 
> would fail if such types are not plain Scala classes that map 1:1 to Java.
> Consider the following extract (from here 
> https://github.com/apache/spark/blob/92024797a4fad594b5314f3f3be5c6be2434de8a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L376
>  ):
> {code}
> case t if Utils.classIsLoadable(className) &&
> Utils.classForName(className).isAnnotationPresent(classOf[SQLUserDefinedType])
>  =>
> val udt = 
> Utils.classForName(className).getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance()
> //...
> {code}
> If {{t}}'s runtime class is actually synthetic (something that doesn't exist 
> in Java and hence uses a dollar sign internally), such as nested classes or 
> package objects, the above code will fail.
> Currently there are no known use-cases of synthetic user-defined types (hence 
> the minor priority), however it would be best practice to remove plain Java 
> reflection and rely on Scala reflection instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: prefix column Spark

2016-04-19 Thread Michael Armbrust
A few comments:
 - Each withColumnRenamed is adding a new level to the logical plan.  We
have optimized this significantly in newer versions of Spark, but it is
still not free.
 - Transforming to an RDD is going to do fairly expensive conversion back
and forth between the internal binary format.
 - Probably the best way to accomplish this is to build up all the new
columns you want and pass them to a single select call (see the sketch
below).
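
A minimal sketch of that single-select approach, using only the public DataFrame API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// one select with every renamed column, instead of ~800 chained
// withColumnRenamed calls, each of which adds a layer to the logical plan
def prefixDf(df: DataFrame, prefix: String): DataFrame =
  df.select(df.columns.map(c => col(c).alias(s"${prefix}_${c}")): _*)

So prefixDf(myDf, "left") returns the same data with every column renamed in a single plan node.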


On Tue, Apr 19, 2016 at 3:04 AM, nihed mbarek  wrote:

> Hi
> thank you, it's the first solution and it took a long time to manage all
> my fields
>
> Regards,
>
> On Tue, Apr 19, 2016 at 11:29 AM, Ndjido Ardo BAR 
> wrote:
>
>>
>> This can help:
>>
>> import org.apache.spark.sql.DataFrame
>>
>> def prefixDf(dataFrame: DataFrame, prefix: String): DataFrame = {
>>   val colNames = dataFrame.columns
>>   colNames.foldLeft(dataFrame){
>> (df, colName) => {
>>   df.withColumnRenamed(colName, s"${prefix}_${colName}")
>> }
>> }
>> }
>>
>> cheers,
>> Ardo
>>
>>
>> On Tue, Apr 19, 2016 at 10:53 AM, nihed mbarek  wrote:
>>
>>> Hi,
>>>
>>> I want to prefix a set of dataframes and I try two solutions:
>>> * A for loop calling withColumnRename based on columns()
>>> * transforming my Dataframe to and RDD, updating the old schema and
>>> recreating the dataframe.
>>>
>>>
>>> both are working for me, the second one is faster with tables that
>>> contain 800 columns but have a more stage of transformation toRDD.
>>>
>>> Is there any other solution?
>>>
>>> Thank you
>>>
>>> --
>>>
>>> M'BAREK Med Nihed,
>>> Fedora Ambassador, TUNISIA, Northern Africa
>>> http://www.nihed.com
>>>
>>> 
>>>
>>>
>>
>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>


Re: Will nested field performance improve?

2016-04-15 Thread Michael Armbrust
>
> If we expect fields nested in structs to always be much slower than flat
> fields, then I would be keen to address that in our ETL pipeline with a
> flattening step. If it's a known issue that we expect will be fixed in
> upcoming releases, I'll hold off.
>

The difference might be even larger in Spark 2.0 (because we really
optimize the simple case).  However, I would expect this to go away when we
fully columnarize the execution engine.  That could take a while though.


Re: Skipping Type Conversion and using InternalRows for UDF

2016-04-15 Thread Michael Armbrust
This would also probably improve performance:
https://github.com/apache/spark/pull/9565

On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari 
wrote:

> Hi all,
>
> So we have these UDFs which take <1ms to operate and we're seeing pretty
> poor performance around them in practice, the overhead being >10ms for the
> projections (this data is deeply nested with ArrayTypes and MapTypes so
> that could be the cause). Looking at the logs and code for ScalaUDF, I
> noticed that there are a series of projections which take place before and
> after in order to make the Rows safe and then unsafe again. Is there any
> way to opt out of this and input/return InternalRows to skip the
> performance hit of the type conversion? It doesn't immediately appear to be
> possible but I'd like to make sure that I'm not missing anything.
>
> I suspect we could make this possible by checking if typetags in the
> register function are all internal types, if they are, passing a false
> value for "needs[Input|Output]Conversion" to ScalaUDF and then in ScalaUDF
> checking for that flag to figure out if the conversion process needs to
> take place. We're still left with the issue of missing a schema in the case
> of outputting InternalRows, but we could expose the DataType parameter
> rather than inferring it in the register function. Is there anything else
> in the code that would prevent this from working?
>
> Regards,
> Hamel
>


[jira] [Updated] (SPARK-14648) Spark EC2 script creates cluster but spark is not installed properly.

2016-04-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14648:
-
Assignee: Josh Rosen

> Spark EC2 script creates cluster but spark is not installed properly.
> -
>
> Key: SPARK-14648
> URL: https://issues.apache.org/jira/browse/SPARK-14648
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: Spark 1.6.1
>Reporter: Nikhil
>Assignee: Josh Rosen
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Launched EC2 cluster using this code
> ./spark-ec2 --key-pair=spark --identity-file=spark.pem --region=us-east-1 
> --instance-type=c3.large -s 1 --copy-aws-credentials launch test-cluster
> Spark cluster created and running.
> After login, spark folder does not contain anything.
> ---
> When the script ran, here's where the problem was I believe. 
> Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’
> 100%[>]
>  277,258,240 55.8MB/s   in 4.4s   
> 2016-04-14 23:31:49 (59.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved 
> [277258240/277258240]
> Unpacking Spark
> gzip: stdin: not in gzip format
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> [timing] spark init:  00h 00m 06s



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Strange bug: Filter problem with parenthesis

2016-04-13 Thread Michael Armbrust
You need to use `backticks` to reference columns that have non-standard
characters.
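
For example (the groupBy below is just an assumption about how the aggregate was produced; the point is the backticks inside the filter string):

import org.apache.spark.sql.functions.sum

// the aggregate's default output column is literally named "sum(OpenAccounts)"
val aggregated = df.groupBy("AccountId").agg(sum("OpenAccounts"))

// backticks stop the parser from treating the name as a function call
aggregated.filter("`sum(OpenAccounts)` > 5").show()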

On Wed, Apr 13, 2016 at 6:56 AM,  wrote:

> Hi,
>
> I am debugging a program, and for some reason, a line calling the
> following is failing:
>
> df.filter("sum(OpenAccounts) > 5").show
>
> It says it cannot find the column OpenAccounts, as if it were applying
> the sum() function and looking for a column with that name, which does not
> exist. This works fine if I rename the column to something without
> parentheses.
>
> I can’t reproduce this issue in Spark Shell (1.6.0), any ideas on how can
> I analyze this? This is an aggregation result, with the default column
> names afterwards.
>
> PS: Workaround is to use toDF(cols) and rename all columns, but I am
> wondering if toDF has any impact on the RDD structure behind (e.g.
> repartitioning, cache, etc)
>
> Appreciated,
> Saif
>
>


Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You don't need multiple contexts to do this:
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
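
A minimal sketch of mixing the two in one HiveContext (the JDBC URL, table names, credentials, and join key below are placeholders):

// Hive table through the metastore
val hiveDF = hiveContext.table("some_db.some_table")

// external RDBMS through the built-in jdbc data source
// (the vendor's JDBC driver jar must be on the classpath)
val jdbcDF = hiveContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SCOTT.SOME_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

val joined = hiveDF.join(jdbcDF, "id")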

On Tue, Apr 12, 2016 at 4:05 PM, Michael Segel 
wrote:

> Reading from multiple sources within the same application?
>
> How would you connect to Hive for some data and then reach out to lets say
> Oracle or DB2 for some other data that you may want but isn’t available on
> your cluster?
>
>
> On Apr 12, 2016, at 10:52 AM, Michael Armbrust 
> wrote:
>
> You can, but I'm not sure why you would want to.  If you want to isolate
> different users just use hiveContext.newSession().
>
> On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande 
> wrote:
>
>> Hi,
>>
>> Is it possible to have both a sqlContext and a hiveContext in the same
>> application ?
>>
> If yes, would there be any performance penalties in doing so.
>>
>> Regards,
>> Natu
>>
>
>
>


[jira] [Updated] (SPARK-13753) Column nullable is derived incorrectly

2016-04-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13753:
-
Description: 
There is a problem in spark sql to derive nullable column and used in 
optimization incorrectly. In following query:
{code}
select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0]
  from (
select explode(map(a.frontend[0], 
ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
",action:", COALESCE(action, "null")), ".p50"),
 a.frontend[1], ARRAY(concat("metric:frontend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.backend[0], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p50"),
 a.backend[1], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.render[0], ARRAY(concat("metric:render", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p50"),
 a.render[1], ARRAY(concat("metric:render", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.page_load_time[0], 
ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p50"),
 a.page_load_time[1], 
ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p90"),
 a.total_load_time[0], 
ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p50"),
 a.total_load_time[1], 
ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
from (
  select  data.controller as controller, data.action as action,
  percentile(data.frontend, array(0.5, 0.9)) as 
frontend,
  percentile(data.backend, array(0.5, 0.9)) as backend,
  percentile(data.render, array(0.5, 0.9)) as render,
  percentile(data.page_load_time, array(0.5, 0.9)) as 
page_load_time,
  percentile(data.total_load_time, array(0.5, 0.9)) as 
total_load_time
  from air_events_rt
  where type='air_events' and data.event_name='pageload'
  group by data.controller, data.action
) a
  ) b
  where b.value is not null
{code}
b.value is incorrectly derived as not nullable.  "b.value is not null" 
predicate will be ignored by optimizer which cause the query return incorrect 
result. 


  was:
There is a problem in spark sql to derive nullable column and used in 
optimization incorrectly. In following query:
{code}
  select concat("perf.realtime.web", b.tags[1]) as metric, b.value, 
b.tags[0]
  from (
select explode(map(a.frontend[0], 
ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
",action:", COALESCE(action, "null")), ".p50"),
 a.frontend[1], ARRAY(concat("metric:frontend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.backend[0], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p50"),
 a.backend[1], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALES

[jira] [Updated] (SPARK-13753) Column nullable is derived incorrectly

2016-04-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13753:
-
Description: 
There is a problem in spark sql to derive nullable column and used in 
optimization incorrectly. In following query:
{code}
  select concat("perf.realtime.web", b.tags[1]) as metric, b.value, 
b.tags[0]
  from (
select explode(map(a.frontend[0], 
ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
",action:", COALESCE(action, "null")), ".p50"),
 a.frontend[1], ARRAY(concat("metric:frontend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.backend[0], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p50"),
 a.backend[1], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.render[0], ARRAY(concat("metric:render", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p50"),
 a.render[1], ARRAY(concat("metric:render", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.page_load_time[0], 
ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p50"),
 a.page_load_time[1], 
ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p90"),
 a.total_load_time[0], 
ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p50"),
 a.total_load_time[1], 
ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
"null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
from (
  select  data.controller as controller, data.action as action,
  percentile(data.frontend, array(0.5, 0.9)) as 
frontend,
  percentile(data.backend, array(0.5, 0.9)) as backend,
  percentile(data.render, array(0.5, 0.9)) as render,
  percentile(data.page_load_time, array(0.5, 0.9)) as 
page_load_time,
  percentile(data.total_load_time, array(0.5, 0.9)) as 
total_load_time
  from air_events_rt
  where type='air_events' and data.event_name='pageload'
  group by data.controller, data.action
) a
  ) b
  where b.value is not null
{code}
b.value is incorrectly derived as not nullable.  "b.value is not null" 
predicate will be ignored by optimizer which cause the query return incorrect 
result. 


  was:
There is a problem in spark sql to derive nullable column and used in 
optimization incorrectly. In following query:
  select concat("perf.realtime.web", b.tags[1]) as metric, b.value, 
b.tags[0]
  from (
select explode(map(a.frontend[0], 
ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
",action:", COALESCE(action, "null")), ".p50"),
 a.frontend[1], ARRAY(concat("metric:frontend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p90"),
 a.backend[0], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
"null")), ".p50"),
 a.backend[1], ARRAY(concat("metric:backend", 
",controller:", COALESCE(controller, "null"), ",action:&q

[jira] [Updated] (SPARK-13753) Column nullable is derived incorrectly

2016-04-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13753:
-
Target Version/s: 2.0.0
Priority: Critical  (was: Major)

> Column nullable is derived incorrectly
> --
>
> Key: SPARK-13753
> URL: https://issues.apache.org/jira/browse/SPARK-13753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jingwei Lu
>Priority: Critical
>
> There is a problem in spark sql to derive nullable column and used in 
> optimization incorrectly. In following query:
> {code}
> select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0]
>   from (
> select explode(map(a.frontend[0], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p50"),
>  a.frontend[1], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p90"),
>  a.backend[0], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.backend[1], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.render[0], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.render[1], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.page_load_time[0], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.page_load_time[1], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"),
>  a.total_load_time[0], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.total_load_time[1], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
> from (
>   select  data.controller as controller, data.action as 
> action,
>   percentile(data.frontend, array(0.5, 0.9)) as 
> frontend,
>   percentile(data.backend, array(0.5, 0.9)) as 
> backend,
>   percentile(data.render, array(0.5, 0.9)) as render,
>   percentile(data.page_load_time, array(0.5, 0.9)) as 
> page_load_time,
>   percentile(data.total_load_time, array(0.5, 0.9)) 
> as total_load_time
>   from air_events_rt
>   where type='air_events' and data.event_name='pageload'
>   group by data.controller, data.action
> ) a
>   ) b
>   where b.value is not null
> {code}
> b.value is incorrectly derived as not nullable.  "b.value is not null" 
> predicate will be ignored by optimizer which cause the query return incorrect 
> result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Aggregator support in DataFrame

2016-04-12 Thread Michael Armbrust
Did you see these?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scala/typed.scala#L70
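
For reference, a rough sketch of how those typed aggregates address the "which fields does it operate on" concern: each one takes a lambda that selects the field, so the aggregate never has to unpack a Row itself. This is written against the master (2.0-era) API from the linked file, assuming the usual implicits import; the package name is taken from that file's path and may differ in a released version.

import org.apache.spark.sql.expressions.scala.typed

case class Rec(k: String, v1: Double, v2: Double)

val ds = Seq(Rec("a", 1.0, 2.0), Rec("a", 3.0, 4.0), Rec("b", 5.0, 6.0)).toDS()

// the lambdas play the role of the hypothetical .on("v1") / .on("v2")
val agged = ds.groupByKey(_.k).agg(typed.sum[Rec](_.v1), typed.avg[Rec](_.v2))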

On Tue, Apr 12, 2016 at 9:46 AM, Koert Kuipers  wrote:

> i dont really see how Aggregator can be useful for DataFrame unless you
> can specify what columns it works on. Having to code Aggregators to always
> use Row and then extract the values yourself breaks the abstraction and
> makes it not much better than UserDefinedAggregateFunction (well... maybe
> still better because i have encoders so i can use kryo).
>
> On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers  wrote:
>
>> saw that, dont think it solves it. i basically want to add some children
>> to the expression i guess, to indicate what i am operating on? not sure if
>> even makes sense
>>
>> On Mon, Apr 11, 2016 at 8:04 PM, Michael Armbrust > > wrote:
>>
>>> I'll note this interface has changed recently:
>>> https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d
>>>
>>> I'm not sure that solves your problem though...
>>>
>>> On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers 
>>> wrote:
>>>
>>>> i like the Aggregator a lot
>>>> (org.apache.spark.sql.expressions.Aggregator), but i find the way to use it
>>>> somewhat confusing. I am supposed to simply call aggregator.toColumn, but
>>>> that doesn't allow me to specify which fields it operates on in a 
>>>> DataFrame.
>>>>
>>>> i would basically like to do something like
>>>> dataFrame
>>>>   .groupBy("k")
>>>>   .agg(
>>>> myAggregator.on("v1", "v2").toColumn,
>>>> myOtherAggregator.on("v3", "v4").toColumn
>>>>   )
>>>>
>>>
>>>
>>
>


Re: ordering over structs

2016-04-12 Thread Michael Armbrust
Does the data actually fit in memory?  Check the web UI.  If it doesn't,
caching is not going to help you.

On Tue, Apr 12, 2016 at 9:00 AM, Imran Akbar  wrote:

> thanks Michael,
>
> That worked.
> But what's puzzling is if I take the exact same code and run it off a temp
> table created from parquet, vs. a cached table - it runs much slower.  5-10
> seconds uncached vs. 47-60 seconds cached.
>
> Any ideas why?
>
> Here's my code snippet:
> df = data.select("customer_id", struct('dt', 'product').alias("vs"))\
>   .groupBy("customer_id")\
>   .agg(min("vs").alias("final"))\
>   .select("customer_id", "final.dt", "final.product")
> df.head()
>
> My log from the non-cached run:
> http://pastebin.com/F88sSv1B
>
> Log from the cached run:
> http://pastebin.com/Pmmfea3d
>
> thanks,
> imran
>
> On Fri, Apr 8, 2016 at 12:33 PM, Michael Armbrust 
> wrote:
>
>> You need to use the struct function
>> <https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.functions.struct>
>> (which creates an actual struct), you are trying to use the struct datatype
>> (which just represents the schema of a struct).
>>
>> On Thu, Apr 7, 2016 at 3:48 PM, Imran Akbar  wrote:
>>
>>> thanks Michael,
>>>
>>>
>>> I'm trying to implement the code in pyspark like so (where my dataframe
>>> has 3 columns - customer_id, dt, and product):
>>>
>>> st = StructType().add("dt", DateType(), True).add("product",
>>> StringType(), True)
>>>
>>> top = data.select("customer_id", st.alias('vs'))
>>>   .groupBy("customer_id")
>>>   .agg(max("dt").alias("vs"))
>>>   .select("customer_id", "vs.dt", "vs.product")
>>>
>>> But I get an error saying:
>>>
>>> AttributeError: 'StructType' object has no attribute 'alias'
>>>
>>> Can I do this without aliasing the struct?  Or am I doing something
>>> incorrectly?
>>>
>>>
>>> regards,
>>>
>>> imran
>>>
>>> On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust >> > wrote:
>>>
>>>> Ordering for a struct goes in order of the fields.  So the max struct
>>>>> is the one with the highest TotalValue (and then the highest category
>>>>> if there are multiple entries with the same hour and total value).
>>>>>
>>>>> Is this due to "InterpretedOrdering" in StructType?
>>>>>
>>>>
>>>> That is one implementation, but the code generated ordering also
>>>> follows the same contract.
>>>>
>>>>
>>>>
>>>>>  4)  Is it faster doing it this way than doing a join or window
>>>>> function in Spark SQL?
>>>>>
>>>>> Way faster.  This is a very efficient way to calculate argmax.
>>>>>
>>>>> Can you explain how this is way faster than window function? I can
>>>>> understand join doesn't make sense in this case. But to calculate the
>>>>> grouping max, you just have to shuffle the data by grouping keys. You 
>>>>> maybe
>>>>> can do a combiner on the mapper side before shuffling, but that is it. Do
>>>>> you mean windowing function in Spark SQL won't do any map side combiner,
>>>>> even it is for max?
>>>>>
>>>>
>>>> Windowing can't do partial aggregation and will have to collect all the
>>>> data for a group so that it can be sorted before applying the function.  In
>>>> contrast a max aggregation will do partial aggregation (map side combining)
>>>> and can be calculated in a streaming fashion.
>>>>
>>>> Also, aggregation is more common and thus has seen more optimization
>>>> beyond the theoretical limits described above.
>>>>
>>>>
>>>
>>
>


[jira] [Resolved] (SPARK-14474) Move FileSource offset log into checkpointLocation

2016-04-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14474.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12247
[https://github.com/apache/spark/pull/12247]

> Move FileSource offset log into checkpointLocation
> --
>
> Key: SPARK-14474
> URL: https://issues.apache.org/jira/browse/SPARK-14474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Now that we have a single location for storing checkpointed state, propagate 
> this information into the source so that we don't have one random log off on 
> its own.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You can, but I'm not sure why you would want to.  If you want to isolate
different users just use hiveContext.newSession().

On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande 
wrote:

> Hi,
>
> Is it possible to have both a sqlContext and a hiveContext in the same
> application ?
>
> If yes, would there be any performance penalties in doing so.
>
> Regards,
> Natu
>


Re: Aggregator support in DataFrame

2016-04-11 Thread Michael Armbrust
I'll note this interface has changed recently:
https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d

I'm not sure that solves your problem though...

On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers  wrote:

> i like the Aggregator a lot (org.apache.spark.sql.expressions.Aggregator),
> but i find the way to use it somewhat confusing. I am supposed to simply
> call aggregator.toColumn, but that doesn't allow me to specify which fields
> it operates on in a DataFrame.
>
> i would basically like to do something like
> dataFrame
>   .groupBy("k")
>   .agg(
> myAggregator.on("v1", "v2").toColumn,
> myOtherAggregator.on("v3", "v4").toColumn
>   )
>


[jira] [Resolved] (SPARK-14494) Fix the race conditions in MemoryStream and MemorySink

2016-04-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14494.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12261
[https://github.com/apache/spark/pull/12261]

> Fix the race conditions in MemoryStream and MemorySink
> --
>
> Key: SPARK-14494
> URL: https://issues.apache.org/jira/browse/SPARK-14494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Make sure accessing mutable variables in MemoryStream and MemorySink are 
> protected by `synchronized`.
> This is probably why MemorySinkSuite failed here: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/650/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: ordering over structs

2016-04-08 Thread Michael Armbrust
You need to use the struct function
<https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.functions.struct>
(which creates an actual struct), you are trying to use the struct datatype
(which just represents the schema of a struct).
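
For reference, a sketch of the working pattern written against the Scala API (in PySpark the equivalent is pyspark.sql.functions.struct); the column names are the ones from this thread:

import org.apache.spark.sql.functions.{col, max, struct}

// struct(...) the function builds a real struct column out of existing
// columns; StructType only describes a schema, which is why it has no alias.
val top = data
  .select(col("customer_id"), struct(col("dt"), col("product")).alias("vs"))
  .groupBy(col("customer_id"))
  .agg(max(col("vs")).alias("latest"))
  .select(col("customer_id"), col("latest.dt"), col("latest.product"))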

On Thu, Apr 7, 2016 at 3:48 PM, Imran Akbar  wrote:

> thanks Michael,
>
>
> I'm trying to implement the code in pyspark like so (where my dataframe
> has 3 columns - customer_id, dt, and product):
>
> st = StructType().add("dt", DateType(), True).add("product", StringType(),
> True)
>
> top = data.select("customer_id", st.alias('vs'))
>   .groupBy("customer_id")
>   .agg(max("dt").alias("vs"))
>   .select("customer_id", "vs.dt", "vs.product")
>
> But I get an error saying:
>
> AttributeError: 'StructType' object has no attribute 'alias'
>
> Can I do this without aliasing the struct?  Or am I doing something
> incorrectly?
>
>
> regards,
>
> imran
>
> On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust 
> wrote:
>
>> Ordering for a struct goes in order of the fields.  So the max struct is
>>> the one with the highest TotalValue (and then the highest category
>>>   if there are multiple entries with the same hour and total value).
>>>
>>> Is this due to "InterpretedOrdering" in StructType?
>>>
>>
>> That is one implementation, but the code generated ordering also follows
>> the same contract.
>>
>>
>>
>>>  4)  Is it faster doing it this way than doing a join or window function
>>> in Spark SQL?
>>>
>>> Way faster.  This is a very efficient way to calculate argmax.
>>>
>>> Can you explain how this is way faster than window function? I can
>>> understand join doesn't make sense in this case. But to calculate the
>>> grouping max, you just have to shuffle the data by grouping keys. You maybe
>>> can do a combiner on the mapper side before shuffling, but that is it. Do
>>> you mean windowing function in Spark SQL won't do any map side combiner,
>>> even it is for max?
>>>
>>
>> Windowing can't do partial aggregation and will have to collect all the
>> data for a group so that it can be sorted before applying the function.  In
>> contrast a max aggregation will do partial aggregation (map side combining)
>> and can be calculated in a streaming fashion.
>>
>> Also, aggregation is more common and thus has seen more optimization
>> beyond the theoretical limits described above.
>>
>>
>


[jira] [Created] (SPARK-14463) read.text broken for partitioned tables

2016-04-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14463:


 Summary: read.text broken for partitioned tables
 Key: SPARK-14463
 URL: https://issues.apache.org/jira/browse/SPARK-14463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical


Strongly typing the return values of {{read.text}} as {{Dataset\[String]}} 
breaks when trying to load a partitioned table (or any table where the path 
looks partitioned)

{code}
Seq((1, "test"))
  .toDF("a", "b")
  .write
  .format("text")
  .partitionBy("a")
  .save("/home/michael/text-part-bug")

sqlContext.read.text("/home/michael/text-part-bug")
{code}

{code}
org.apache.spark.sql.AnalysisException: Try to map struct 
to Tuple1, but failed as the number of fields does not line up.
 - Input schema: struct
 - Target schema: struct;
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:265)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:279)
at org.apache.spark.sql.Dataset.(Dataset.scala:197)
at org.apache.spark.sql.Dataset.(Dataset.scala:168)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
at org.apache.spark.sql.Dataset.as(Dataset.scala:357)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:450)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14463) read.text broken for partitioned tables

2016-04-07 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230946#comment-15230946
 ] 

Michael Armbrust commented on SPARK-14463:
--

[~rxin]

> read.text broken for partitioned tables
> ---
>
> Key: SPARK-14463
> URL: https://issues.apache.org/jira/browse/SPARK-14463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Michael Armbrust
>Priority: Critical
>
> Strongly typing the return values of {{read.text}} as {{Dataset\[String]}} 
> breaks when trying to load a partitioned table (or any table where the path 
> looks partitioned)
> {code}
> Seq((1, "test"))
>   .toDF("a", "b")
>   .write
>   .format("text")
>   .partitionBy("a")
>   .save("/home/michael/text-part-bug")
> sqlContext.read.text("/home/michael/text-part-bug")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: Try to map struct 
> to Tuple1, but failed as the number of fields does not line up.
>  - Input schema: struct
>  - Target schema: struct;
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:265)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:279)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:197)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:168)
>   at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
>   at org.apache.spark.sql.Dataset.as(Dataset.scala:357)
>   at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:450)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14456) Remove unused variables and logics in DataSource

2016-04-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14456.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12237
[https://github.com/apache/spark/pull/12237]

> Remove unused variables and logics in DataSource
> 
>
> Key: SPARK-14456
> URL: https://issues.apache.org/jira/browse/SPARK-14456
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
> Fix For: 2.0.0
>
>
> In DataSource#write method, the variables `dataSchema` and `equality`, and 
> related logics are no longer used. Let's remove them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14449) SparkContext should use SparkListenerInterface

2016-04-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14449:


 Summary: SparkContext should use SparkListenerInterface
 Key: SPARK-14449
 URL: https://issues.apache.org/jira/browse/SPARK-14449
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: ordering over structs

2016-04-06 Thread Michael Armbrust
>
> Ordering for a struct goes in order of the fields.  So the max struct is
> the one with the highest TotalValue (and then the highest category
>   if there are multiple entries with the same hour and total value).
>
> Is this due to "InterpretedOrdering" in StructType?
>

That is one implementation, but the code generated ordering also follows
the same contract.



>  4)  Is it faster doing it this way than doing a join or window function
> in Spark SQL?
>
> Way faster.  This is a very efficient way to calculate argmax.
>
> Can you explain how this is way faster than window function? I can
> understand join doesn't make sense in this case. But to calculate the
> grouping max, you just have to shuffle the data by grouping keys. You maybe
> can do a combiner on the mapper side before shuffling, but that is it. Do
> you mean windowing function in Spark SQL won't do any map side combiner,
> even it is for max?
>

Windowing can't do partial aggregation and will have to collect all the
data for a group so that it can be sorted before applying the function.  In
contrast a max aggregation will do partial aggregation (map side combining)
and can be calculated in a streaming fashion.

Also, aggregation is more common and thus has seen more optimization beyond
the theoretical limits described above.


Re: ordering over structs

2016-04-06 Thread Michael Armbrust
>
> 1)  Is a struct in Spark like a struct in C++?
>

Kinda.  It's an ordered collection of data with known names/types.


> 2)  What is an alias in this context?
>

It assigns a name to the column, similar to doing AS in SQL.


> 3)  How does this code even work?
>

Ordering for a struct goes in order of the fields.  So the max struct is
the one with the highest TotalValue (and then the highest category if there
are multiple entries with the same hour and total value).


> 4)  Is it faster doing it this way than doing a join or window function in
> Spark SQL?
>

Way faster.  This is a very efficient way to calculate argmax.
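
Concretely, the argmax pattern being described looks roughly like this (hour, TotalValue, and category are the column names from this thread):

import org.apache.spark.sql.functions.{col, max, struct}

// structs order field by field, so max picks the row with the highest
// TotalValue per hour and carries its category along with it
val argmax = df
  .groupBy(col("hour"))
  .agg(max(struct(col("TotalValue"), col("category"))).alias("top"))
  .select(col("hour"), col("top.TotalValue"), col("top.category"))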


Re: Using an Option[some primitive type] in Spark Dataset API

2016-04-06 Thread Michael Armbrust
> We only define implicits for a subset of the types we support in
> SQLImplicits. We should probably consider adding Option[T] for common T as
> the internal infrastructure does understand Option. You can work around
> this by either creating a case class, using a Tuple, or constructing the
> required implicit yourself (though this is using an internal API so may
> break in future releases).


Answered.
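
Two quick sketches of those workarounds (assuming import sqlContext.implicits._ is in scope):

// 1) wrap the Option in a case class: the Product encoder handles Option fields
case class MaybeInt(value: Option[Int])
val ds1 = Seq(MaybeInt(Some(1)), MaybeInt(None)).toDS()

// 2) or lean on a tuple for the same effect
val ds2 = Seq((1, Option(10)), (2, Option.empty[Int])).toDS()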

On Wed, Apr 6, 2016 at 4:42 AM, Razvan Panda  wrote:

> Copy paste from SO question: http://stackoverflow.com/q/36449368/750216
>
> "Is it possible to use Option[_] member in a case class used with Dataset
> API? eg. Option[Int]
>
> I tried to find an example but could not find any yet. This can probably
> be done with with a custom encoder (mapping?) but I could not find an
> example for that yet.
>
> This might be achievable using Frameless library:
> https://github.com/adelbertc/frameless but there should be an easy way to
> get it done with base Spark libraries.
>
> *Update*
>
> I am using: "org.apache.spark" %% "spark-core" % "1.6.1"
>
> And getting the following error when trying to use an Option[Int]:
>
> Unable to find encoder for type stored in a Dataset. Primitive types (Int,
> String, etc) and Product types (case classes) are supported by importing
> sqlContext.implicits._ Support for serializing other types will be added in
> future releases
>
> "
>
> Thank you
>
> --
> Razvan Panda
> .NET Developer | WhatClinic.com
> T: +353 1 6250520|
>  rpa...@whatclinic.com
>
> Global Medical Treatment Ltd, Trading as WhatClinic.com, 12 Duke Lane
> Upper, Dublin 2, Ireland. Company Registration Number: 428122
>
> ***Email Confidentiality & Disclaimer Notice - This email and any
> attachments is to be treated as confidential. If received in error, any
> use, dissemination, distribution, publication or copying of the information
> contained in this email or attachments is strictly prohibited. We cannot
> guarantee this email or any attachment is virus free and has not been
> intercepted and/or amended. The recipient should check this email and any
> attachments for the presence of viruses***
>


Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-04-06 Thread Michael Armbrust
>
> Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text]


Use toDS() and you can skip the .as[Text]


> Sure! It works with map, but not with select. Wonder if it's by design
> or...will soon be fixed? Thanks again for your help.


This is by design.  select is relational and works with column expressions.
 map is functional and works with lambda functions.

scala> ds.select('id.as[Int], 'text.as[String]).show
> +---+-+
> | _1|   _2|
> +---+-+
> |  0|hello|
> |  1|world|
> +---+-+
>

This is trickier.  In general we try to ensure that the type signatures for
typed functions always match the schema.  The typed version of select with
two arguments returns a tuple (and has to because the scala compiler does
not know the names of the columns you are specifying), so the schema should
be _1, _2.  If we propagated the names then  datasets with the same type
signature in scala would have different schema, which I think would be
confusing.
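
To make the contrast concrete, a small sketch (assuming import sqlContext.implicits._ and the Text class from above):

val ds = Seq(Text(0, "hello"), Text(1, "world")).toDS()

// relational: column expressions in, Dataset[(Int, String)] out, schema _1/_2
val selected = ds.select('id.as[Int], 'text.as[String])

// functional: a lambda over the whole object, so you pick the result type
val mapped = ds.map(t => (t.id, t.text))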


[jira] [Resolved] (SPARK-14411) Add a note to warn that onQueryProgress is asynchronous

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14411.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12180
[https://github.com/apache/spark/pull/12180]

> Add a note to warn that onQueryProgress is asynchronous
> ---
>
> Key: SPARK-14411
> URL: https://issues.apache.org/jira/browse/SPARK-14411
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



<    4   5   6   7   8   9   10   11   12   13   >