[jira] [Resolved] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14402.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12175
[https://github.com/apache/spark/pull/12175]

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected implementation of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
The following should ensure partition pruning happens:

df.write.partitionBy("country").save("/path/to/data")
sqlContext.read.load("/path/to/data").where("country = 'UK'")
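
To verify, print the physical plan and check the stage's input size in the
UI afterwards (a sketch; the paths are the placeholders from above):

sqlContext.read.load("/path/to/data").where("country = 'UK'").explain()
// With pruning in place, the runtime input size of the scan should cover
// roughly one partition's worth of data rather than the whole dataset.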

On Tue, Apr 5, 2016 at 1:13 PM, Darshan Singh 
wrote:

> Thanks for the reply.
>
> Now I saved part_movies as a parquet file.
>
> Then I created a new dataframe from the saved parquet file and did not
> persist it. Then I ran the same query. It still read all 20 partitions, and
> this time from hdfs.
>
> So in what exact scenario will it prune partitions? I am a bit confused
> now. Isn't there a way to see the exact partition pruning?
>
> Thanks
>
> On Tue, Apr 5, 2016 at 8:59 PM, Michael Armbrust 
> wrote:
>
>> For the in-memory cache, we still launch tasks, we just skip blocks when
>> possible using statistics about those blocks.
>>
>> On Tue, Apr 5, 2016 at 12:14 PM, Darshan Singh 
>> wrote:
>>
>>> Thanks. It is not my exact scenario but I have tried to reproduce it. I
>>> have used 1.5.2.
>>>
>>> I have a part-movies data-frame which has 20 partitions, 1 for each
>>> movie.
>>>
>>> I created the following query:
>>>
>>>
>>> val part_sql = sqlContext.sql("select * from part_movies where movie =
>>> 10")
>>> part_sql.count()
>>>
>>> I expect that this should read from just 1 partition, i.e. partition 10.
>>> For the other partitions it should at most read metadata and not the data.
>>>
>>> Here is the physical plan. I can see the filter. From here I cannot say
>>> whether this filter is causing any partition pruning. If pruning is
>>> actually happening I would like to see an operator which indicates it.
>>>
>>> == Physical Plan ==
>>> TungstenAggregate(key=[], 
>>> functions=[(count(1),mode=Final,isDistinct=false)], output=[count#75L])
>>>  TungstenExchange SinglePartition
>>>   TungstenAggregate(key=[], 
>>> functions=[(count(1),mode=Partial,isDistinct=false)], 
>>> output=[currentCount#93L])
>>>Project
>>> Filter (movie#33 = 10)
>>>  InMemoryColumnarTableScan [movie#33], [(movie#33 = 10)], 
>>> (InMemoryRelation [movie#33,title#34,genres#35], true, 1, 
>>> StorageLevel(true, true, false, true, 1), (Scan 
>>> PhysicalRDD[movie#33,title#34,genres#35]), None)
>>>
>>>
>>> However, my assumption that partitions are not pruned is based not on the
>>> above plan but on the job and its stages: I can see that it has read the
>>> full data of the dataframe. I should see around 65KB, as that is roughly
>>> the average size of each partition.
>>>
>>> Aggregated Metrics by Executor
>>> Executor ID Address Task Time Total Tasks Failed Tasks Succeeded Tasks Input
>>> Size / Records Shuffle Write Size / Records
>>> driver localhost:53247 0.4 s 20 0 20 1289.0 KB / 20 840.0 B / 20
>>>
>>>
>>> Task details, first 7 only. Here I expect that, except for 1 task (which
>>> accesses the partition's data), all others should read either 0 KB or just
>>> the size of the metadata, after which that partition is discarded as its
>>> data is not needed. But I can see that all the partitions are read.
>>>
>>> This is a small example so it doesn't make a difference, but for a large
>>> dataframe reading all the data, even in memory, takes time.
>>>
>>> Tasks
>>>
>>> 0 27 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 39
>>> ms 12 ms 9 ms 0 ms 0 ms 0.0 B 66.2 KB (memory) / 1 42.0 B / 1
>>> 1 28 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 41
>>> ms 9 ms 7 ms 0 ms 0 ms 0.0 B 63.9 KB (memory) / 1 1 ms 42.0 B / 1
>>> 2 29 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 40
>>> ms 7 ms 7 ms 0 ms 0 ms 0.0 B 65.9 KB (memory) / 1 1 ms 42.0 B / 1
>>> 3 30 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 6 ms 3
>>> ms 5 ms 0 ms 0 ms 0.0 B 62.0 KB (memory) / 1 42.0 B / 1
>>> 4 31 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 4 ms 4
>>> ms 6 ms 1 ms 0 ms 0.0 B 69.2 KB (memory) / 1 42.0 B / 1
>>> 5 32 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 2
>>> ms 5 ms 0 ms 0 ms 0.0 B 60.3 KB (memory) / 1 42.0 B / 1
>>> 6 33 0 S

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
For the in-memory cache, we still launch tasks, we just skip blocks when
possible using statistics about those blocks.
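
Concretely, something like the following (a sketch; the table name is taken
from the thread below, not from the original reply):

val cached = sqlContext.table("part_movies").cache()
// A task is still launched for every cached partition, but the per-batch
// column statistics let batches that cannot contain movie = 10 be skipped.
cached.where("movie = 10").count()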

On Tue, Apr 5, 2016 at 12:14 PM, Darshan Singh 
wrote:

> Thanks. It is not my exact scenario but I have tried to reproduce it. I
> have used 1.5.2.
>
> I have a part-movies data-frame which has 20 partitions, 1 for each movie.
>
> I created the following query:
>
>
> val part_sql = sqlContext.sql("select * from part_movies where movie = 10")
> part_sql.count()
>
> I expect that this should read from just 1 partition, i.e. partition 10.
> For the other partitions it should at most read metadata and not the data.
>
> Here is the physical plan. I can see the filter. From here I cannot say
> whether this filter is causing any partition pruning. If pruning is
> actually happening I would like to see an operator which indicates it.
>
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#75L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#93L])
>Project
> Filter (movie#33 = 10)
>  InMemoryColumnarTableScan [movie#33], [(movie#33 = 10)], 
> (InMemoryRelation [movie#33,title#34,genres#35], true, 1, 
> StorageLevel(true, true, false, true, 1), (Scan 
> PhysicalRDD[movie#33,title#34,genres#35]), None)
>
>
> However, my assumption that partitions are not pruned is based not on the
> above plan but on the job and its stages: I can see that it has read the
> full data of the dataframe. I should see around 65KB, as that is roughly
> the average size of each partition.
>
> Aggregated Metrics by Executor
> Executor ID Address Task Time Total Tasks Failed Tasks Succeeded Tasks Input
> Size / Records Shuffle Write Size / Records
> driver localhost:53247 0.4 s 20 0 20 1289.0 KB / 20 840.0 B / 20
>
>
> Task details, first 7 only. Here I expect that, except for 1 task (which
> accesses the partition's data), all others should read either 0 KB or just
> the size of the metadata, after which that partition is discarded as its
> data is not needed. But I can see that all the partitions are read.
>
> This is a small example so it doesn't make a difference, but for a large
> dataframe reading all the data, even in memory, takes time.
>
> Tasks
>
> 0 27 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 39 ms 12
> ms 9 ms 0 ms 0 ms 0.0 B 66.2 KB (memory) / 1 42.0 B / 1
> 1 28 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 41 ms 9
> ms 7 ms 0 ms 0 ms 0.0 B 63.9 KB (memory) / 1 1 ms 42.0 B / 1
> 2 29 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 40 ms 7
> ms 7 ms 0 ms 0 ms 0.0 B 65.9 KB (memory) / 1 1 ms 42.0 B / 1
> 3 30 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 6 ms 3
> ms 5 ms 0 ms 0 ms 0.0 B 62.0 KB (memory) / 1 42.0 B / 1
> 4 31 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 4 ms 4
> ms 6 ms 1 ms 0 ms 0.0 B 69.2 KB (memory) / 1 42.0 B / 1
> 5 32 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 2
> ms 5 ms 0 ms 0 ms 0.0 B 60.3 KB (memory) / 1 42.0 B / 1
> 6 33 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 5 ms 3
> ms 4 ms 0 ms 0 ms 0.0 B 70.3 KB (memory) / 1 42.0 B / 1
> 7 34 0 SUCCESS PROCESS_LOCAL driver / localhost 2016/04/05 19:01:03 4 ms 5
> ms 4 ms 0 ms 0 ms 0.0 B 59.7 KB (memory) / 1 42.0 B / 1
>
> Let me know if you need anything else.
>
> Thanks
>
>
>
>
> On Tue, Apr 5, 2016 at 7:29 PM, Michael Armbrust 
> wrote:
>
>> Can you show your full code.  How are you partitioning the data? How are
>> you reading it?  What is the resulting query plan (run explain() or
>> EXPLAIN).
>>
>> On Tue, Apr 5, 2016 at 10:02 AM, dsing001  wrote:
>>
>>> HI,
>>>
>>> I am using 1.5.2. I have a dataframe which is partitioned based on the
>>> country. So I have around 150 partition in the dataframe. When I run
>>> sparksql and use country = 'UK' it still reads all partitions and not
>>> able
>>> to prune other partitions. Thus all the queries run for similar times
>>> independent of what country I pass. Is it desired?
>>>
>>> Is there a way to fix this in 1.5.2 by using some parameter or is it
>>> fixed
>>> in latest versions?
>>>
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Partition-pruning-in-spark-1-5-2-tp26682.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Michael Armbrust
You should generally think of a DataFrame as unordered, unless you are
explicitly asking for an order.  One way to order and assign an index is
with window functions.
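
For example, a minimal sketch (not from the original reply), assuming Spark
2.0-style function names and a DataFrame df with a single numeric column
named "count":

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Pick an explicit ordering and assign a 1-based index with row_number();
// the row with index 1 then holds the maximum value. Note that a window
// without partitionBy pulls all rows into a single partition, so this is
// only reasonable for small data.
val w = Window.orderBy(col("count").desc)
val indexed = df.withColumn("idx", row_number().over(w))
indexed.where(col("idx") === 1).show()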

On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel  wrote:

> Hello,
>
> I am writing a Spark application in which I need the index of the maximum
> element.
>
> My table has only one column and I want the index of the maximum element.
>
> MAX(count)
> 23
> 32
> 3
> Here is my code; the data type of the array is
> org.apache.spark.sql.DataFrame.
>
>
> Thanks in advance.
> Also, please suggest another way to do it.
>
> [image: Inline image 1]
>


Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
Can you show your full code?  How are you partitioning the data? How are
you reading it?  What is the resulting query plan (run explain() or
EXPLAIN)?

On Tue, Apr 5, 2016 at 10:02 AM, dsing001  wrote:

> HI,
>
> I am using 1.5.2. I have a dataframe which is partitioned based on the
> country, so I have around 150 partitions in the dataframe. When I run
> sparksql and use country = 'UK' it still reads all partitions and is not
> able to prune the other partitions. Thus all the queries run for similar
> times independent of what country I pass. Is this desired?
>
> Is there a way to fix this in 1.5.2 by using some parameter or is it fixed
> in latest versions?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Partition-pruning-in-spark-1-5-2-tp26682.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


[jira] [Resolved] (SPARK-14257) Allow multiple continuous queries to be started from the same DataFrame

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14257.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12049
[https://github.com/apache/spark/pull/12049]

> Allow multiple continuous queries to be started from the same DataFrame
> ---
>
> Key: SPARK-14257
> URL: https://issues.apache.org/jira/browse/SPARK-14257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now DataFrame stores Source in StreamingRelation and prevents it from 
> being reused among multiple continuous queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14402:
-
Target Version/s: 2.0.0

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected implementation of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14402:
-
Assignee: Dongjoon Hyun

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected implementation of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14402:
-
Labels: releasenotes  (was: )

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected implementation of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-05 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226779#comment-15226779
 ] 

Michael Armbrust commented on SPARK-14389:
--

Are you changing the broadcast threshold?

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: lineitem.tbl, sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14345) Decouple deserializer expression resolution from ObjectOperator

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14345.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12131
[https://github.com/apache/spark/pull/12131]

> Decouple deserializer expression resolution from ObjectOperator
> ---
>
> Key: SPARK-14345
> URL: https://issues.apache.org/jira/browse/SPARK-14345
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-05 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226744#comment-15226744
 ] 

Michael Armbrust commented on SPARK-14389:
--

This is a little surprising since the larger relation should be streamed.  How 
big is the heap for your executors?

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: lineitem.tbl, sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-05 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14389:
-
Target Version/s: 2.0.0

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: lineitem.tbl, sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14287) Method to determine if Dataset is bounded or not

2016-04-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14287.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12080
[https://github.com/apache/spark/pull/12080]

> Method to determine if Dataset is bounded or not
> 
>
> Key: SPARK-14287
> URL: https://issues.apache.org/jira/browse/SPARK-14287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Burak Yavuz
> Fix For: 2.0.0
>
>
> With the addition of StreamExecution (ContinuousQuery) to Datasets, data will 
> become unbounded. With unbounded data, the execution of some methods and 
> operations will not make sense, e.g. Dataset.count().
> A simple API is required to check whether the data in a Dataset is bounded or 
> unbounded. This will allow users to check whether their Dataset is in 
> streaming mode or not. ML algorithms may check if the data is unbounded and 
> throw an exception for example.
> The implementation of this method is simple, however naming it is the 
> challenge. Some possible names for this method are:
>  - isStreaming
>  - isContinuous
>  - isBounded
>  - isUnbounded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [SQL] Dataset.map gives error: missing parameter type for expanded function?

2016-04-04 Thread Michael Armbrust
It is called groupByKey now.  Similar to joinWith, the schema produced by
relational joins and aggregations is different than what you would expect
when working with objects.  So, when combining DataFrame+Dataset we renamed
these functions to make this distinction clearer.
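
A minimal sketch of the distinction (assuming a Spark 2.0 build with
implicits imported so that toDS() and $ are available):

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// Typed, function-based grouping is now called groupByKey.
ds.groupByKey(_._1).count()   // Dataset[(String, Long)]

// Untyped, relational grouping keeps the groupBy name and takes columns.
ds.groupBy($"_1").count()     // DataFrame with columns _1 and count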

On Sun, Apr 3, 2016 at 12:23 PM, Jacek Laskowski  wrote:

> Hi,
>
> (since 2.0.0-SNAPSHOT it's more for dev not user)
>
> With today's master I'm getting the following:
>
> scala> ds
> res14: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
>
> // WHY?!
> scala> ds.groupBy(_._1)
> :26: error: missing parameter type for expanded function
> ((x$1) => x$1._1)
>ds.groupBy(_._1)
>   ^
>
> scala> ds.filter(_._1.size > 10)
> res23: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
>
> It's even on the slide of Michael in
> https://youtu.be/i7l3JQRx7Qw?t=7m38s from Spark Summit East?! Am I
> doing something wrong? Please guide.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[jira] [Resolved] (SPARK-14176) Add processing time trigger

2016-04-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14176.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11976
[https://github.com/apache/spark/pull/11976]

> Add processing time trigger
> ---
>
> Key: SPARK-14176
> URL: https://issues.apache.org/jira/browse/SPARK-14176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Add a processing time trigger to control the batch processing speed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Michael Armbrust
What error are you getting?  Here is an example.

External types are documented here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
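
For reference, a minimal Scala sketch (the column and function names here
are made up): Spark hands ArrayType columns to Scala UDFs as a Seq, so
declare the parameter as Seq[Double] rather than Array[Double]:

import org.apache.spark.sql.functions.udf

// Sums an array-of-doubles column; note the Seq[Double] parameter type.
val sumFeatures = udf((xs: Seq[Double]) => xs.sum)

df.select(sumFeatures(df("features")).as("featureSum")).show()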

On Fri, Apr 1, 2016 at 1:59 PM, Jerry Lam  wrote:

> Hi spark users and developers,
>
> Has anyone tried to pass an Array[Double] as an input to a UDF? I tried it
> for many hours reading the spark sql code but I still couldn't figure out a
> way to do this.
>
> Best Regards,
>
> Jerry
>


[jira] [Resolved] (SPARK-14255) Streaming Aggregation

2016-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14255.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12048
[https://github.com/apache/spark/pull/12048]

> Streaming Aggregation
> -
>
> Key: SPARK-14255
> URL: https://issues.apache.org/jira/browse/SPARK-14255
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Support for time column type?

2016-04-01 Thread Michael Armbrust
There is also CalendarIntervalType.  Is that what you are looking for?

On Fri, Apr 1, 2016 at 1:11 PM, Philip Weaver 
wrote:

> Hi, I don't see any mention of a time type in the documentation (there is
> DateType and TimestampType, but not TimeType), and have been unable to find
> any documentation about whether this will be supported in the future. Does
> anyone know if this is currently supported or will be supported in the
> future?
>


Re: What influences the space complexity of Spark operations?

2016-04-01 Thread Michael Armbrust
Blocking operators like Sort, Join or Aggregate will put all of the data
for a whole partition into a hash table or array.  However, if you are
running Spark 1.5+ we should be spilling to disk.  In Spark 1.6 if you are
seeing OOMs for SQL operations you should report it as a bug.

On Thu, Mar 31, 2016 at 9:26 AM, Steve Johnston  wrote:

> *What we’ve observed*
>
> Increasing the number of partitions (and thus decreasing the partition
> size) seems to reliably help avoid OOM errors. To demonstrate this we used
> a single executor and loaded a small table into a DataFrame, persisted it
> with MEMORY_AND_DISK, repartitioned it and joined it to itself. Varying the
> number of partitions identifies a threshold between completing the join and
> incurring an OOM error.
>
>
> lineitem = sc.textFile('lineitem.tbl').map(converter)
> lineitem = sqlContext.createDataFrame(lineitem, schema)
> lineitem.persist(StorageLevel.MEMORY_AND_DISK)
> repartitioned = lineitem.repartition(partition_count)
> joined = repartitioned.join(repartitioned)
> joined.show()
>
>
> *Questions*
>
> Generally, what influences the space complexity of Spark operations? Is it
> the case that a single partition of each operand’s data set + a single
> partition of the resulting data set all need to fit in memory at the same
> time? We can see where the transformations (for say joins) are implemented
> in the source code (for the example above BroadcastNestedLoopJoin), but
> they seem to be based on virtualized iterators; where in the code is the
> partition data for the inputs and outputs actually materialized?
> --
> View this message in context: What influences the space complexity of
> Spark operations?
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>


Re: Spark SQL UDF Returning Rows

2016-04-01 Thread Michael Armbrust
>
> I haven't looked at Encoders or Datasets since we're bound to 1.6 for now
> but I'll look at encoders to see if that covers it. Datasets seems like it
> would solve this problem for sure.
>

There is an experimental preview of Datasets in Spark 1.6.


> I avoided returning a case object because, even if we use reflection to
> build byte code and do it efficiently, I still need to convert my Row to a
> case object manually within my UDF, just to have it converted to a Row
> again. Even if it's fast, it's still fairly unnecessary.
>

Even if you give us a Row there's still a conversion into the binary format
of InternalRow.


> The thing I guess that threw me off was that UDF1/2/3 was in a "java"
> prefixed package although there was nothing that made it java specific and
> in fact was the only way to do what I wanted in scala. For things like
> JavaRDD, etc it makes sense, but for generic things like UDF is there a
> reason they get put into a package with "java" in the name?
>

This was before we decided to unify the APIs for Scala and Java, so its
mostly historical.


[jira] [Updated] (SPARK-14160) Windowing for structured streaming

2016-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14160:
-
Assignee: Burak Yavuz

> Windowing for structured streaming
> --
>
> Key: SPARK-14160
> URL: https://issues.apache.org/jira/browse/SPARK-14160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.0
>
>
> This JIRA is to track the status regarding event time windowing operations 
> for Continuous queries.
> The proposition is to add a 
> {code}
> window(timeColumn, windowDuration, slideDuration, startTime)
> {code} expression that will bucket time columns into time windows. This 
> expression will be useful in both batch analysis and streaming. With 
> streaming, it will open up the use case for event-time window aggregations.
> For R, and Python interoperability, we will take windowDuration, 
> slideDuration, startTime as strings and parse interval lengths.
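
A minimal sketch of how the proposed expression would be used from Scala
once implemented (column name and durations are hypothetical):

import org.apache.spark.sql.functions.window

// Bucket rows into 10-minute windows sliding every 5 minutes,
// then aggregate per window.
df.groupBy(window(df("eventTime"), "10 minutes", "5 minutes")).count()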



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14160) Windowing for structured streaming

2016-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14160.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12008
[https://github.com/apache/spark/pull/12008]

> Windowing for structured streaming
> --
>
> Key: SPARK-14160
> URL: https://issues.apache.org/jira/browse/SPARK-14160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Burak Yavuz
> Fix For: 2.0.0
>
>
> This JIRA is to track the status regarding event time windowing operations 
> for Continuous queries.
> The proposition is to add a 
> {code}
> window(timeColumn, windowDuration, slideDuration, startTime)
> {code} expression that will bucket time columns into time windows. This 
> expression will be useful in both batch analysis and streaming. With 
> streaming, it will open up the use case for event-time window aggregations.
> For R, and Python interoperability, we will take windowDuration, 
> slideDuration, startTime as strings and parse interval lengths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14070.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11891
[https://github.com/apache/spark/pull/11891]

> Use ORC data source for SQL queries on ORC tables
> -
>
> Key: SPARK-14070
> URL: https://issues.apache.org/jira/browse/SPARK-14070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently if one is trying to query ORC tables in Hive, the plan generated by 
> Spark shows that it is using the `HiveTableScan` operator which is generic to 
> all file formats. We could instead use the ORC data source for this so that 
> we can get ORC-specific optimizations like predicate pushdown.
> Current behaviour:
> ```
> scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias(*, None)]
> +- 'UnresolvedRelation `orc_table`, None
> == Analyzed Logical Plan ==
> key: string, value: string
> Project [key#171,value#172]
> +- MetastoreRelation default, orc_table, None
> == Optimized Logical Plan ==
> MetastoreRelation default, orc_table, None
> == Physical Plan ==
> HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
> ```
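
For reference, a sketch of the intended behaviour after the fix; the flag
name here is an assumption modelled on the existing
spark.sql.hive.convertMetastoreParquet setting:

// Hypothetical flag: convert Hive ORC tables to the ORC data source.
hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")
hqlContext.sql("SELECT * FROM orc_table WHERE key = '0'").explain(true)
// The physical plan should then show an ORC relation scan with pushed-down
// filters instead of the generic HiveTableScan above.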



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14191) Fix Expand operator constraints

2016-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14191.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11995
[https://github.com/apache/spark/pull/11995]

> Fix Expand operator constraints
> ---
>
> Key: SPARK-14191
> URL: https://issues.apache.org/jira/browse/SPARK-14191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Expand operator now uses its child plan's constraints as its valid 
> constraints (i.e., the base of constraints). This is not correct because 
> Expand will set its group by attributes to null values. So the nullability of 
> these attributes should be true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13995) Extract correct IsNotNull constraints for Expression

2016-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13995.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11809
[https://github.com/apache/spark/pull/11809]

> Extract correct IsNotNull constraints for Expression
> 
>
> Key: SPARK-13995
> URL: https://issues.apache.org/jira/browse/SPARK-13995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> We infer relative `IsNotNull` constraints from logical plan's expressions in 
> `constructIsNotNullConstraints` now. However, we don't consider the case of 
> (nested) `Cast`.
> For example:
> val tr = LocalRelation('a.int, 'b.long)
> val plan = tr.where('a.attr === 'b.attr).analyze
> Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr, 
> "a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR 
> fixes it.
> Besides, as `IsNotNull` constraints are most useful for `Attribute`, we 
> should recurse through any `Expression` that is null intolerant and 
> construct `IsNotNull` constraints for all `Attribute`s under these 
> Expressions.
> For example, consider the following constraints:
> val df = Seq((1,2,3)).toDF("a", "b", "c")
> df.where("a + b = c").queryExecution.analyzed.constraints
> The inferred isnotnull constraints should be isnotnull(a), isnotnull(b), 
> isnotnull(c), instead of isnotnull(a + c) and isnotnull(c).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14288) Memory Sink

2016-03-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14288:


 Summary: Memory Sink
 Key: SPARK-14288
 URL: https://issues.apache.org/jira/browse/SPARK-14288
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: pyspark read json file with high dimensional sparse data

2016-03-30 Thread Michael Armbrust
You can force the data to be loaded as a sparse map, assuming the key/value
types are consistent.  Here is an example.
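
A minimal Scala sketch of the idea, assuming each record keeps its sparse
attributes under a single (hypothetical) "features" key with consistent
double values:

import org.apache.spark.sql.types._

// Supply the schema explicitly so the sparse attributes land in one MapType
// column instead of schema inference producing thousands of struct fields.
val schema = StructType(Seq(
  StructField("features", MapType(StringType, DoubleType), nullable = true)))

val df = sqlContext.read.schema(schema).json("path/to/data.json")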

On Wed, Mar 30, 2016 at 8:17 AM, Yavuz Nuzumlalı 
wrote:

> Hi all,
>
> I'm trying to read data inside a json file using the
> `SQLContext.read.json()` method.
>
> However, reading operation does not finish. My data is of 29x3100
> dimensions, but it's actually really sparse, so if there is a way to
> directly read json into a sparse dataframe, it would work perfect for me.
>
> What are the alternatives for reading such data into spark?
>
> P.S. : When I try to load first 5 rows, read operation is completed in
> ~2 minutes.
>


Re: Spark SQL UDF Returning Rows

2016-03-30 Thread Michael Armbrust
Some answers and more questions inline

- UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects
> as parameters. I cannot take in a case class object in place of the
> corresponding Row object, even if the schema matches because the Row object
> will always be passed in at Runtime and it will yield a ClassCastException.


This is true today, but could be improved using the new encoder framework.
Out of curiosity, have you looked at that?  If so, what is missing that's
leading you back to UDFs?

Is there any way to return a Row object in scala from a UDF and specify the
> known schema that would be returned at UDF registration time? In
> python/java this seems to be the case because you need to explicitly
> specify return DataType of your UDF but using scala functions this isn't
> possible. I guess I could use the Java UDF1/2/3... API but I wanted to see
> if there was a first class scala way to do this.


I think UDF1/2/3 are the only way to do this today.  Is the problem here
that you are only changing a subset of the nested data and you want to
preserve the structure?  What kind of changes are you doing?

2) Is Spark actually converting the returned case class object when the UDF
>> is called, or does it use the fact that it's essentially "Product" to
>> efficiently coerce it to a Row in some way?
>>
>
We use reflection to figure out the schema and extract the data into the
internal row format.  We actually build bytecode for this at runtime in many
cases (though not all yet) so it can be pretty fast.


> 2.1) If this is the case, we could just take in a case object as a
> parameter (rather than a Row) and perform manipulation on that and return
> it. Is this explicitly something we avoided?


You can do this with Datasets:

df.as[CaseClass].map(o => do stuff)
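
Spelled out slightly (a sketch with a made-up case class; assumes
import sqlContext.implicits._ for the encoder):

case class Event(id: Long, name: String)

// Typed manipulation: work with plain objects and get a Dataset back,
// with no hand-written Row construction.
val updated = df.as[Event].map(e => e.copy(name = e.name.toUpperCase))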


Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Michael Armbrust
+1 to Matei's reasoning.

On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia 
wrote:

> I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
> entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
> the default version we built with in 1.x. We want to make the transition
> from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default
> downloads be for Scala 2.11, so people will more easily move, but we
> shouldn't create obstacles that lead to fragmenting the community and
> slowing down Spark 2.0's adoption. I've seen companies that stayed on an
> old Scala version for multiple years because switching it, or mixing
> versions, would affect the company's entire codebase.
>
> Matei
>
> On Mar 30, 2016, at 12:08 PM, Koert Kuipers  wrote:
>
> oh wow, had no idea it got ripped out
>
> On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra 
> wrote:
>
>> No, with 2.0 Spark really doesn't use Akka:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744
>>
>> On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers  wrote:
>>
>>> Spark still runs on akka. So if you want the benefits of the latest akka
>>> (not saying we do, was just an example) then you need to drop scala 2.10
>>> On Mar 30, 2016 10:44 AM, "Cody Koeninger"  wrote:
>>>
 I agree with Mark in that I don't see how supporting scala 2.10 for
 spark 2.0 implies supporting it for all of spark 2.x

 Regarding Koert's comment on akka, I thought all akka dependencies
 have been removed from spark after SPARK-7997 and the recent removal
 of external/akka

 On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra 
 wrote:
 > Dropping Scala 2.10 support has to happen at some point, so I'm not
 > fundamentally opposed to the idea; but I've got questions about how
 we go
 > about making the change and what degree of negative consequences we
 are
 > willing to accept.  Until now, we have been saying that 2.10 support
 will be
 > continued in Spark 2.0.0.  Switching to 2.11 will be non-trivial for
 some
 > Spark users, so abruptly dropping 2.10 support is very likely to delay
 > migration to Spark 2.0 for those users.
 >
 > What about continuing 2.10 support in 2.0.x, but repeatedly making an
 > obvious announcement in multiple places that such support is
 deprecated,
 > that we are not committed to maintaining it throughout 2.x, and that
 it is,
 > in fact, scheduled to be removed in 2.1.0?
 >
 > On Wed, Mar 30, 2016 at 7:45 AM, Sean Owen 
 wrote:
 >>
 >> (This should fork as its own thread, though it began during
 discussion
 >> of whether to continue Java 7 support in Spark 2.x.)
 >>
 >> Simply: would like to more clearly take the temperature of all
 >> interested parties about whether to support Scala 2.10 in the Spark
 >> 2.x lifecycle. Some of the arguments appear to be:
 >>
 >> Pro
 >> - Some third party dependencies do not support Scala 2.11+ yet and so
 >> would not be usable in a Spark app
 >>
 >> Con
 >> - Lower maintenance overhead -- no separate 2.10 build,
 >> cross-building, tests to check, esp considering support of 2.12 will
 >> be needed
 >> - Can use 2.11+ features freely
 >> - 2.10 was EOL in late 2014 and Spark 2.x lifecycle is years to come
 >>
 >> I would like to not support 2.10 for Spark 2.x, myself.
 >>
 >> -
 >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 >> For additional commands, e-mail: dev-h...@spark.apache.org
 >>
 >

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>
>
>


[jira] [Resolved] (SPARK-14268) rename toRowExpressions and fromRowExpression to serializer and deserializer in ExpressionEncoder

2016-03-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14268.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12058
[https://github.com/apache/spark/pull/12058]

> rename toRowExpressions and fromRowExpression to serializer and deserializer 
> in ExpressionEncoder
> -
>
> Key: SPARK-14268
> URL: https://issues.apache.org/jira/browse/SPARK-14268
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14255) Streaming Aggregation

2016-03-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14255:


 Summary: Streaming Aggregation
 Key: SPARK-14255
 URL: https://issues.apache.org/jira/browse/SPARK-14255
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13531:
-
Priority: Major  (was: Minor)

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> i run:
> df1.join(df2, Seq("x")).show
> i get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> Not sure what changed; this ran about a week ago without issues (in our 
> internal unit tests). It is fully reproducible; however, when I tried to 
> minimize the issue I could not reproduce it by just creating data frames in 
> the repl with the same contents, so it probably has something to do with the 
> way these are created (from Row objects and StructTypes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Oh, I'm sorry I didn't fully understand what you were trying to do.  If you
don't need partitioning, you can set
"spark.sql.sources.partitionDiscovery.enabled=false".  Otherwise, I think
you need to use the unioning approach.

On Fri, Mar 25, 2016 at 1:35 PM, Spencer Uresk  wrote:

> Thanks for the suggestion - I didn't try it at first because it seems like
> I have multiple roots and not necessarily partitioned data. Is this the
> correct way to do that?
>
> sqlContext.read.option("basePath",
> "hdfs://user/hdfs/analytics/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
>
> If so, it returns the same error:
>
> java.lang.AssertionError: assertion failed: Conflicting directory
> structures detected. Suspicious paths:
> hdfs://user/hdfs/analytics/app1/PAGEVIEW
> hdfs://user/hdfs/analytics/app2/PAGEVIEW
>
> On Fri, Mar 25, 2016 at 2:00 PM, Michael Armbrust 
> wrote:
>
>> Have you tried setting a base path for partition discovery?
>>
>> Starting from Spark 1.6.0, partition discovery only finds partitions
>>> under the given paths by default. For the above example, if users pass
>>> path/to/table/gender=male to either SQLContext.read.parquet or
>>> SQLContext.read.load, gender will not be considered as a partitioning
>>> column. If users need to specify the base path that partition discovery
>>> should start with, they can set basePath in the data source options.
>>> For example, when path/to/table/gender=male is the path of the data and
>>> users set basePath to path/to/table/, gender will be a partitioning
>>> column.
>>
>>
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
>>
>>
>>
>> On Fri, Mar 25, 2016 at 10:34 AM, Ted Yu  wrote:
>>
>>> This is the original subject of the JIRA:
>>> Partition discovery fail if there is a _SUCCESS file in the table's root
>>> dir
>>>
>>> If I remember correctly, there were discussions on how (traditional)
>>> partition discovery slowed down Spark jobs.
>>>
>>> Cheers
>>>
>>> On Fri, Mar 25, 2016 at 10:15 AM, suresk  wrote:
>>>
>>>> In previous versions of Spark, this would work:
>>>>
>>>> val events =
>>>> sqlContext.jsonFile("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
>>>>
>>>> Where the first wildcard corresponds to an application directory, the
>>>> second
>>>> to a partition directory, and the third matched all the files in the
>>>> partition directory. The records are all the exact same format, they are
>>>> just broken out by application first, then event type. This
>>>> functionality
>>>> was really useful.
>>>>
>>>> In 1.6, this same call results in the following error:
>>>>
>>>> Conflicting directory structures detected. Suspicious paths:
>>>> (list of paths)
>>>>
>>>> And then it recommends reading in each root directory separately and
>>>> unioning them together. It looks like the change happened here:
>>>>
>>>> https://github.com/apache/spark/pull/9651
>>>>
>>>> 1) Simply out of curiosity, since I'm still fairly new to Spark - what
>>>> is
>>>> the benefit of no longer allowing multiple roots?
>>>>
>>>> 2) Is there a better way to do what I'm trying to do? Discovering all
>>>> of the
>>>> paths (I won't know them ahead of time), creating tables for each of
>>>> them,
>>>> and then doing all of the unions seems inefficient and a lot of extra
>>>> work
>>>> compared to what I had before.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-and-multiple-roots-in-1-6-tp26598.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Michael Armbrust
I would not recommend using the direct output committer with HDFS.  It's
intended only as an optimization for S3.

On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar  wrote:

> Hi,
>
> We are doing the following to save a dataframe in parquet (using
> DirectParquetOutputCommitter) as follows.
>
> dfWriter.format("parquet")
>   .mode(SaveMode.Overwrite)
>   .save(outputPath)
>
> The problem is even if an executor fails once while writing file (say some
> transient HDFS issue), when its re-spawn, it fails again because the file
> exists already, eventually failing the entire job.
>
> Is this a known issue? Any workarounds?
>
> Thanks
> Vinoth
>


Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery?

Starting from Spark 1.6.0, partition discovery only finds partitions under
> the given paths by default. For the above example, if users pass
> path/to/table/gender=male to either SQLContext.read.parquet or
> SQLContext.read.load, gender will not be considered as a partitioning
> column. If users need to specify the base path that partition discovery
> should start with, they can set basePath in the data source options. For
> example, when path/to/table/gender=male is the path of the data and users
> set basePath to path/to/table/, gender will be a partitioning column.


http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery



On Fri, Mar 25, 2016 at 10:34 AM, Ted Yu  wrote:

> This is the original subject of the JIRA:
> Partition discovery fail if there is a _SUCCESS file in the table's root
> dir
>
> If I remember correctly, there were discussions on how (traditional)
> partition discovery slowed down Spark jobs.
>
> Cheers
>
> On Fri, Mar 25, 2016 at 10:15 AM, suresk  wrote:
>
>> In previous versions of Spark, this would work:
>>
>> val events =
>> sqlContext.jsonFile("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
>>
>> Where the first wildcard corresponds to an application directory, the
>> second
>> to a partition directory, and the third matched all the files in the
>> partition directory. The records are all the exact same format, they are
>> just broken out by application first, then event type. This functionality
>> was really useful.
>>
>> In 1.6, this same call results in the following error:
>>
>> Conflicting directory structures detected. Suspicious paths:
>> (list of paths)
>>
>> And then it recommends reading in each root directory separately and
>> unioning them together. It looks like the change happened here:
>>
>> https://github.com/apache/spark/pull/9651
>>
>> 1) Simply out of curiosity, since I'm still fairly new to Spark - what is
>> the benefit of no longer allowing multiple roots?
>>
>> 2) Is there a better way to do what I'm trying to do? Discovering all of
>> the
>> paths (I won't know them ahead of time), creating tables for each of them,
>> and then doing all of the unions seems inefficient and a lot of extra work
>> compared to what I had before.
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-and-multiple-roots-in-1-6-tp26598.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


[jira] [Resolved] (SPARK-12443) encoderFor should support Decimal

2016-03-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12443.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10399
[https://github.com/apache/spark/pull/10399]

> encoderFor should support Decimal
> -
>
> Key: SPARK-12443
> URL: https://issues.apache.org/jira/browse/SPARK-12443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-03-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14048:
-
Target Version/s: 2.0.0

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}.
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Michael Armbrust
On Thu, Mar 24, 2016 at 4:54 PM, Mark Hamstra 
 wrote:

> It's a pain in the ass.  Especially if some of your transitive
> dependencies never upgraded from 2.10 to 2.11.
>

Yeah, I'm going to have to agree here.  It is not as bad as it was in the
2.9 days, but it's still non-trivial due to the ecosystem part of it.  For
this reason I think that it is premature to drop support for 2.10.x.


Re: Column explode a map

2016-03-24 Thread Michael Armbrust
If you know the map keys ahead of time then you can just extract them
directly.

Here are a few examples.
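
For instance, a minimal sketch against the DataFrame from the message below
(getItem returns null for keys missing from a row's map, which matches the
desired output):

import org.apache.spark.sql.functions.col

val keys = Seq("a", "b", "c")  // known ahead of time
// Build one column per key directly from the map, no explode/pivot needed.
val wide = events.select(col("id") +: keys.map(k => col("map").getItem(k).as(k)): _*)
wide.show()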

On Thu, Mar 24, 2016 at 12:01 PM, Michał Zieliński <
zielinski.mich...@gmail.com> wrote:

> Hi,
>
> Imagine you have a structure like this:
>
> val events = sqlContext.createDataFrame(
>Seq(
>  ("a", Map("a"->1,"b"->1)),
>  ("b", Map("b"->1,"c"->1)),
>  ("c", Map("a"->1,"c"->1))
>)
>  ).toDF("id","map")
>
> What I want to achieve is have the map values as a separate columns.
> Basically I want to achieve this:
>
> +---++++
> | id|   a|   b|   c|
> +---++++
> |  a|   1|   1|null|
> |  b|null|   1|   1|
> |  c|   1|null|   1|
> +---++++
>
> I managed to create it with an explode-pivot combo, but for large dataset,
> and a list of map keys around 1000 I imagine this will
> be prohibitively expensive. I reckon there must be a much easier way to
> achieve that, than:
>
> val exploded =
> events.select(col("id"),explode(col("map"))).groupBy("id").pivot("key").sum("value")
>
> Any help would be appreciated. :)
>


Re: calling individual columns from spark temporary table

2016-03-24 Thread Michael Armbrust
df.filter(col("paid") > "").select(col("name1").as("newName"), ...)
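
Spelled out a bit more (a rough sketch; "name1"/"name2" and the new names are
placeholders for your actual CSV column names):

import org.apache.spark.sql.functions.col

// Staying on Column expressions means no lambda is compiled, so Spark can
// keep operating directly on serialized data.
val renamed = df.filter(col("paid") > "")
  .select(col("name1").as("newName"), col("name2").as("otherName"))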

On Wed, Mar 23, 2016 at 6:17 PM, Ashok Kumar  wrote:

> Thank you again
>
> For
>
> val r = df.filter(col("paid") > "").map(x =>
> (x.getString(0),x.getString(1).)
>
> Can you give an example of column expression please
>  like
>
> df.filter(col("paid") > "").col("firstcolumn").getString   ?
>
>
>
>
> On Thursday, 24 March 2016, 0:45, Michael Armbrust 
> wrote:
>
>
> You can only use as on a Column expression, not inside of a lambda
> function.  The reason is the lambda function is compiled into opaque
> bytecode that Spark SQL is not able to see.  We just blindly execute it.
>
> However, there are a couple of ways to name the columns that come out of a
> map.  Either use a case class instead of a tuple.  Or use .toDF("name1",
> "name2") after the map.
>
> From a performance perspective, it's even better though if you can avoid
> maps and stick to Column expressions.  The reason is that for maps, we have
> to actually materialize an object to pass to your function.  However, if
> you stick to column expressions we can actually work directly on serialized
> data.
>
> On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar  wrote:
>
> thank you sir
>
> sql("select `_1` as firstcolumn from items")
>
> is there anyway one can keep the csv column names using databricks when
> mapping
>
> val r = df.filter(col("paid") > "").map(x =>
> (x.getString(0),x.getString(1).)
>
> can I call example  x.getString(0).as.(firstcolumn) in above when mapping
> if possible so columns will have labels
>
>
>
>
>
> On Thursday, 24 March 2016, 0:18, Michael Armbrust 
> wrote:
>
>
> You probably need to use `backticks` to escape `_1` since I don't think
> that it's a valid SQL identifier.
>
> On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar  > wrote:
>
> Gurus,
>
> If I register a temporary table as below
>
>  r.toDF
> res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3:
> double, _4: double, _5: double]
>
> r.toDF.registerTempTable("items")
>
> sql("select * from items")
> res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3:
> double, _4: double, _5: double]
>
> Is there anyway I can do a select on the first column only
>
> sql("select _1 from items" throws error
>
> Thanking you
>
>
>
>
>
>
>
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-24 Thread Michael Armbrust
Patrick is investigating.

On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Just checking in on this again as the builds on S3 are still broken. :/
>
> Could it have something to do with us moving release-build.sh
> <https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh>
> ?
> ​
>
> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Is someone going to retry fixing these packages? It's still a problem.
>>
>> Also, it would be good to understand why this is happening.
>>
>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky  wrote:
>>
>>> I just realized you're using a different download site. Sorry for the
>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>> Hadoop 2.6 is
>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>  wrote:
>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt
>>> ZIP
>>> > file.
>>> >
>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>>> Spark
>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>> >
>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky 
>>> wrote:
>>> >>
>>> >> I just experienced the issue, however retrying the download a second
>>> >> time worked. Could it be that there is some load balancer/cache in
>>> >> front of the archive and some nodes still serve the corrupt packages?
>>> >>
>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>> >>  wrote:
>>> >> > I'm seeing the same. :(
>>> >> >
>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu 
>>> wrote:
>>> >> >>
>>> >> >> I tried again this morning :
>>> >> >>
>>> >> >> $ wget
>>> >> >>
>>> >> >>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >> --2016-03-18 07:55:30--
>>> >> >>
>>> >> >>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>> >> >> ...
>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>
>>> >> >> gzip: stdin: unexpected end of file
>>> >> >> tar: Unexpected EOF in archive
>>> >> >> tar: Unexpected EOF in archive
>>> >> >> tar: Error is not recoverable: exiting now
>>> >> >>
>>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>>> >> >> 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>>> >> >>>
>>> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>>> >> >>> 
>>> >> >>> wrote:
>>> >> >>>>
>>> >> >>>> Looks like the other packages may also be corrupt. I’m getting
>>> the
>>> >> >>>> same
>>> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>> >> >>>>
>>> >> >>>> Nick
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu 
>>> wrote:
>>> >> >>>>>
>>> >> >>>>> On Linux, I got:
>>> >> >>>>>
>>> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>>>>
>>> >> >>>>> gzip: stdin: unexpected end of file
>>> >> >>>>> tar: Unexpected EOF in archive
>>> >> >>>>> tar: Unexpected EOF in archive
>>> >> >>>>> tar: Error is not recoverable: exiting now
>>> >> >>>>>
>>> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>>> >> >>>>>  wrote:
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> >> >>>>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>>>>>
>>> >> >>>>>> Does anyone else have trouble unzipping this? How did this
>>> happen?
>>> >> >>>>>>
>>> >> >>>>>> What I get is:
>>> >> >>>>>>
>>> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>> >> >>>>>>
>>> >> >>>>>> Seems like a strange type of problem to come across.
>>> >> >>>>>>
>>> >> >>>>>> Nick
>>> >> >>>>>
>>> >> >>>>>
>>> >> >>
>>> >> >
>>>
>>


Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You can only use as on a Column expression, not inside of a lambda
function.  The reason is the lambda function is compiled into opaque
bytecode that Spark SQL is not able to see.  We just blindly execute it.

However, there are a couple of ways to name the columns that come out of a
map.  Either use a case class instead of a tuple.  Or use .toDF("name1",
"name2") after the map.

From a performance perspective, it's even better though if you can avoid
maps and stick to Column expressions.  The reason is that for maps, we have
to actually materialize an object to pass to your function.  However, if
you stick to column expressions we can actually work directly on serialized
data.
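
A rough sketch of both naming options (the Payment class and column names are
made up; df and sqlContext are assumed from your session):

import org.apache.spark.sql.functions.col
import sqlContext.implicits._

case class Payment(name: String, amount: String)

// Option 1: a case class carries the column names through the map;
// toDF() then names the columns "name" and "amount" from its fields.
val named1 = df.filter(col("paid") > "")
  .map(x => Payment(x.getString(0), x.getString(1)))
  .toDF()

// Option 2: keep the tuple and name the columns afterwards.
val named2 = df.filter(col("paid") > "")
  .map(x => (x.getString(0), x.getString(1)))
  .toDF("name", "amount")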

On Wed, Mar 23, 2016 at 5:27 PM, Ashok Kumar  wrote:

> thank you sir
>
> sql("select `_1` as firstcolumn from items")
>
> is there anyway one can keep the csv column names using databricks when
> mapping
>
> val r = df.filter(col("paid") > "").map(x =>
> (x.getString(0),x.getString(1).)
>
> can I call example  x.getString(0).as.(firstcolumn) in above when mapping
> if possible so columns will have labels
>
>
>
>
>
> On Thursday, 24 March 2016, 0:18, Michael Armbrust 
> wrote:
>
>
> You probably need to use `backticks` to escape `_1` since I don't think
> that it's a valid SQL identifier.
>
> On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar  > wrote:
>
> Gurus,
>
> If I register a temporary table as below
>
>  r.toDF
> res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3:
> double, _4: double, _5: double]
>
> r.toDF.registerTempTable("items")
>
> sql("select * from items")
> res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3:
> double, _4: double, _5: double]
>
> Is there anyway I can do a select on the first column only
>
> sql("select _1 from items" throws error
>
> Thanking you
>
>
>
>
>


Re: calling individual columns from spark temporary table

2016-03-23 Thread Michael Armbrust
You probably need to use `backticks` to escape `_1` since I don't think
that it's a valid SQL identifier.

On Wed, Mar 23, 2016 at 5:10 PM, Ashok Kumar 
wrote:

> Gurus,
>
> If I register a temporary table as below
>
>  r.toDF
> res58: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3:
> double, _4: double, _5: double]
>
> r.toDF.registerTempTable("items")
>
> sql("select * from items")
> res60: org.apache.spark.sql.DataFrame = [_1: string, _2: string, _3:
> double, _4: double, _5: double]
>
> Is there anyway I can do a select on the first column only
>
> sql("select _1 from items" throws error
>
> Thanking you
>


[jira] [Resolved] (SPARK-14078) Simple FileSink for Parquet

2016-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14078.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11897
[https://github.com/apache/spark/pull/11897]

> Simple FileSink for Parquet
> ---
>
> Key: SPARK-14078
> URL: https://issues.apache.org/jira/browse/SPARK-14078
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark schema evolution

2016-03-22 Thread Michael Armbrust
Which version of Spark?  This sounds like a bug (that might be fixed).
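
(For anyone trying to reproduce this, a rough sketch of the setup described in
the report below; the paths are placeholders.)

sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")  // reportedly works when set to "false"

val merged = sqlContext.read
  .option("mergeSchema", "true")  // per-read equivalent of the global flag
  .parquet("/path/to/operational_analytics_peoplemover.parquet",
           "/path/to/operational_analytics_mis.parquet")
merged.registerTempTable("operational_analytics")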

On Tue, Mar 22, 2016 at 6:34 AM, gtinside  wrote:

> Hi ,
>
> I have a table sourced from *2 parquet files* with a few extra columns in one
> of the parquet files. Simple queries work fine but queries with a predicate
> on the extra column don't work and I get column not found
>
> *Column resp_party_type exist in just one parquet file*
>
> a) Query that work :
> select resp_party_type  from operational_analytics
>
> b) Query that doesn't work : (complains about missing column
> *resp_party_type *)
> select category as Events, resp_party as Team, count(*) as Total from
> operational_analytics where application = 'PeopleMover' and resp_party_type
> = 'Team' group by category, resp_party
>
> *Query Plan for (b)*
> == Physical Plan ==
> TungstenAggregate(key=[category#30986,resp_party#31006],
> functions=[(count(1),mode=Final,isDistinct=false)],
> output=[Events#36266,Team#36267,Total#36268L])
>  TungstenExchange hashpartitioning(category#30986,resp_party#31006)
>   TungstenAggregate(key=[category#30986,resp_party#31006],
> functions=[(count(1),mode=Partial,isDistinct=false)],
> output=[category#30986,resp_party#31006,currentCount#36272L])
>Project [category#30986,resp_party#31006]
> Filter ((application#30983 = PeopleMover) && (resp_party_type#31007 =
> Team))
>  Scan
>
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_peoplemover.parquet,snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_mis.parquet][category#30986,resp_party#31006,application#30983,resp_party_type#31007]
>
>
> I have set spark.sql.parquet.mergeSchema = true and
> spark.sql.parquet.filterPushdown = true. When I set
> spark.sql.parquet.filterPushdown = false Query (b) starts working,
> execution
> plan after setting the filterPushdown = false for Query(b)
>
> == Physical Plan ==
> TungstenAggregate(key=[category#30986,resp_party#31006],
> functions=[(count(1),mode=Final,isDistinct=false)],
> output=[Events#36313,Team#36314,Total#36315L])
>  TungstenExchange hashpartitioning(category#30986,resp_party#31006)
>   TungstenAggregate(key=[category#30986,resp_party#31006],
> functions=[(count(1),mode=Partial,isDistinct=false)],
> output=[category#30986,resp_party#31006,currentCount#36319L])
>Project [category#30986,resp_party#31006]
> Filter ((application#30983 = PeopleMover) && (resp_party_type#31007 =
> Team))
>  Scan
>
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_peoplemover.parquet,snackfs://tst:9042/aladdin_data_beta/operational_analytics/operational_analytics_mis.parquet][category#30986,resp_party#31006,application#30983,resp_party_type#31007]
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-schema-evolution-tp26563.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


[jira] [Updated] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14070:
-
Target Version/s: 2.0.0

> Use ORC data source for SQL queries on ORC tables
> -
>
> Key: SPARK-14070
> URL: https://issues.apache.org/jira/browse/SPARK-14070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> Currently if one is trying to query ORC tables in Hive, the plan generated by 
> Spark shows that it is using the `HiveTableScan` operator which is generic to 
> all file formats. We could instead use the ORC data source for this so that 
> we can get ORC-specific optimizations like predicate pushdown.
> Current behaviour:
> ```
> scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias(*, None)]
> +- 'UnresolvedRelation `orc_table`, None
> == Analyzed Logical Plan ==
> key: string, value: string
> Project [key#171,value#172]
> +- MetastoreRelation default, orc_table, None
> == Optimized Logical Plan ==
> MetastoreRelation default, orc_table, None
> == Physical Plan ==
> HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14078) Simple FileSink for Parquet

2016-03-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-14078:


 Summary: Simple FileSink for Parquet
 Key: SPARK-14078
 URL: https://issues.apache.org/jira/browse/SPARK-14078
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14070:
-
Assignee: Tejas Patil

> Use ORC data source for SQL queries on ORC tables
> -
>
> Key: SPARK-14070
> URL: https://issues.apache.org/jira/browse/SPARK-14070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>
> Currently if one is trying to query ORC tables in Hive, the plan generated by 
> Spark shows that it is using the `HiveTableScan` operator which is generic to 
> all file formats. We could instead use the ORC data source for this so that 
> we can get ORC-specific optimizations like predicate pushdown.
> Current behaviour:
> ```
> scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias(*, None)]
> +- 'UnresolvedRelation `orc_table`, None
> == Analyzed Logical Plan ==
> key: string, value: string
> Project [key#171,value#172]
> +- MetastoreRelation default, orc_table, None
> == Optimized Logical Plan ==
> MetastoreRelation default, orc_table, None
> == Physical Plan ==
> HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14070) Use ORC data source for SQL queries on ORC tables

2016-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14070:
-
Shepherd: Michael Armbrust

> Use ORC data source for SQL queries on ORC tables
> -
>
> Key: SPARK-14070
> URL: https://issues.apache.org/jira/browse/SPARK-14070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> Currently if one is trying to query ORC tables in Hive, the plan generated by 
> Spark shows that it is using the `HiveTableScan` operator which is generic to 
> all file formats. We could instead use the ORC data source for this so that 
> we can get ORC-specific optimizations like predicate pushdown.
> Current behaviour:
> ```
> scala>  hqlContext.sql("SELECT * FROM orc_table").explain(true)
> == Parsed Logical Plan ==
> 'Project [unresolvedalias(*, None)]
> +- 'UnresolvedRelation `orc_table`, None
> == Analyzed Logical Plan ==
> key: string, value: string
> Project [key#171,value#172]
> +- MetastoreRelation default, orc_table, None
> == Optimized Logical Plan ==
> MetastoreRelation default, orc_table, None
> == Physical Plan ==
> HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13985) WAL for determistic batches with IDs

2016-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13985.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11804
[https://github.com/apache/spark/pull/11804]

> WAL for determistic batches with IDs
> 
>
> Key: SPARK-13985
> URL: https://issues.apache.org/jira/browse/SPARK-13985
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> We can simplify the sink such that it only needs to provide idempotence 
> instead of atomicity by ensuring batches are deterministically computed.  
> Let's do this using a WAL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14029) Improve BooleanSimplification optimization by implementing `Not` canonicalization

2016-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14029:
-
Assignee: Dongjoon Hyun

> Improve BooleanSimplification optimization by implementing `Not` 
> canonicalization
> -
>
> Key: SPARK-14029
> URL: https://issues.apache.org/jira/browse/SPARK-14029
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, *BooleanSimplification* optimization can handle the following 
> cases.
> * a && (!'a || 'b ) ==> 'a && 'b
> * a && ('b || !'a ) ==> 'a && 'b
> However, it cannot handle the following cases since those equations fail 
> at the comparison between their canonicalized forms.
> * a < 1 && (!('a < 1) || 'b) ==> ('a < 1) && 'b
> * a <= 1 && (!('a <= 1) || 'b) ==> ('a <= 1) && 'b
> * a > 1 && (!('a > 1) || 'b) ==> ('a > 1) && 'b
> * a >= 1 && (!('a >= 1) || 'b) ==> ('a >= 1) && 'b
> This issue aims to handle the above cases and extend toward the following 
> cases too.
> * a < 1 && ('a >= 1 || 'b )   ==> ('a < 1) && 'b
> * a <= 1 && ('a > 1 || 'b )   ==> ('a <= 1) && 'b
> * a > 1 && (('a <= 1) || 'b)  ==> ('a > 1) && 'b
> * a >= 1 && (('a < 1) || 'b)  ==> ('a >= 1) && 'b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14029) Improve BooleanSimplification optimization by implementing `Not` canonicalization

2016-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-14029.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11851
[https://github.com/apache/spark/pull/11851]

> Improve BooleanSimplification optimization by implementing `Not` 
> canonicalization
> -
>
> Key: SPARK-14029
> URL: https://issues.apache.org/jira/browse/SPARK-14029
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, *BooleanSimplification* optimization can handle the following 
> cases.
> * a && (!'a || 'b ) ==> 'a && 'b
> * a && ('b || !'a ) ==> 'a && 'b
> However, it cannot handle the following cases since those equations fail 
> at the comparison between their canonicalized forms.
> * a < 1 && (!('a < 1) || 'b) ==> ('a < 1) && 'b
> * a <= 1 && (!('a <= 1) || 'b) ==> ('a <= 1) && 'b
> * a > 1 && (!('a > 1) || 'b) ==> ('a > 1) && 'b
> * a >= 1 && (!('a >= 1) || 'b) ==> ('a >= 1) && 'b
> This issue aims to handle the above cases and extend toward the following 
> cases too.
> * a < 1 && ('a >= 1 || 'b )   ==> ('a < 1) && 'b
> * a <= 1 && ('a > 1 || 'b )   ==> ('a <= 1) && 'b
> * a > 1 && (('a <= 1) || 'b)  ==> ('a > 1) && 'b
> * a >= 1 && (('a < 1) || 'b)  ==> ('a >= 1) && 'b



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13883) buildReader implementation for parquet

2016-03-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13883.
--
Resolution: Fixed

Issue resolved by pull request 11709
[https://github.com/apache/spark/pull/11709]

> buildReader implementation for parquet
> --
>
> Key: SPARK-13883
> URL: https://issues.apache.org/jira/browse/SPARK-13883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> Port parquet to the new strategy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Michael Armbrust
>
> But when tired using Spark streamng I could not find a way to store the
> data with the avro schema information. The closest that I got was to create
> a Dataframe using the json RDDs and store them as parquet. Here the parquet
> files had a spark specific schema in their footer.
>

Does this cause a problem?  This is just extra information that we use to
store metadata that parquet doesn't directly support, but I would still
expect other systems to be able to read it.


Re: Spark SQL Optimization

2016-03-21 Thread Michael Armbrust
It's helpful if you can include the output of EXPLAIN EXTENDED or
df.explain(true) whenever asking about query performance.
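
For example, a minimal sketch using the query from the message below:

val q = sqlContext.sql("""
  SELECT DISTINCT pge.portfolio_code
  FROM table1 pge
  JOIN table2 p   ON p.perm_group = pge.anc_port_group
  JOIN table3 uge ON p.user_group = uge.anc_user_group
  WHERE uge.user_name = 'user' AND p.perm_type = 'TEST'
""")
q.explain(true)  // DataFrame equivalent of EXPLAIN EXTENDED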

On Mon, Mar 21, 2016 at 6:27 AM, gtinside  wrote:

> Hi ,
>
> I am trying to execute a simple query with a join on 3 tables. When I look at
> the execution plan, it varies with the position of the tables in the "from" clause.
> The execution plan looks more optimized when the table with
> predicates is specified before any other table.
>
>
> Original query :
>
> select distinct pge.portfolio_code
> from table1 pge join table2 p
> on p.perm_group = pge.anc_port_group
> join table3 uge
> on p.user_group=uge.anc_user_group
> where uge.user_name = 'user' and p.perm_type = 'TEST'
>
> Optimized query (table with predicates is moved ahead):
>
> select distinct pge.portfolio_code
> from table1 uge, table2 p, table3 pge
> where uge.user_name = 'user' and p.perm_type = 'TEST'
> and p.perm_group = pge.anc_port_group
> and p.user_group=uge.anc_user_group
>
>
> Execution plan is more optimized for the optimized query and hence the
> query
> executes faster. All the tables are being sourced from parquet files
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Subquery performance

2016-03-20 Thread Michael Armbrust
If you encode the data in something like parquet we usually have more
information and will try to broadcast.
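
If you want to force it without persisting, a rough sketch (Spark 1.5+; table
and column names follow the quoted query below):

import org.apache.spark.sql.functions.broadcast

val tab1 = sqlContext.table("tab1")
val sub  = sqlContext.sql("SELECT col2, count(1) AS cnt FROM tab2 GROUP BY col2")

// The broadcast() hint tells the planner this side is small enough to ship
// to every executor, avoiding the large shuffle.
val result = tab1.join(broadcast(sub), tab1("col1") === sub("col2"))
  .groupBy("col1")
  .count()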

On Thu, Mar 17, 2016 at 7:27 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:

> Anyways to cache the subquery or force a broadcast join without persisting
> it?
>
>
>
> y
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* March-17-16 8:59 PM
> *To:* Younes Naguib
> *Cc:* user@spark.apache.org
> *Subject:* Re: Subquery performance
>
>
>
> Try running EXPLAIN on both versions of the query.
>
>
>
> Likely when you cache the subquery we know that it's going to be small, so we
> use a broadcast join instead of shuffling the data.
>
>
>
> On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <
> younes.nag...@tritondigital.com> wrote:
>
> Hi all,
>
>
>
> I’m running a query that looks like the following:
>
> Select col1, count(1)
>
> From (Select col2, count(1) from tab2 group by col2)
>
> Inner join tab1 on (col1=col2)
>
> Group by col1
>
>
>
> This creates a very large shuffle, 10 times the data size, as if the
> subquery was executed for each row.
>
> Is there anything that can be done to help tune this?
>
> When the subquery in persisted, it runs much faster, and the shuffle is 50
> times smaller!
>
>
>
> *Thanks,*
>
> *Younes*
>
>
>


Re: Subquery performance

2016-03-19 Thread Michael Armbrust
Try running EXPLAIN on both versions of the query.

Likely when you cache the subquery we know that it's going to be small, so we
use a broadcast join instead of shuffling the data.

On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:

> Hi all,
>
>
>
> I’m running a query that looks like the following:
>
> Select col1, count(1)
>
> From (Select col2, count(1) from tab2 group by col2)
>
> Inner join tab1 on (col1=col2)
>
> Group by col1
>
>
>
> This creates a very large shuffle, 10 times the data size, as if the
> subquery was executed for each row.
>
> Is there anything that can be done to help tune this?
>
> When the subquery in persisted, it runs much faster, and the shuffle is 50
> times smaller!
>
>
>
> *Thanks,*
>
> *Younes*
>


[jira] [Created] (SPARK-13985) WAL for determistic batches with IDs

2016-03-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13985:


 Summary: WAL for determistic batches with IDs
 Key: SPARK-13985
 URL: https://issues.apache.org/jira/browse/SPARK-13985
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


We can simplify the sink such that it only needs to provide idempotence instead 
of atomicity by ensuring batches are deterministically computed.  Let's do this 
using a WAL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13427) Support USING clause in JOIN

2016-03-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13427.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11297
[https://github.com/apache/spark/pull/11297]

> Support USING clause in JOIN
> 
>
> Key: SPARK-13427
> URL: https://issues.apache.org/jira/browse/SPARK-13427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
> Fix For: 2.0.0
>
>
> Support queries that JOIN tables with USING clause.
> SELECT * from table1 JOIN table2 USING 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13945) Enable native view flag by default

2016-03-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13945:
-
Target Version/s: 2.0.0

> Enable native view flag by default
> --
>
> Key: SPARK-13945
> URL: https://issues.apache.org/jira/browse/SPARK-13945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> We need to change the default value of {{spark.sql.nativeView}} to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Michael Armbrust
Patrick reuploaded the artifacts, so it should be fixed now.
On Mar 16, 2016 5:48 PM, "Nicholas Chammas" 
wrote:

> Looks like the other packages may also be corrupt. I’m getting the same
> error for the Spark 1.6.1 / Hadoop 2.4 package.
>
>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>
> Nick
> ​
>
> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu  wrote:
>
>> On Linux, I got:
>>
>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>
>> gzip: stdin: unexpected end of file
>> tar: Unexpected EOF in archive
>> tar: Unexpected EOF in archive
>> tar: Error is not recoverable: exiting now
>>
>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> Does anyone else have trouble unzipping this? How did this happen?
>>>
>>> What I get is:
>>>
>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>
>>> Seems like a strange type of problem to come across.
>>>
>>> Nick
>>> ​
>>>
>>
>>


[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM

2016-03-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195967#comment-15195967
 ] 

Michael Armbrust commented on SPARK-12546:
--

There is no partitioning in that example, so it is not the same problem.  Also, the 
stacktrace is in Zeppelin code, not Spark.

> Writing to partitioned parquet table can fail with OOM
> --
>
> Key: SPARK-12546
> URL: https://issues.apache.org/jira/browse/SPARK-12546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>    Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.1, 2.0.0
>
>
> It is possible to have jobs fail with OOM when writing to a partitioned 
> parquet table. While this was probably always possible, it is more likely in 
> 1.6 due to the memory manager changes. The unified memory manager enables 
> Spark to use more of the process memory (in particular, for execution) which 
> gets us in this state more often. This issue can happen for libraries that 
> consume a lot of memory, such as parquet. Prior to 1.6, these libraries would 
> more likely use memory that spark was not using (i.e. from the storage pool). 
> In 1.6, this storage memory can now be used for execution.
> There are a couple of configs that can help with this issue.
>   - parquet.memory.pool.ratio: This is a parquet config on how much of the 
> heap the parquet writers should use. This defaults to 0.95. Consider a much 
> lower value (e.g. 0.1).
>   - spark.memory.fraction: This is a Spark config to control how much of the 
> memory should be allocated to spark. Consider setting this to 0.6.
> This should cause jobs to potentially spill more but require less memory. 
> More aggressive tuning will control this trade off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13876) Strategy for planning scans of files

2016-03-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13876.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11646


> Strategy for planning scans of files
> 
>
> Key: SPARK-13876
> URL: https://issues.apache.org/jira/browse/SPARK-13876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 2.0.0
>
>
> We should specialize the logic for planning scans over sets of files.  
> Requirements:
>  - remove the need to have RDD, broadcastedHadoopConf and other distributed 
> concerns in the public API of org.apache.spark.sql.sources.FileFormat
>  - Partition column appending should be delegated to the format to avoid an 
> extra copy / devectorization when appending partition columns
>  - Should minimize the amount of data that is shipped to each executor (i.e. 
> it does not send the whole list of files to every worker in the form of a 
> hadoop conf)
>  - should natively support bucketing files into partitions, and thus does not 
> require coalescing / creating a UnionRDD with the correct partitioning.
>  - Small files should be automatically coalesced into fewer tasks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-13664:
--

> Simplify and Speedup HadoopFSRelation
> -
>
> Key: SPARK-13664
> URL: https://issues.apache.org/jira/browse/SPARK-13664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.0.0
>
>
> A majority of Spark SQL queries likely run through {{HadoopFSRelation}}, 
> however there are currently several complexity and performance problems with 
> this code path:
>  - The class mixes the concerns of file management, schema reconciliation, 
> scan building, bucketing, partitioning, and writing data.
>  - For very large tables, we are broadcasting the entire list of files to 
> every executor. [SPARK-11441]
>  - For partitioned tables, we always do an extra projection.  This results 
> not only in a copy, but undoes much of the performance gains that we are 
> going to get from vectorized reads.
> This is an umbrella ticket to track a set of improvements to this codepath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: question about catalyst and TreeNode

2016-03-15 Thread Michael Armbrust
Trees are immutable, and TreeNode takes care of copying unchanged parts of
the tree when you are doing transformations.  As a result, even if you do
construct a DAG with the Dataset API, the first transformation will turn it
back into a tree.

The only exception to this rule is when we share the results of plans after
an Exchange operator.  This is the last step before execution and sometimes
turns the query into a DAG to avoid redundant computation.
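
As a rough illustration against the internal Catalyst API (class names as of
Spark 1.6; not something user code normally touches):

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

// transformDown rebuilds only the nodes the rule changes; untouched subtrees
// are reused as-is, so the immutable tree never needs to become a mutable DAG.
def dropTrueFilters(plan: LogicalPlan): LogicalPlan = plan transformDown {
  case Filter(Literal(true, _), child) => child  // a WHERE TRUE filter is a no-op
}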

On Tue, Mar 15, 2016 at 9:01 AM, Koert Kuipers  wrote:

> i am trying to understand some parts of the catalyst optimizer. but i
> struggle with one bigger picture issue:
>
> LogicalPlan extends TreeNode, which makes sense since the optimizations
> rely on tree transformations like transformUp and transformDown.
>
> but how can a LogicalPlan be a tree? isn't it really a DAG? if it is
> possible to create diamond-like operator dependencies, then assumptions
> made in tree transformations could be wrong? for example pushing a limit
> operator down into a child sounds safe, but if that same child is also used
> by another operator (so it has another parent, no longer a tree) then it's
> not safe at all.
>
> what am i missing here?
> thanks! koert
>


[jira] [Created] (SPARK-13883) buildReader implementation for parquet

2016-03-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13883:


 Summary: buildReader implementation for parquet
 Key: SPARK-13883
 URL: https://issues.apache.org/jira/browse/SPARK-13883
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


Port parquet to the new strategy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13791) Add MetadataLog and HDFSMetadataLog

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13791.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11625
[https://github.com/apache/spark/pull/11625]

> Add MetadataLog and HDFSMetadataLog
> ---
>
> Key: SPARK-13791
> URL: https://issues.apache.org/jira/browse/SPARK-13791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> - Add a MetadataLog interface for  metadata reliably storage.
> - Add HDFSMetadataLog as a MetadataLog implementation based on HDFS. 
> - Update FileStreamSource to use HDFSMetadataLog instead of managing metadata 
> by itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10380) Confusing examples in pyspark SQL docs

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10380:
-
Assignee: Reynold Xin

> Confusing examples in pyspark SQL docs
> --
>
> Key: SPARK-10380
> URL: https://issues.apache.org/jira/browse/SPARK-10380
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>    Reporter: Michael Armbrust
>Assignee: Reynold Xin
>Priority: Minor
>  Labels: docs, starter
> Fix For: 2.0.0
>
>
> There’s an error in the astype() documentation, as it uses cast instead of 
> astype. It should probably include a mention that astype is an alias for cast 
> (and vice versa in the cast documentation): 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.astype
>  
> The same error occurs with drop_duplicates and dropDuplicates: 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates
>  
> The issue here is we are copying the code.  According to [~davies] the 
> easiest way is to copy the method and just add new docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10380) Confusing examples in pyspark SQL docs

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10380.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11698
[https://github.com/apache/spark/pull/11698]

> Confusing examples in pyspark SQL docs
> --
>
> Key: SPARK-10380
> URL: https://issues.apache.org/jira/browse/SPARK-10380
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>    Reporter: Michael Armbrust
>Priority: Minor
>  Labels: docs, starter
> Fix For: 2.0.0
>
>
> There’s an error in the astype() documentation, as it uses cast instead of 
> astype. It should probably include a mention that astype is an alias for cast 
> (and vice versa in the cast documentation): 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.astype
>  
> The same error occurs with drop_duplicates and dropDuplicates: 
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates
>  
> The issue here is we are copying the code.  According to [~davies] the 
> easiest way is to copy the method and just add new docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13664.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11646
[https://github.com/apache/spark/pull/11646]

> Simplify and Speedup HadoopFSRelation
> -
>
> Key: SPARK-13664
> URL: https://issues.apache.org/jira/browse/SPARK-13664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.0.0
>
>
> A majority of Spark SQL queries likely run through {{HadoopFSRelation}}, 
> however there are currently several complexity and performance problems with 
> this code path:
>  - The class mixes the concerns of file management, schema reconciliation, 
> scan building, bucketing, partitioning, and writing data.
>  - For very large tables, we are broadcasting the entire list of files to 
> every executor. [SPARK-11441]
>  - For partitioned tables, we always do an extra projection.  This results 
> not only in a copy, but undoes much of the performance gains that we are 
> going to get from vectorized reads.
> This is an umbrella ticket to track a set of improvements to this codepath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194436#comment-15194436
 ] 

Michael Armbrust commented on SPARK-13118:
--

It's likely that we have fixed this with other refactorings.  If you add that 
regression test I think we can close this.

> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside of a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}.  However, when 
> reflect on this we try and load {{org.mycompany.project.MyClass}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13531:
-
Target Version/s: 2.0.0

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> i run:
> df1.join(df2, Seq("x")).show
> i get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> not sure what changed, this ran about a week ago without issues (in our 
> internal unit tests). it is fully reproducible, however when i tried to 
> minimize the issue i could not reproduce it by just creating data frames in 
> the repl with the same contents, so it probably has something to do with way 
> these are created (from Row objects and StructTypes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13876) Strategy for planning scans of files

2016-03-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13876:


 Summary: Strategy for planning scans of files
 Key: SPARK-13876
 URL: https://issues.apache.org/jira/browse/SPARK-13876
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical


We should specialize the logic for planning scans over sets of files.  
Requirements:
 - remove the need to have RDD, broadcastedHadoopConf and other distributed 
concerns in the public API of org.apache.spark.sql.sources.FileFormat
 - Partition column appending should be delegated to the format to avoid an 
extra copy / devectorization when appending partition columns
 - Should minimize the amount of data that is shipped to each executor (i.e. it 
does not send the whole list of files to every worker in the form of a hadoop 
conf)
 - should natively support bucketing files into partitions, and thus does not 
require coalescing / creating a UnionRDD with the correct partitioning.
 - Small files should be automatically coalesced into fewer tasks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
On Mon, Mar 14, 2016 at 1:30 PM, Prabhu Joseph 
wrote:
>
> Thanks for the recommendation. But can you share what improvements were
> made after Spark 1.2.1 and how they specifically handle the issue that is
> observed here?
>

Memory used for query execution is now explicitly accounted for:

https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html


[jira] [Updated] (SPARK-13658) BooleanSimplification rule is slow with large boolean expressions

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13658:
-
Assignee: Liang-Chi Hsieh

> BooleanSimplification rule is slow with large boolean expressions
> -
>
> Key: SPARK-13658
> URL: https://issues.apache.org/jira/browse/SPARK-13658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When running TPCDS Q3 [1] with lots of predicates to filter out the partitions, 
> the optimizer rule BooleanSimplification takes about 2 seconds (it uses lots of 
> sematicsEqual, which requires copying the whole tree).
> It would be great if we could speed it up.
> [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
> cc [~marmbrus]
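For context, a hedged sketch (not the actual TPCDS query) of the shape that makes this rule expensive: a filter built from a very large boolean expression tree. The table and column names are placeholders.

{code}
import org.apache.spark.sql.functions.col

// Hypothetical illustration: OR together hundreds of partition predicates.
// The resulting expression tree is what BooleanSimplification repeatedly walks.
val dateDim = sqlContext.table("date_dim")   // placeholder table
val wideOr = (2450000 to 2450500).map(i => col("d_date_sk") === i).reduce(_ || _)
val q = dateDim.filter(wideOr)
q.explain(true)   // most of the planning time is spent in the optimizer here
{code}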



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13658) BooleanSimplification rule is slow with large boolean expressions

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13658.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11647
[https://github.com/apache/spark/pull/11647]

> BooleanSimplification rule is slow with large boolean expressions
> -
>
> Key: SPARK-13658
> URL: https://issues.apache.org/jira/browse/SPARK-13658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> When running TPCDS Q3 [1] with lots of predicates to filter out the partitions, 
> the optimizer rule BooleanSimplification takes about 2 seconds (it uses lots of 
> sematicsEqual, which requires copying the whole tree).
> It would be great if we could speed it up.
> [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
> cc [~marmbrus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Michael Armbrust
+1 to upgrading Spark.  1.2.1 has none of the memory management improvements
that were added in 1.4-1.6.

On Mon, Mar 14, 2016 at 2:03 AM, Prabhu Joseph 
wrote:

> The issue is that the query hits OOM on a stage when reading shuffle output
> from the previous stage. How does increasing shuffle memory help to avoid the OOM?
>
> On Mon, Mar 14, 2016 at 2:28 PM, Sabarish Sasidharan <
> sabarish@gmail.com> wrote:
>
>> That's a pretty old version of Spark SQL. It is devoid of all the
>> improvements introduced in the last few releases.
>>
>> You should try bumping your spark.sql.shuffle.partitions to a value
>> higher than default (5x or 10x). Also increase your shuffle memory fraction
>> as you really are not explicitly caching anything. You could simply swap
>> the fractions in your case.
>>
>> Regards
>> Sab
>>
>> On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> It is a Spark-SQL and the version used is Spark-1.2.1.
>>>
>>> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
 I believe the OP is using Spark SQL and not Hive on Spark.

 Regards
 Sab

 On Mon, Mar 14, 2016 at 1:55 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> I think the only version of Spark that works OK with Hive (Hive on
> Spark engine) is version 1.3.1. I also get OOM from time to time and have
> to revert to using MR
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 March 2016 at 08:06, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Which version of Spark are you using? The configuration varies by
>> version.
>>
>> Regards
>> Sab
>>
>> On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> A Hive join query that runs fine and faster in MapReduce takes a lot
>>> of time with Spark and finally fails with OOM.
>>>
>>> *Query:  hivejoin.py*
>>>
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import HiveContext
>>> conf = SparkConf().setAppName("Hive_Join")
>>> sc = SparkContext(conf=conf)
>>> hiveCtx = HiveContext(sc)
>>> hiveCtx.hql("INSERT OVERWRITE TABLE D select <80 columns> from A a
>>> INNER JOIN B b ON a.item_id = b.item_id LEFT JOIN C c ON c.instance_id =
>>> a.instance_id");
>>> results = hiveCtx.hql("SELECT COUNT(1) FROM D").collect()
>>> print results
>>>
>>>
>>> *Data Study:*
>>>
>>> Number of Rows:
>>>
>>> A table has 1002093508
>>> B table has 5371668
>>> C table has  1000
>>>
>>> No Data Skewness:
>>>
>>> item_id in B is unique and A has multiple rows with same item_id, so
>>> after first INNER_JOIN the result set is same 1002093508 rows
>>>
>>> instance_id in C is unique and A has multiple rows with same
>>> instance_id (maximum count of number of rows with same instance_id is 
>>> 250)
>>>
>>> Spark Job runs with 90 Executors each with 2 cores and 6GB memory.
>>> YARN has allotted all the requested resource immediately and no other 
>>> job
>>> is running on the
>>> cluster.
>>>
>>> spark.storage.memoryFraction 0.6
>>> spark.shuffle.memoryFraction 0.2
>>>
>>> Stage 2 - reads data from Hadoop, Tasks has NODE_LOCAL and shuffle
>>> write 500GB of intermediate data
>>>
>>> Stage 3 - does shuffle read of 500GB data, tasks has PROCESS_LOCAL
>>> and output of 400GB is shuffled
>>>
>>> Stage 4 - tasks fails with OOM on reading the shuffled output data
>>> when it reached 40GB data itself
>>>
>>> First of all, what kinds of Hive queries perform better on Spark than on
>>> MapReduce, and which Hive queries won't perform well in Spark?
>>>
>>> How do we calculate the optimal executor heap size and the number of
>>> executors for a given input data size? We don't ask the Spark executors to
>>> cache any data, so why do the Stage 3 tasks say PROCESS_LOCAL? And why does
>>> Stage 4 fail immediately when it has read just 40GB of data? Is it caching
>>> data in memory?
>>>
>>> And in a Spark job, some stages need a lot of memory for shuffle and some
>>> need a lot of memory for cache. So, when a Spark executor has a lot of
>>> memory available for cache and does not use the cache, but there is a need
>>> to do a lot of shuffle, will executors only use the shuffle fraction which
>>> is set for doing shuffle or w
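For readers landing on this thread later: a minimal sketch of the configuration changes Sabarish suggests above, assuming the legacy (pre-1.6) memory model. The exact values are placeholders to tune, not recommendations.

import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: raise shuffle parallelism well above the default of 200 and,
// since nothing is explicitly cached, swap the storage/shuffle fractions.
val conf = new SparkConf()
  .setAppName("Hive_Join")
  .set("spark.sql.shuffle.partitions", "2000")
  .set("spark.storage.memoryFraction", "0.2")
  .set("spark.shuffle.memoryFraction", "0.6")
val sc = new SparkContext(conf)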

[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-13855:


Assignee: Michael Armbrust

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>    Assignee: Michael Armbrust
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked s3 bucket spark-related-packages and noticed that no spark 1.6.1 
> present



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Michael Armbrust
>
> Each json file is of a single object and has the potential to have
> variance in the schema.
>
How much variance are we talking?  JSON->Parquet is going to do well with
100s of different columns, but at 10,000s many things will probably start
breaking.
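A minimal sketch of the conversion being discussed, assuming one JSON object per file under a hypothetical /data/json directory:

// Spark infers a single merged schema across the JSON files; printing it shows
// how many distinct columns the variance actually produced.
val json = sqlContext.read.json("/data/json/*.json")
json.printSchema()
json.write.parquet("/data/parquet")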


Re: Can someone fix this download URL?

2016-03-14 Thread Michael Armbrust
Yeah, sorry.  I'll make sure this gets fixed.

On Mon, Mar 14, 2016 at 12:48 AM, Sean Owen  wrote:

> Yeah I can't seem to download any of the artifacts via the direct download
> / cloudfront URL. The Apache mirrors are fine, so use those for the moment.
> @marmbrus were you maybe the last to deal with these artifacts during the
> release? I'm not sure where they are or how they get uploaded or I'd look
> deeper.
>
> On Mon, Mar 14, 2016 at 4:22 AM, Akhil Das 
> wrote:
>
>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>
>> [image: Inline image 1]
>>
>> There's a broken link for Spark 1.6.1 prebuilt hadoop 2.6 direct download.
>>
>>
>> Thanks
>> Best Regards
>>
>
>


Re: adding rows to a DataFrame

2016-03-11 Thread Michael Armbrust
Or look at explode on DataFrame
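A hedged sketch of that suggestion, using a hypothetical DataFrame `df` with a string column `tags`; each input row produces one output row per tag, which is one way to add rows without collecting to the driver.

// DataFrame.explode(inputColumn, outputColumn)(f) emits zero or more new rows
// per input row, keeping the original columns alongside the new one.
val expanded = df.explode("tags", "tag") { tags: String => tags.split(",") }
expanded.show()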

On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov 
wrote:

> Hi,
>
> I have a problem that requires me to go through the rows in a DataFrame
> (or possibly through rows in a JSON file) and conditionally add rows
> depending on a value in one of the columns in each existing row. So, for
> example if I have:
>
>
> +---+---+---+
> | _1| _2| _3|
> +---+---+---+
> |ID1|100|1.1|
> |ID2|200|2.2|
> |ID3|300|3.3|
> |ID4|400|4.4|
> +---+---+---+
>
> I need to be able to get:
>
>
> +---+---+---++---+
> | _1| _2| _3|  _4| _5|
> +---+---+---++---+
> |ID1|100|1.1|ID1 add text or d...| 25|
> |id11 ..|21 |
> |id12 ..|22 |
> |ID2|200|2.2|ID2 add text or d...| 50|
> |id21 ..|33 |
> |id22 ..|34 |
> |id23 ..|35 |
> |ID3|300|3.3|ID3 add text or d...| 75|
> |id31 ..|11 |
> |ID4|400|4.4|ID4 add text or d...|100|
> |id41 ..|51 |
> |id42 ..|52 |
> |id43 ..|53 |
> |id44 ..|54 |
> +---+---+---++---+
>
> How can I achieve this in Spark without doing DF.collect(), which will get
> everything to the driver and for a big data set I'll get OOM?
> BTW, I know how to use withColumn() to add new columns to the DataFrame. I
> need to also add new rows.
> Any help will be appreciated.
>
> Thanks,
>
>
> *Stefan Panayotov, PhD **Home*: 610-355-0919
> *Cell*: 610-517-5586
> *email*: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>
>


Re: udf StructField to JSON String

2016-03-11 Thread Michael Armbrust
df.select("event").toJSON
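A minimal sketch of this suggestion: selecting just the nested struct and calling toJSON yields one JSON string per row (an RDD[String] in Spark 1.x), which can then be written out or processed further.

// Each element is the `event` struct rendered as a JSON document.
val eventJson = df.select("event").toJSON
eventJson.take(3).foreach(println)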

On Fri, Mar 11, 2016 at 9:53 AM, Caires Vinicius  wrote:

> Hmm. I think my problem is a little more complex. I'm using
> https://github.com/databricks/spark-redshift and when I read from JSON
> file I got this schema.
>
> root
>
> |-- app: string (nullable = true)
>
>  |-- ct: long (nullable = true)
>
>  |-- event: struct (nullable = true)
>
> ||-- attributes: struct (nullable = true)
>
>  |||-- account: string (nullable = true)
>
>  |||-- accountEmail: string (nullable = true)
>
>  |||-- accountId: string (nullable = true)
>
>
> I want to transform the Column *event* into String (formatted as JSON).
>
> I was trying to use udf but without success.
>
> On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon 
> wrote:
>
>> Have you looked at DataFrame.write.json( path )?
>>
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>
>> > On Mar 11, 2016, at 7:15 AM, Caires Vinicius 
>> wrote:
>> >
>> > I have one DataFrame with nested StructField and I want to convert to
>> JSON String. There is anyway to accomplish this?
>>
>>


[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This
release is based on the branch-1.6 maintenance branch of Spark. We
*strongly recommend* all 1.6.0 users to upgrade to this release.

Notable fixes include:
 - Workaround for OOM when writing large partitioned tables SPARK-12546

 - Several fixes to the experimental Dataset API - SPARK-12478
, SPARK-12696
, SPARK-13101
, SPARK-12932


The full list of bug fixes is here: http://s.apache.org/spark-1.6.1
http://spark.apache.org/releases/spark-release-1-6-1.html

(note: it can take a few hours for everything to be propagated, so you
might get 404 on some download links, but everything should be in maven
central already.  If you see any issues with the release notes or webpage
*please contact me directly, off-list*)


Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications:


> 1) High memory and cpu usage. This is because Parquet files can't be
> streamed into as records arrive. I have seen a lot of OOMs in reasonably
> sized MR/Spark containers that write out Parquet. When doing dynamic
> partitioning, where many writers are open at once, we’ve seen customers
> having trouble making it work. This has made for some very confused ETL
> developers.
>

In Spark 1.6.1 we avoid having more than 2 files open per task, so this
should be less of a problem even for dynamic partitioning.
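For reference, the dynamic-partitioning write under discussion looks roughly like this (hypothetical `events` DataFrame and paths):

// One output directory per distinct value of the `date` partition column.
events.write
  .partitionBy("date")
  .mode("append")
  .parquet("/warehouse/events")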


> 2) Parquet lags well behind Avro in schema evolution semantics. Can only
> add columns at the end? Deleting columns at the end is not recommended if
> you plan to add any columns in the future. Reordering is not supported in
> current release.
>

This may be true for Impala, but Spark SQL does schema merging by name so
you can add / reorder columns with the constraint that you cannot reuse a
name with an incompatible type.
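A short sketch of the by-name Parquet schema merging mentioned above (the path is a placeholder); merging is opt-in via the mergeSchema option.

// Reconciles files written with different but compatible schemas into one schema.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/warehouse/events")
merged.printSchema()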


[RESULT] [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
This vote passes with nine +1s (five binding) and one binding +0!  Thanks
to everyone who tested/voted.  I'll start work on publishing the release
today.

+1:
Mark Hamstra*
Moshe Eshel
Egor Pahomov
Reynold Xin*
Yin Huai*
Andrew Or*
Burak Yavuz
Kousuke Saruta
Michael Armbrust*

0:
Sean Owen*


-1: (none)

*Binding

On Wed, Mar 9, 2016 at 3:29 PM, Michael Armbrust 
wrote:

> +1 - Ported all our internal jobs to run on 1.6.1 with no regressions.
>
> On Wed, Mar 9, 2016 at 7:04 AM, Kousuke Saruta 
> wrote:
>
>> +1 (non-binding)
>>
>>
>> On 2016/03/09 4:28, Burak Yavuz wrote:
>>
>> +1
>>
>> On Tue, Mar 8, 2016 at 10:59 AM, Andrew Or  wrote:
>>
>>> +1
>>>
>>> 2016-03-08 10:59 GMT-08:00 Yin Huai < 
>>> yh...@databricks.com>:
>>>
>>>> +1
>>>>
>>>> On Mon, Mar 7, 2016 at 12:39 PM, Reynold Xin < 
>>>> r...@databricks.com> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>>
>>>>> On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov <
>>>>> pahomov.e...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Spark ODBC server is fine, SQL is fine.
>>>>>>
>>>>>> 2016-03-03 12:09 GMT-08:00 Yin Yang < 
>>>>>> yy201...@gmail.com>:
>>>>>>
>>>>>>> Skipping docker tests, the rest are green:
>>>>>>>
>>>>>>> [INFO] Spark Project External Kafka ... SUCCESS
>>>>>>> [01:28 min]
>>>>>>> [INFO] Spark Project Examples . SUCCESS
>>>>>>> [02:59 min]
>>>>>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS
>>>>>>> [ 11.680 s]
>>>>>>> [INFO]
>>>>>>> 
>>>>>>> [INFO] BUILD SUCCESS
>>>>>>> [INFO]
>>>>>>> 
>>>>>>> [INFO] Total time: 02:16 h
>>>>>>> [INFO] Finished at: 2016-03-03T11:17:07-08:00
>>>>>>> [INFO] Final Memory: 152M/4062M
>>>>>>>
>>>>>>> On Thu, Mar 3, 2016 at 8:55 AM, Yin Yang < 
>>>>>>> yy201...@gmail.com> wrote:
>>>>>>>
>>>>>>>> When I ran test suite using the following command:
>>>>>>>>
>>>>>>>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>>>>>>>> -Dhadoop.version=2.7.0 package
>>>>>>>>
>>>>>>>> I got failure in Spark Project Docker Integration Tests :
>>>>>>>>
>>>>>>>> 16/03/02 17:36:46 INFO RemoteActorRefProvider$RemotingTerminator:
>>>>>>>> Remote daemon shut down; proceeding with flushing remote transports.
>>>>>>>> ^[[31m*** RUN ABORTED ***^[[0m
>>>>>>>> ^[[31m  com.spotify.docker.client.DockerException:
>>>>>>>> java.util.concurrent.ExecutionException:
>>>>>>>> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>>>>>>>> java.io.IOException: No such file or directory^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)^[[0m
>>>>>>>> ^[[31m  at
>>>>>>>> org.scalatest.BeforeA

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-09 Thread Michael Armbrust
ion: No such file or directory^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)^[[0m
>>>>>>> ^[[31m  ...^[[0m
>>>>>>> ^[[31m  Cause:
>>>>>>> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException:
>>>>>>> java.io.IOException: No such file or directory^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:262)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:487)^[[0m
>>>>>>> ^[[31m  at
>>>>>>> org.glassfish.jersey.client.ClientRuntime$2.run(ClientRuntime.java:177)^[[0m
>>>>>>> ^[[31m  ...^[[0m
>>>>>>> ^[[31m  Cause: java.io.IOException: No such file or directory^[[0m
>>>>>>> ^[[31m  at
>>>>>>> jnr.unixsocket.UnixSocketChannel.doConnect(UnixSocketChannel.java:94)^[[0m
>>>>>>>
>>>>>>> Has anyone seen the above ?
>>>>>>>
>>>>>>> On Wed, Mar 2, 2016 at 2:45 PM, Michael Armbrust <
>>>>>>> mich...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 1.6.1!
>>>>>>>>
>>>>>>>> The vote is open until Saturday, March 5, 2016 at 20:00 UTC and
>>>>>>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 1.6.1
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> <http://spark.apache.org/>http://spark.apache.org/
>>>>>>>>
>>>>>>>> The tag to be voted on is *v1.6.1-rc1
>>>>>>>> (15de51c238a7340fa81cb0b80d029a05d97bfc5c)
>>>>>>>> <https://github.com/apache/spark/tree/v1.6.1-rc1>*
>>>>>>>>
>>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>>> at:
>>>>>>>>
>>>>>>>> <https://home.apache.org/%7Epwendell/spark-releases/spark-1.6.1-rc1-bin/>
>>>>>>>> https://home.apache.org/~pwendell/spark-releases/spark-1.6.1-rc1-bin/
>>>>>>>>
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> <https://people.apache.org/keys/committer/pwendell.asc>
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>
>>>>>>>> <https://repository.apache.org/content/repositories/orgapachespark-1180/>
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1180/
>>>>>>>>
>>>>>>>> The test repository (versioned as v1.6.1-rc1) for this release can
>>>>>>>> be found at:
>>>>>>>>
>>>>>>>> <https://repository.apache.org/content/repositories/orgapachespark-1179/>
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1179/
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>
>>>>>>>> <https://home.apache.org/%7Epwendell/spark-releases/spark-1.6.1-rc1-docs/>
>>>>>>>> https://home.apache.org/~pwendell/spark-releases/spark-1.6.1-rc1-docs/
>>>>>>>>
>>>>>>>>
>>>>>>>> ===
>>>>>>>> == How can I help test this release? ==
>>>>>>>> ===
>>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>>> taking an existing Spark workload and running on this release 
>>>>>>>> candidate,
>>>>>>>> then reporting any regressions from 1.6.0.
>>>>>>>>
>>>>>>>> 
>>>>>>>> == What justifies a -1 vote for this release? ==
>>>>>>>> 
>>>>>>>> This is a maintenance release in the 1.6.x series.  Bugs already
>>>>>>>> present in 1.6.0, missing features, or bugs related to new features 
>>>>>>>> will
>>>>>>>> not necessarily block this release.
>>>>>>>>
>>>>>>>> ===
>>>>>>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>>>>>>> ===
>>>>>>>> 1. It is OK for documentation patches to target 1.6.1 and still go
>>>>>>>> into branch-1.6, since documentations will be published separately 
>>>>>>>> from the
>>>>>>>> release.
>>>>>>>> 2. New features for non-alpha-modules should target 1.7+.
>>>>>>>> 3. Non-blocker bug fixes should target 1.6.2 or 2.0.0, or drop the
>>>>>>>> target version.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> *Sincerely yours Egor Pakhomov *
>>>>>
>>>>
>>>>
>>>
>>
>
>


[jira] [Resolved] (SPARK-13781) Use ExpressionSets in ConstraintPropagationSuite

2016-03-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13781.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11611
[https://github.com/apache/spark/pull/11611]

> Use ExpressionSets in ConstraintPropagationSuite
> 
>
> Key: SPARK-13781
> URL: https://issues.apache.org/jira/browse/SPARK-13781
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Small follow up on https://issues.apache.org/jira/browse/SPARK-13092 to use 
> ExpressionSets as part of the verification logic in 
> ConstraintPropagationSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13527) Prune Filters based on Constraints

2016-03-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13527:
-
Assignee: Xiao Li

> Prune Filters based on Constraints
> --
>
> Key: SPARK-13527
> URL: https://issues.apache.org/jira/browse/SPARK-13527
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Remove all the deterministic conditions in a [[Filter]] that are contained in 
> the Child.
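A hedged illustration (not taken from the pull request) of the kind of redundant condition this rule can remove: the second filter repeats a condition already guaranteed by its child, so the optimized plan should end up with a single Filter.

{code}
import sqlContext.implicits._

// Hypothetical example: the second `a > 10` adds nothing and can be pruned.
val df = Seq(1, 5, 20, 50).toDF("a")
val pruned = df.filter($"a" > 10).filter($"a" > 10)
pruned.explain(true)
{code}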



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13527) Prune Filters based on Constraints

2016-03-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13527.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11406
[https://github.com/apache/spark/pull/11406]

> Prune Filters based on Constraints
> --
>
> Key: SPARK-13527
> URL: https://issues.apache.org/jira/browse/SPARK-13527
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> Remove all the deterministic conditions in a [[Filter]] that are contained in 
> the Child.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13728) Fix ORC PPD

2016-03-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13728.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11593
[https://github.com/apache/spark/pull/11593]

> Fix ORC PPD
> ---
>
> Key: SPARK-13728
> URL: https://issues.apache.org/jira/browse/SPARK-13728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> Fix the ignored test "Enable ORC PPD" in OrcQuerySuite.
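Side note for anyone trying this out (an assumption about configuration, not part of the fix itself): ORC predicate pushdown is gated behind a flag that is off by default in the 1.x line, and the table name below is a placeholder.

{code}
// Enable ORC predicate pushdown, then confirm via the physical plan that the
// filter is pushed into the scan.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sqlContext.sql("SELECT * FROM orc_table WHERE id > 100").explain()
{code}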



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-09 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187626#comment-15187626
 ] 

Michael Armbrust commented on SPARK-13393:
--

No user is going to write {{df("a") == df("a")}} when they actually mean 
{{true}} and are looking for a Cartesian product, so I don't think the current 
behavior is unreasonable.  After we added that rewrite, the amount of confusion 
on the mailing list seemed to go down, and I have yet to have a user complain 
that they aren't getting a Cartesian product.

I believe that aliasing is the only reasonable solution here, and I've proposed 
that we eliminate the use of {{df("columnName")}} in most of our documentation 
because it gives the illusion of us supporting something that we fundamentally 
cannot.
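A hedged sketch of the aliasing workaround referred to above, applied to the example from the description (column names follow the report, the alias names are arbitrary):

{code}
import org.apache.spark.sql.functions.col

// Give each side of the self-join its own alias and resolve columns through the
// aliases instead of through df("...").
val left  = df.as("l")
val right = df.filter("id = 1").as("r")
val joined = left.join(right, col("l.id") === col("r.id"), "left_outer")
  .select(col("l.id"), col("r.id").as("varadha_id"))
joined.show()
{code}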

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13763) Remove Project when its projectList is Empty

2016-03-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13763.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11599
[https://github.com/apache/spark/pull/11599]

> Remove Project when its projectList is Empty
> 
>
> Key: SPARK-13763
> URL: https://issues.apache.org/jira/browse/SPARK-13763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> We use 'SELECT 1' as a dummy table in SQL statements where a table reference 
> is required but the contents of the table are not important. For example, 
> {code}
> SELECT pageid, adid FROM (SELECT 1) dummyTable LATERAL VIEW 
> explode(adid_list) adTable AS adid;
> {code}
> In this case, we will see a useless Project whose projectList is empty after 
> executing the ColumnPruning rule. 
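A hedged way to see the effect, adapted from the query above so it is self-contained (requires a HiveContext, here called hiveCtx, for LATERAL VIEW): inspect the optimized plan with explain(true) before and after the change.

{code}
// The optimized plan for this query used to contain a Project with an empty
// project list; explain(true) prints all plan phases for inspection.
val q = hiveCtx.sql(
  """SELECT adid
    |FROM (SELECT 1) dummyTable
    |LATERAL VIEW explode(array(1, 2, 3)) adTable AS adid""".stripMargin)
q.explain(true)
{code}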



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13754) Keep old data source name for backwards compatibility

2016-03-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13754:
-
Assignee: Hossein Falaki

> Keep old data source name for backwards compatibility
> -
>
> Key: SPARK-13754
> URL: https://issues.apache.org/jira/browse/SPARK-13754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>
> This data source was contributed by Databricks. It is the inlined version of 
> https://github.com/databricks/spark-csv. The data source name was 
> `com.databricks.spark.csv`. As a result there are many tables created on 
> older versions of spark with that name as the source. For backwards 
> compatibility we should keep the old name.
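In practice this means both spellings should keep working; a minimal sketch (paths and options are placeholders):

{code}
// The old Databricks package name resolves to the built-in CSV source, so
// existing jobs and table definitions should keep working unchanged.
val oldName = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/people.csv")

// Equivalent with the built-in short name in Spark 2.0:
val newName = sqlContext.read
  .format("csv")
  .option("header", "true")
  .load("/data/people.csv")
{code}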



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13754) Keep old data source name for backwards compatibility

2016-03-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13754.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11589
[https://github.com/apache/spark/pull/11589]

> Keep old data source name for backwards compatibility
> -
>
> Key: SPARK-13754
> URL: https://issues.apache.org/jira/browse/SPARK-13754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
> Fix For: 2.0.0
>
>
> This data source was contributed by Databricks. It is the inlined version of 
> https://github.com/databricks/spark-csv. The data source name was 
> `com.databricks.spark.csv`. As a result there are many tables created on 
> older versions of spark with that name as the source. For backwards 
> compatibility we should keep the old name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13750.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11590
[https://github.com/apache/spark/pull/11590]

> Fix sizeInBytes for HadoopFSRelation
> 
>
> Key: SPARK-13750
> URL: https://issues.apache.org/jira/browse/SPARK-13750
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> [~davies] reports that {{sizeInBytes}} isn't correct anymore.  We should fix 
> that and make sure there is a test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13750:


 Summary: Fix sizeInBytes for HadoopFSRelation
 Key: SPARK-13750
 URL: https://issues.apache.org/jira/browse/SPARK-13750
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker


[~davies] reports that {{sizeInBytes}} isn't correct anymore.  We should fix 
that and make sure there is a test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from
an end user perspective.  In particular the only sink that is available in
apache/master is a testing sink that just stores the data in memory.  We
are working on a parquet-based file sink and will eventually support all
of the Data Source API file formats (text, json, csv, orc, parquet).
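For anyone reading this later: not the March 2016 development snapshot discussed in this thread, but a sketch of the shape the API took when structured streaming shipped in Spark 2.0. `spark` is the 2.0 SparkSession entry point; `eventSchema`, the paths and the checkpoint location are placeholders.

// Read a directory of JSON files as a stream (file sources need an explicit
// schema) and continuously write Parquet with a checkpoint for recovery.
val input = spark.readStream
  .schema(eventSchema)
  .json("/incoming/json")

val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/events")
  .start("/warehouse/events")

query.awaitTermination()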

On Tue, Mar 8, 2016 at 7:38 AM, Jacek Laskowski  wrote:

> Hi Praveen,
>
> I don't really know. I think TD or Michael should know as they are
> personally involved in the task (as far as I could figure it out from
> the JIRA and the changes). Ping people on the JIRA so they notice your
> question(s).
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 12:32 PM, Praveen Devarao 
> wrote:
> > Thanks Jacek for the pointer.
> >
> > Any idea which package can be used in .format(). The test cases seem to
> work
> > out of the DefaultSource class defined within the
> DataFrameReaderWriterSuite
> > [org.apache.spark.sql.streaming.test.DefaultSource]
> >
> > Thanking You
> >
> -
> > Praveen Devarao
> > Spark Technology Centre
> > IBM India Software Labs
> >
> -
> > "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> > end of the day saying I will try again"
> >
> >
> >
> > From:Jacek Laskowski 
> > To:Praveen Devarao/India/IBM@IBMIN
> > Cc:user , dev 
> > Date:08/03/2016 04:17 pm
> > Subject:Re: Spark structured streaming
> > 
> >
> >
> >
> > Hi Praveen,
> >
> > I've spent a few hours on the changes related to streaming dataframes
> > (included in the SPARK-8360) and concluded that it's currently only
> > possible to read.stream(), but not write.stream() since there are no
> > streaming Sinks yet.
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Tue, Mar 8, 2016 at 10:38 AM, Praveen Devarao 
> > wrote:
> >> Hi,
> >>
> >> I would like to get my hands on the structured streaming feature
> >> coming out in Spark 2.0. I have tried looking around for code samples to
> >> get
> >> started but am not able to find any. Only a few things I could look into
> >> are the test cases that have been committed under the JIRA umbrella
> >> https://issues.apache.org/jira/browse/SPARK-8360 but the test cases don't
> >> lead to building example code, as they seem to be working out of internal
> >> classes.
> >>
> >> Could anyone point me to some resources or pointers in code
> that I
> >> can start with to understand structured streaming from a consumability
> >> angle.
> >>
> >> Thanking You
> >>
> >>
> -
> >> Praveen Devarao
> >> Spark Technology Centre
> >> IBM India Software Labs
> >>
> >>
> -
> >> "Courage doesn't always roar. Sometimes courage is the quiet voice at
> the
> >> end of the day saying I will try again"
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

