[GitHub] spark pull request: [SPARK-1266] persist factors in implicit ALS

2014-03-18 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/165#discussion_r10714229 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala --- @@ -187,21 +189,39 @@ class ALS private

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-08 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100229300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala --- @@ -0,0 +1,76 @@ +/* + * Licensed to the

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-08 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100229358 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-09 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100360523 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the

[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...

2017-02-09 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16785#discussion_r100364260 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -314,19 +322,29 @@ abstract class UnaryNode

[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-09 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16775 @viirya I believe this PR meshes with the refactoring and application to pregel GraphX algorithms in #15125. Basically, it moves the periodic checkpointing code from mllib into core and uses it in

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 >> Like you said, users can still create a hive table with mixed-case-schema parquet/orc files, by hive or other systems like presto. This table is readable for hive, and for Spark prior

[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 BTW @budde, given that this represents a regression in behavior from previous versions of Spark, I think it is too generous of you to label the Jira issue as an "improvement" instead of

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100608839 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" bu

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100609529 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" bu

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100612840 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" bu

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100631975 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointerSuite.scala --- @@ -21,6 +21,7 @@ import org.apache.hadoop.fs.Path

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100632148 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointerSuite.scala --- @@ -23,7 +23,7 @@ import org.apache.spark.{SparkContext

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100638130 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala --- @@ -76,7 +77,7 @@ import

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100638292 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" bu

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100640256 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" bu

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100641170 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala --- @@ -87,10 +88,7 @@ private[mllib] class

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-10 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @viirya @dding3 I'm going to rerun our big connected components computation with the changes I've suggested to validate that it still performs and completes as expected. Given the time r

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @dding3 These latest changes look great. I'll run our big connected components job today and report back. --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 Our connected components computation completed successfully, with performance as expected. I've created a PR against @dding3's PR branch to incorporate a couple simple things. Then I t

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @viirya I've added a commit to address some of your feedback. I will have another commit to address the others, but I'm not sure when I'll have it in. Hopefully by the end of next

[GitHub] spark issue #16942: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16942 Force pushing your branch shouldn't close the PR. You didn't close it manually?

[GitHub] spark issue #16942: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16942 Weird. I think I've seen that behavior once before. But I think the only time I force push on a PR is to rebase. Maybe that's the only kind of force push allowed for Github PRs.

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101592331 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101602014 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101602604 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 > I think @mallman is saying he would merge changes to @dding3 branch Yes, or I could do them in a follow up PR. Or @dding3 could do them without my PR. I'm not hung up on gettin

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @dding3 I submitted a PR against your `cp2_pregel` branch. If you merge that PR into your branch, it will be reflected in this PR. This is my PR: https://github.com/dding3/spark/pull/1.

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 LGTM. @felixcheung are we good to merge?

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675576 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675669 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101809099 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101809872 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101818789 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala --- @@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101819321 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala --- @@ -87,10 +87,10 @@ private[mllib] class

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102053462 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -154,7 +169,9 @@ object Pregel extends Logging { // count the

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102056438 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additio

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102057156 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additio

[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-20 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @dding3, thank you for your continued patience and dedication to this PR, despite the continued change requests. We are getting closer to a merge.

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102271763 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102272981 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102290972 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additio

[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102292537 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -122,27 +125,39 @@ object Pregel extends Logging { require

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102293219 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -843,7 +852,15 @@ private[spark] class BlockManager

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102293681 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -317,6 +317,9 @@ private[spark] class BlockManager

[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...

2017-01-04 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15480 Hi @lw-lin. Just FYI we use this patch at VideoAmp and would love to see it merged in. I notice this PR has gone a little cold. I'm sorry I can't offer much concrete help, but I wante

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-01-07 Thread mallman
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/16499 [SPARK-17204][CORE] Fix replicated off heap storage (Jira: https://issues.apache.org/jira/browse/SPARK-17204) ## What changes were proposed in this pull request? There are a

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-01-07 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r95066296 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -387,12 +388,23 @@ class BlockManagerReplicationSuite

[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-01-07 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r95066452 --- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala --- @@ -375,7 +375,8 @@ class BlockManagerReplicationSuite

[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-09 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95206030 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -392,7 +392,9 @@ case class InsertIntoHiveTable

[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location

2017-01-09 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16514 > A good suggestion. Will do the code changes tomorrow. Thanks! I look forward to seeing this. Thanks for taking this on.

[GitHub] spark pull request #16514: [SPARK-19128] [SQL] Refresh Cache after Set Locat...

2017-01-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16514#discussion_r95489960 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -119,7 +119,30 @@ private[hive] class HiveMetastoreCatalog

[GitHub] spark pull request #16514: [SPARK-19128] [SQL] Refresh Cache after Set Locat...

2017-01-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16514#discussion_r95490619 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -555,6 +557,61 @@ class HiveDDLSuite

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-13 Thread mallman
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/16578 [SPARK-4502][SQL] Parquet nested column pruning (Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502) ## What changes were proposed in this pull request? One of the

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 cc @rxin @ericl @cloud-fan @marmbrus I would love to get your feedback on this if you have the time.

[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-01-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 @rxin, can you recommend someone I reach out to for help reviewing this PR?

[GitHub] spark issue #16499: [SPARK-17204][CORE] Fix replicated off heap storage

2017-01-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16499 Josh, can you take a look at this when you have a chance?

[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-16 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 > Does this take over #14957? If so, we might need Closes #14957 in the PR description for the merge script to close that one or let the author know this takes over that. I don

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-07 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I've rebased this PR and refactored it somewhat. The main change is to move the partition pruning logic from `FileSourceStrategy` into a Catalyst optimizer rule called `PruneFileSourceParti

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82712965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala --- @@ -477,6 +478,15 @@ class InMemoryCatalog

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82713068 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -225,13 +225,16 @@ case class FileSourceScanExec

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-10 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82713318 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -225,13 +225,16 @@ case class FileSourceScanExec

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I would be wary of amending our data sources to support case-insensitive field resolution. For one thing, strictly speaking it can lead to ambiguity in schema resolution. In the—potential but
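The ambiguity concern in the comment above can be illustrated with a minimal sketch. This is plain Python, not Spark code; the function `resolve_case_insensitive` and the field names are hypothetical and exist only to show how a case-insensitive lookup can match more than one field when a file schema contains names that differ only by case.

```python
def resolve_case_insensitive(field_names, requested):
    """Return every field name that equals `requested` when case is ignored."""
    wanted = requested.lower()
    return [name for name in field_names if name.lower() == wanted]


# Hypothetical file schema containing two fields that differ only by case.
file_schema = ["userId", "userid", "eventTime"]

# A case-sensitive lookup is unambiguous: at most one exact match.
exact = [name for name in file_schema if name == "userId"]

# A case-insensitive lookup can return several candidates, so the reader
# would need an extra rule to decide which physical column to use.
candidates = resolve_case_insensitive(file_schema, "USERID")

if len(candidates) > 1:
    print(f"ambiguous field 'USERID': candidates {candidates}")
```

With a schema like this one, any reader that resolves fields without regard to case has to either reject the file or choose a column by some tie-breaking rule.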

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 BTW, I'm working on a rebase to fix merge conflicts and address reviewers' feedback.

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-11 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82850487 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -225,13 +225,16 @@ case class FileSourceScanExec

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 Ah cripes. I committed something I didn't want to. I'm rebasing again in a few...

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I believe that using a method like `TableFileCatalog.filterPartitions` to build a new file catalog restricted to some pruned partitions is a sound approach, however I'm starting to reconside

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-11 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 >> Finally, this would require us to read the schema files. That's something I'm trying to avoid in this patch. > Not sure what you mean here, but the parquet change

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-11 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82900810 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -225,13 +225,16 @@ case class FileSourceScanExec

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-11 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r82902172 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -225,13 +225,16 @@ case class FileSourceScanExec

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-12 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I updated the description of this PR to reflect the workaround for the Hive/Parquet case-sensitivity issue. Do we need a similar workaround for ORC?

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83036315 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-12 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I'm testing this patch on a couple of tables internally with on the order of 10k partitions. Performance is much slower than it should be. I'm investigating.

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-12 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 > Btw I've noticed a significant performance difference between ListingFileCatalog and TableFileCatalog's implementation of ListFiles. The difference seems to be that Listi

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-12 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 >> Btw I've noticed a significant performance difference between ListingFileCatalog and TableFileCatalog's implementation of ListFiles. The difference seems to be that Li

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-12 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I determined the performance regression was introduced by a commit I hadn't pushed to this PR. Sorry for the false alarm. 😞 Needless to say, I'm not pushing that commit.

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83085625 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TableFileCatalog.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83085289 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -616,6 +617,44 @@ private[spark] class HiveExternalCatalog(conf

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83086945 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83115827 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -199,59 +197,30 @@ private[hive] class HiveMetastoreCatalog

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83131630 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -199,59 +197,30 @@ private[hive] class HiveMetastoreCatalog

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-12 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83141382 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -616,6 +617,44 @@ private[spark] class HiveExternalCatalog(conf

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I've pushed an update to `ParquetMetastoreSuite` that illustrates the bug (or "limitation") WRT support for mixed-case partition columns I discovered yesterday. To reiterate

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 > This patch fails MiMa tests. I've never seen this before. What does this mean? --- If your project is set up for it, you can reply to this email and have your reply appear on G

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 > Btw, I noticed that this suite was failing in jenkins only. > > [info] - partitioned pruned table reports only selected files *** FAILED *** (610 milliseconds) > >

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-13 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83318096 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -616,6 +617,44 @@ private[spark] class HiveExternalCatalog(conf

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-13 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83325088 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-13 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83325529 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -225,13 +225,19 @@ case class FileSourceScanExec

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-13 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 > Oops there is a conflict now. NP. I'm working on the rebase.

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-13 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83349072 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala --- @@ -34,7 +34,7 @@ import org.apache.spark.util.Utils // The data

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-13 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83352979 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -626,6 +627,40 @@ private[spark] class HiveExternalCatalog(conf

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-13 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83355030 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala --- @@ -34,7 +34,7 @@ import org.apache.spark.util.Utils // The data

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I will work on a rebase. Meanwhile, I've revisited the open issues in the PR description. To summarize: 1. Do we need a workaround for ORC like we made for Parquet? 1. What's

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I'm still working on the rebase. It's very complex—there are two other commits involved. >> 1. Do we need a workaround for ORC like we made for Parquet? > 1)

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 I just pushed the rebase. It was really hairy, but I tried hard to ensure I got essentially all three branches' changes in.

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 > btw, what's the parquet log redirection issue? I don't see anything unusual in spark shell. Whenever I run a query on a Hive parquet table I get ``` spark-sq

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 > Hm, I haven't seen that with my test queries. Would adding your workaround to SparkILoopInit work? It does not, unfortunately.

[GitHub] spark issue #14690: [SPARK-16980][SQL] Load only catalog table partition met...

2016-10-14 Thread mallman
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14690 >> Hm, I haven't seen that with my test queries. Would adding your workaround to SparkILoopInit work? > It does not, unfortunately. I believe this impacts people with

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-14 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83517834 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala --- @@ -17,32 +17,26 @@ package

[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-10-14 Thread mallman
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14690#discussion_r83518291 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -626,6 +627,40 @@ private[spark] class HiveExternalCatalog(conf
