Re: pre-filtered hadoop RDD use case

2014-07-29 Thread Reynold Xin
Would something like this help? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PartitionPruningRDD.scala On Thu, Jul 24, 2014 at 8:40 AM, Eugene Cheipesh echeip...@gmail.com wrote: Hello, I have an interesting use case for a pre-filtered RDD. I have

Re: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-29 Thread Nicholas Chammas
- spun up an EC2 cluster successfully using spark-ec2 - tested S3 file access from that cluster successfully +1 On Tue, Jul 29, 2014 at 1:46 AM, Henry Saputra henry.sapu...@gmail.com wrote: NOTICE and LICENSE files look good Hashes and sigs look good No executable in the source

RE: pre-filtered hadoop RDD use case

2014-07-29 Thread Yan Zhou.sc
PartitionPruningRDD.scala still only handles, as said, the partition portion of the issue. On the record pruning portion, although cheap fixes could be available for this issue as reported, I believe a fundamental issue is the lack of a mechanism for processing merging/pushdown. Given the

Re: pre-filtered hadoop RDD use case

2014-07-29 Thread Reynold Xin
I am not sure if I agree that it lacks the mechanism to do pushdowns. Hadoop InputFormat itself provides some basic mechanism to push down predicates already. The HBase InputFormat already implements it. In Spark, you can also run arbitrary user code, and you can decide what to do. You can also

RE: pre-filtered hadoop RDD use case

2014-07-29 Thread Yan Zhou.sc
Hi Reynold, I agree that we should not hurry right now to modify/enhance APIs and could be satisfied with extending existing ones as much as possible. On the other hand, more intelligent data stores like HBase or Cassandra do support complex pushdowns, often more complex than their MR

JIRA content request

2014-07-29 Thread Mark Hamstra
Of late, I've been coming across quite a few pull requests and associated JIRA issues that contain nothing indicating their purpose beyond a pretty minimal description of what the pull request does. On the pull request itself, a reference to the corresponding JIRA in the title combined with a

Re: JIRA content request

2014-07-29 Thread Reynold Xin
+1 on this. On Tue, Jul 29, 2014 at 4:34 PM, Mark Hamstra m...@clearstorydata.com wrote: Of late, I've been coming across quite a few pull requests and associated JIRA issues that contain nothing indicating their purpose beyond a pretty minimal description of what the pull request does. On

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-29 Thread Erik Erlandson
- Original Message - Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but

Re: JIRA content request

2014-07-29 Thread Nicholas Chammas
+1 on using JIRA workflows to manage the backlog, and +9000 on having decent descriptions for all JIRA issues. On Tue, Jul 29, 2014 at 7:48 PM, Sean Owen so...@cloudera.com wrote: How about using a JIRA status like "Documentation Required" to mean burden's on you to elaborate with a motivation

Re: JIRA content request

2014-07-29 Thread Matei Zaharia
I agree as well. FWIW sometimes I've seen this happen due to language barriers, i.e. contributors whose primary language is not English, but we need more motivation for each change. On July 29, 2014 at 5:12:01 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: +1 on using JIRA workflows