You're right. A global sort would change the clustering if the sort had more fields than the clustering.
Then what about this: if there is no RequiredClustering, then the sort is a global sort. If RequiredClustering is present, then the clustering is applied and the sort is a partition-level sort. That rule would mean that within a partition you always get the sort, but an explicit clustering overrides the partitioning a sort might try to introduce. Does that sound reasonable? rb On Fri, Mar 30, 2018 at 12:39 PM, Patrick Woody <patrick.woo...@gmail.com> wrote: > Does that methodology work in this specific case? The ordering must be a > subset of the clustering to guarantee they exist in the same partition when > doing a global sort I thought. Though I get the gist that if it does > satisfy, then there is no reason to not choose the global sort. > > On Fri, Mar 30, 2018 at 1:31 PM, Ryan Blue <rb...@netflix.com> wrote: > >> > Can you expand on how the ordering containing the clustering >> expressions would ensure the global sort? >> >> The idea was to basically assume that if the clustering can be satisfied >> by a global sort, then do the global sort. For example, if the clustering >> is Set("b", "a") and the sort is Seq("a", "b", "c") then do a global sort >> by columns a, b, and c. >> >> Technically, you could do this with a hash partitioner instead of a range >> partitioner and sort within each partition, but that doesn't make much >> sense because the partitioning would ensure that each partition has just >> one combination of the required clustering columns. Using a hash >> partitioner would make it so that the in-partition sort basically ignores >> the first few values, so it must be that the intent was a global sort. >> >> On Fri, Mar 30, 2018 at 6:51 AM, Patrick Woody <patrick.woo...@gmail.com> >> wrote: >> >>> Right, you could use this to store a global ordering if there is only >>>> one write (e.g., CTAS). I don’t think anything needs to change in that >>>> case, you would still have a clustering and an ordering, but the ordering >>>> would need to include all fields of the clustering. A way to pass in the >>>> partition ordinal for the source to store would be required. >>> >>> >>> Can you expand on how the ordering containing the clustering expressions >>> would ensure the global sort? Having an RangePartitioning would certainly >>> satisfy, but it isn't required - is the suggestion that if Spark sees this >>> overlap, then it plans a global sort? >>> >>> On Thu, Mar 29, 2018 at 12:16 PM, Russell Spitzer < >>> russell.spit...@gmail.com> wrote: >>> >>>> @RyanBlue I'm hoping that through the CBO effort we will continue to >>>> get more detailed statistics. Like on read we could be using sketch data >>>> structures to get estimates on unique values and density for each column. >>>> You may be right that the real way for this to be handled would be giving a >>>> "cost" back to a higher order optimizer which can decide which method to >>>> use rather than having the data source itself do it. This is probably in a >>>> far future version of the api. >>>> >>>> On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com> wrote: >>>> >>>>> Cassandra can insert records with the same partition-key faster if >>>>> they arrive in the same payload. But this is only beneficial if the >>>>> incoming dataset has multiple entries for the same partition key. >>>>> >>>>> Thanks for the example, the recommended partitioning use case makes >>>>> more sense now. I think we could have two interfaces, a >>>>> RequiresClustering and a RecommendsClustering if we want to support >>>>> this. 
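(For reference, a rough sketch of what that pair of interfaces might look like, mirroring the RequiresClustering sketch further down the thread; RecommendsClustering and its method name are only a guess at this point, not a settled API:)

interface RequiresClustering {
  // Spark must cluster incoming data by these expressions.
  List<Expression> requiredClustering();
}

interface RecommendsClustering {
  // Spark may cluster incoming data by these expressions, but can
  // skip the shuffle if it judges the cost isn't worth it.
  List<Expression> recommendedClustering();
}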
But I’m skeptical it will be useful for two reasons: >>>>> >>>>> - Do we want to optimize the low cardinality case? Shuffles are >>>>> usually much cheaper at smaller sizes, so I’m not sure it is necessary >>>>> to >>>>> optimize this away. >>>>> - How do we know there isn’t just a few partition keys for all the >>>>> records? It may look like a shuffle wouldn’t help, but we don’t know >>>>> the >>>>> partition keys until it is too late. >>>>> >>>>> Then there’s also the logic for avoiding the shuffle and how to >>>>> calculate the cost, which sounds like something that needs some details >>>>> from CBO. >>>>> >>>>> I would assume that given the estimated data size from Spark and >>>>> options passed in from the user, the data source could make a more >>>>> intelligent requirement on the write format than Spark independently. >>>>> >>>>> This is a good point. >>>>> >>>>> What would an implementation actually do here and how would >>>>> information be passed? For my use cases, the store would produce the >>>>> number >>>>> of tasks based on the estimated incoming rows, because the source has the >>>>> best idea of how the rows will compress. But, that’s just applying a >>>>> multiplier most of the time. To be very useful, this would have to handle >>>>> skew in the rows (think row with a type where total size depends on type) >>>>> and that’s a bit harder. I think maybe an interface that can provide >>>>> relative cost estimates based on partition keys would be helpful, but then >>>>> keep the planning logic in Spark. >>>>> >>>>> This is probably something that we could add later as we find use >>>>> cases that require it? >>>>> >>>>> I wouldn’t assume that a data source requiring a certain write format >>>>> would give any guarantees around reading the same data? In the cases where >>>>> it is a complete overwrite it would, but for independent writes it could >>>>> still be useful for statistics or compression. >>>>> >>>>> Right, you could use this to store a global ordering if there is only >>>>> one write (e.g., CTAS). I don’t think anything needs to change in that >>>>> case, you would still have a clustering and an ordering, but the ordering >>>>> would need to include all fields of the clustering. A way to pass in the >>>>> partition ordinal for the source to store would be required. >>>>> >>>>> For the second point that ordering is useful for statistics and >>>>> compression, I completely agree. Our best practices doc tells users to >>>>> always add a global sort when writing because you get the benefit of a >>>>> range partitioner to handle skew, plus the stats and compression you’re >>>>> talking about to optimize for reads. I think the proposed API can request >>>>> a >>>>> global ordering from Spark already. My only point is that there isn’t much >>>>> the source can do to guarantee ordering for reads when there is more than >>>>> one write. >>>>> >>>>> >>>>> On Wed, Mar 28, 2018 at 7:14 PM, Patrick Woody < >>>>> patrick.woo...@gmail.com> wrote: >>>>> >>>>>> Spark would always apply the required clustering and sort order >>>>>>> because they are required by the data source. It is reasonable for a >>>>>>> source >>>>>>> to reject data that isn’t properly prepared. For example, data must be >>>>>>> written to HTable files with keys in order or else the files are >>>>>>> invalid. >>>>>>> Sorting should not be implemented in the sources themselves because >>>>>>> Spark >>>>>>> handles concerns like spilling to disk. 
Spark must prepare data >>>>>>> correctly, >>>>>>> which is why the interfaces start with “Requires”. >>>>>> >>>>>> >>>>>> This was in reference to Russell's suggestion that the data source >>>>>> could have a required sort, but only a recommended partitioning. I don't >>>>>> have an immediate recommending use case that would come to mind though. >>>>>> I'm >>>>>> definitely in sync that the data source itself shouldn't do work outside >>>>>> of >>>>>> the writes themselves. >>>>>> >>>>>> Considering the second use case you mentioned first, I don’t think it >>>>>>> is a good idea for a table to put requirements on the number of tasks >>>>>>> used >>>>>>> for a write. The parallelism should be set appropriately for the data >>>>>>> volume, which is for Spark or the user to determine. A minimum or >>>>>>> maximum >>>>>>> number of tasks could cause bad behavior. >>>>>> >>>>>> >>>>>> For your first use case, an explicit global ordering, the problem is >>>>>>> that there can’t be an explicit global ordering for a table when it is >>>>>>> populated by a series of independent writes. Each write could have a >>>>>>> global >>>>>>> order, but once those files are written, you have to deal with multiple >>>>>>> sorted data sets. I think it makes sense to focus on order within data >>>>>>> files, not order between data files. >>>>>> >>>>>> >>>>>> This is where I'm interested in learning about the separation of >>>>>> responsibilities for the data source and how "smart" it is supposed to >>>>>> be. >>>>>> >>>>>> For the first part, I would assume that given the estimated data size >>>>>> from Spark and options passed in from the user, the data source could >>>>>> make >>>>>> a more intelligent requirement on the write format than Spark >>>>>> independently. Somewhat analogous to how the current FileSource does bin >>>>>> packing of small files on the read side, restricting parallelism for the >>>>>> sake of overhead. >>>>>> >>>>>> For the second, I wouldn't assume that a data source requiring a >>>>>> certain write format would give any guarantees around reading the same >>>>>> data? In the cases where it is a complete overwrite it would, but for >>>>>> independent writes it could still be useful for statistics or >>>>>> compression. >>>>>> >>>>>> Thanks >>>>>> Pat >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Mar 28, 2018 at 8:28 PM, Ryan Blue <rb...@netflix.com> wrote: >>>>>> >>>>>>> How would Spark determine whether or not to apply a recommendation - >>>>>>> a cost threshold? >>>>>>> >>>>>>> Spark would always apply the required clustering and sort order >>>>>>> because they are required by the data source. It is reasonable for a >>>>>>> source >>>>>>> to reject data that isn’t properly prepared. For example, data must be >>>>>>> written to HTable files with keys in order or else the files are >>>>>>> invalid. >>>>>>> Sorting should not be implemented in the sources themselves because >>>>>>> Spark >>>>>>> handles concerns like spilling to disk. Spark must prepare data >>>>>>> correctly, >>>>>>> which is why the interfaces start with “Requires”. >>>>>>> >>>>>>> I’m not sure what the second half of your question means. What does >>>>>>> Spark need to pass into the data source? >>>>>>> >>>>>>> Should a datasource be able to provide a Distribution proper rather >>>>>>> than just the clustering expressions? Two use cases would be for >>>>>>> explicit >>>>>>> global sorting of the dataset and attempting to ensure a minimum write >>>>>>> task >>>>>>> size/number of write tasks. 
>>>>>>> >>>>>>> Considering the second use case you mentioned first, I don’t think >>>>>>> it is a good idea for a table to put requirements on the number of tasks >>>>>>> used for a write. The parallelism should be set appropriately for the >>>>>>> data >>>>>>> volume, which is for Spark or the user to determine. A minimum or >>>>>>> maximum >>>>>>> number of tasks could cause bad behavior. >>>>>>> >>>>>>> That said, I think there is a related use case for sharding. But >>>>>>> that’s really just a clustering by an expression with the shard >>>>>>> calculation, e.g., hash(id_col, 64). The shards should be handled >>>>>>> as a cluster, but it doesn’t matter how many tasks are used for it. >>>>>>> >>>>>>> For your first use case, an explicit global ordering, the problem is >>>>>>> that there can’t be an explicit global ordering for a table when it is >>>>>>> populated by a series of independent writes. Each write could have a >>>>>>> global >>>>>>> order, but once those files are written, you have to deal with multiple >>>>>>> sorted data sets. I think it makes sense to focus on order within data >>>>>>> files, not order between data files. >>>>>>> >>>>>>> >>>>>>> On Wed, Mar 28, 2018 at 7:26 AM, Patrick Woody < >>>>>>> patrick.woo...@gmail.com> wrote: >>>>>>> >>>>>>>> How would Spark determine whether or not to apply a recommendation >>>>>>>> - a cost threshold? And yes, it would be good to flesh out what >>>>>>>> information >>>>>>>> we get from Spark in the datasource when providing these >>>>>>>> recommendations/requirements - I could see statistics and the existing >>>>>>>> outputPartitioning/Ordering of the child plan being used for providing >>>>>>>> the >>>>>>>> requirement. >>>>>>>> >>>>>>>> Should a datasource be able to provide a Distribution proper rather >>>>>>>> than just the clustering expressions? Two use cases would be for >>>>>>>> explicit >>>>>>>> global sorting of the dataset and attempting to ensure a minimum write >>>>>>>> task >>>>>>>> size/number of write tasks. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Mar 27, 2018 at 7:59 PM, Russell Spitzer < >>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Thanks for the clarification, definitely would want to require >>>>>>>>> Sort but only recommend partitioning ... I think that would be >>>>>>>>> useful to >>>>>>>>> request based on details about the incoming dataset. >>>>>>>>> >>>>>>>>> On Tue, Mar 27, 2018 at 4:55 PM Ryan Blue <rb...@netflix.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> A required clustering would not, but a required sort would. >>>>>>>>>> Clustering is asking for the input dataframe's partitioning, and >>>>>>>>>> sorting >>>>>>>>>> would be how each partition is sorted. >>>>>>>>>> >>>>>>>>>> On Tue, Mar 27, 2018 at 4:53 PM, Russell Spitzer < >>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> I forgot since it's been a while, but does Clustering support >>>>>>>>>>> allow requesting that partitions contain elements in order as well? >>>>>>>>>>> That >>>>>>>>>>> would be a useful trick for me. IE >>>>>>>>>>> Request/Require(SortedOn(Col1)) >>>>>>>>>>> Partition 1 -> ((A,1), (A, 2), (B,1) , (B,2) , (C,1) , (C,2)) >>>>>>>>>>> >>>>>>>>>>> On Tue, Mar 27, 2018 at 4:38 PM Ryan Blue >>>>>>>>>>> <rb...@netflix.com.invalid> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks, it makes sense that the existing interface is for >>>>>>>>>>>> aggregation and not joins. Why are there requirements for the >>>>>>>>>>>> number of >>>>>>>>>>>> partitions that are returned then? 
>>>>>>>>>>>> >>>>>>>>>>>> Does it makes sense to design the write-side `Requirement` >>>>>>>>>>>> classes and the read-side reporting separately? >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 27, 2018 at 3:56 PM, Wenchen Fan < >>>>>>>>>>>> cloud0...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Ryan, yea you are right that SupportsReportPartitioning >>>>>>>>>>>>> doesn't expose hash function, so Join can't benefit from this >>>>>>>>>>>>> interface, as >>>>>>>>>>>>> Join doesn't require a general ClusteredDistribution, but a more >>>>>>>>>>>>> specific >>>>>>>>>>>>> one called HashClusteredDistribution. >>>>>>>>>>>>> >>>>>>>>>>>>> So currently only Aggregate can benefit from >>>>>>>>>>>>> SupportsReportPartitioning and save shuffle. We can add a new >>>>>>>>>>>>> interface to >>>>>>>>>>>>> expose the hash function to make it work for Join. >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Mar 27, 2018 at 9:33 AM, Ryan Blue <rb...@netflix.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I just took a look at SupportsReportPartitioning and I'm not >>>>>>>>>>>>>> sure that it will work for real use cases. It doesn't specify, >>>>>>>>>>>>>> as far as I >>>>>>>>>>>>>> can tell, a hash function for combining clusters into tasks or a >>>>>>>>>>>>>> way to >>>>>>>>>>>>>> provide Spark a hash function for the other side of a join. It >>>>>>>>>>>>>> seems >>>>>>>>>>>>>> unlikely to me that many data sources would have partitioning >>>>>>>>>>>>>> that happens >>>>>>>>>>>>>> to match the other side of a join. And, it looks like task order >>>>>>>>>>>>>> matters? >>>>>>>>>>>>>> Maybe I'm missing something? >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think that we should design the write side independently >>>>>>>>>>>>>> based on what data stores actually need, and take a look at the >>>>>>>>>>>>>> read side >>>>>>>>>>>>>> based on what data stores can actually provide. Wenchen, was >>>>>>>>>>>>>> there a design >>>>>>>>>>>>>> doc for partitioning on the read path? >>>>>>>>>>>>>> >>>>>>>>>>>>>> I completely agree with your point about a global sort. We >>>>>>>>>>>>>> recommend to all of our data engineers to add a sort to most >>>>>>>>>>>>>> tables because >>>>>>>>>>>>>> it introduces the range partitioner and does a skew calculation, >>>>>>>>>>>>>> in >>>>>>>>>>>>>> addition to making data filtering much better when it is read. >>>>>>>>>>>>>> It's really >>>>>>>>>>>>>> common for tables to be skewed by partition values. >>>>>>>>>>>>>> >>>>>>>>>>>>>> rb >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 7:59 PM, Patrick Woody < >>>>>>>>>>>>>> patrick.woo...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hey Ryan, Ted, Wenchen >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for the quick replies. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> @Ryan - the sorting portion makes sense, but I think we'd >>>>>>>>>>>>>>> have to ensure something similar to requiredChildDistribution >>>>>>>>>>>>>>> in SparkPlan >>>>>>>>>>>>>>> where we have the number of partitions as well if we'd want to >>>>>>>>>>>>>>> further >>>>>>>>>>>>>>> report to SupportsReportPartitioning, yeah? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Specifying an explicit global sort can also be useful for >>>>>>>>>>>>>>> filtering purposes on Parquet row group stats if we have a time >>>>>>>>>>>>>>> based/high >>>>>>>>>>>>>>> cardinality ID field. If my datasource or catalog knows about >>>>>>>>>>>>>>> previous >>>>>>>>>>>>>>> queries on a table, it could be really useful to recommend more >>>>>>>>>>>>>>> appropriate >>>>>>>>>>>>>>> formatting for consumers on the next materialization. 
The same >>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>> true of clustering on commonly joined fields. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks again >>>>>>>>>>>>>>> Pat >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 10:05 PM, Ted Yu < >>>>>>>>>>>>>>> yuzhih...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hmm. Ryan seems to be right. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Looking at sql/core/src/main/java/org/ >>>>>>>>>>>>>>>> apache/spark/sql/sources/v2/re >>>>>>>>>>>>>>>> ader/SupportsReportPartitioning.java : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> import org.apache.spark.sql.sources.v >>>>>>>>>>>>>>>> 2.reader.partitioning.Partitioning; >>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>> Partitioning outputPartitioning(); >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 6:58 PM, Wenchen Fan < >>>>>>>>>>>>>>>> cloud0...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Actually clustering is already supported, please take a >>>>>>>>>>>>>>>>> look at SupportsReportPartitioning >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ordering is not proposed yet, might be similar to what >>>>>>>>>>>>>>>>> Ryan proposed. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu < >>>>>>>>>>>>>>>>> yuzhih...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Interesting. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Should requiredClustering return a Set of Expression's ? >>>>>>>>>>>>>>>>>> This way, we can determine the order of Expression's by >>>>>>>>>>>>>>>>>> looking at what requiredOrdering() returns. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue < >>>>>>>>>>>>>>>>>> rb...@netflix.com.invalid> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Pat, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks for starting the discussion on this, we’re really >>>>>>>>>>>>>>>>>>> interested in it as well. I don’t think there is a proposed >>>>>>>>>>>>>>>>>>> API yet, but I >>>>>>>>>>>>>>>>>>> was thinking something like this: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> interface RequiresClustering { >>>>>>>>>>>>>>>>>>> List<Expression> requiredClustering(); >>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> interface RequiresSort { >>>>>>>>>>>>>>>>>>> List<SortOrder> requiredOrdering(); >>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The reason why RequiresClustering should provide >>>>>>>>>>>>>>>>>>> Expression is that it needs to be able to customize the >>>>>>>>>>>>>>>>>>> implementation. For example, writing to HTable would >>>>>>>>>>>>>>>>>>> require building a key >>>>>>>>>>>>>>>>>>> (or the data for a key) and that might use a hash function >>>>>>>>>>>>>>>>>>> that differs >>>>>>>>>>>>>>>>>>> from Spark’s built-ins. RequiresSort is fairly >>>>>>>>>>>>>>>>>>> straightforward, but the interaction between the two >>>>>>>>>>>>>>>>>>> requirements deserves >>>>>>>>>>>>>>>>>>> some consideration. To make the two compatible, I think that >>>>>>>>>>>>>>>>>>> RequiresSort must be interpreted as a sort within each >>>>>>>>>>>>>>>>>>> partition of the clustering, but could possibly be used for >>>>>>>>>>>>>>>>>>> a global sort >>>>>>>>>>>>>>>>>>> when the two overlap. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> For example, if I have a table partitioned by “day” and >>>>>>>>>>>>>>>>>>> “category” then the RequiredClustering would be by day, >>>>>>>>>>>>>>>>>>> category. A required sort might be day ASC, category >>>>>>>>>>>>>>>>>>> DESC, name ASC. 
Because that sort satisfies the >>>>>>>>>>>>>>>>>>> required clustering, it could be used for a global >>>>>>>>>>>>>>>>>>> ordering. But, is that >>>>>>>>>>>>>>>>>>> useful? How would the global ordering matter beyond a sort >>>>>>>>>>>>>>>>>>> within each >>>>>>>>>>>>>>>>>>> partition, i.e., how would the partition’s place in the >>>>>>>>>>>>>>>>>>> global ordering be >>>>>>>>>>>>>>>>>>> passed? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> To your other questions, you might want to have a look >>>>>>>>>>>>>>>>>>> at the recent SPIP I’m working on to consolidate and >>>>>>>>>>>>>>>>>>> clean up logical plans >>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>. >>>>>>>>>>>>>>>>>>> That proposes more specific uses for the DataSourceV2 API >>>>>>>>>>>>>>>>>>> that should help >>>>>>>>>>>>>>>>>>> clarify what validation needs to take place. As for custom >>>>>>>>>>>>>>>>>>> catalyst rules, >>>>>>>>>>>>>>>>>>> I’d like to hear about the use cases to see if we can build >>>>>>>>>>>>>>>>>>> it into these >>>>>>>>>>>>>>>>>>> improvements. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> rb >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Mon, Mar 26, 2018 at 8:40 AM, Patrick Woody < >>>>>>>>>>>>>>>>>>> patrick.woo...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hey all, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I saw in some of the discussions around DataSourceV2 >>>>>>>>>>>>>>>>>>>> writes that we might have the data source inform Spark of >>>>>>>>>>>>>>>>>>>> requirements for >>>>>>>>>>>>>>>>>>>> the input data's ordering and partitioning. Has there been >>>>>>>>>>>>>>>>>>>> a proposed API >>>>>>>>>>>>>>>>>>>> for that yet? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Even one level up it would be helpful to understand how >>>>>>>>>>>>>>>>>>>> I should be thinking about the responsibility of the data >>>>>>>>>>>>>>>>>>>> source writer, >>>>>>>>>>>>>>>>>>>> when I should be inserting a custom catalyst rule, and how >>>>>>>>>>>>>>>>>>>> I should handle >>>>>>>>>>>>>>>>>>>> validation/assumptions of the table before attempting the >>>>>>>>>>>>>>>>>>>> write. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>> Pat >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>>>>>>> Software Engineer >>>>>>>>>>>>>>>>>>> Netflix >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>>> Software Engineer >>>>>>>>>>>>>> Netflix >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Ryan Blue >>>>>>>>>>>> Software Engineer >>>>>>>>>>>> Netflix >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Ryan Blue >>>>>>>>>> Software Engineer >>>>>>>>>> Netflix >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ryan Blue >>>>>>> Software Engineer >>>>>>> Netflix >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Software Engineer >>>>> Netflix >>>>> >>>> >>> >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > > -- Ryan Blue Software Engineer Netflix
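As a concrete sketch of the interplay discussed in the thread, a hypothetical sink for the day/category example could implement the two proposed interfaces as below. The class name and the ref/asc/desc helpers are placeholders, since the thread leaves the exact Expression and SortOrder construction open.

import java.util.Arrays;
import java.util.List;

// Expression and SortOrder stand for whatever types the final
// RequiresClustering / RequiresSort API settles on.
class DayCategoryWriter implements RequiresClustering, RequiresSort {

  @Override
  public List<Expression> requiredClustering() {
    // Spark must cluster incoming rows so that each write task sees whole
    // (day, category) groups; ref(...) is a hypothetical column-reference factory.
    return Arrays.asList(ref("day"), ref("category"));
  }

  @Override
  public List<SortOrder> requiredOrdering() {
    // Within each clustered partition, rows should arrive ordered by
    // day ASC, category DESC, name ASC; asc/desc are hypothetical factories.
    return Arrays.asList(asc("day"), desc("category"), asc("name"));
  }
}

Under the rule proposed at the top of the thread, because a required clustering is present here, Spark would apply the clustering and treat the ordering as a partition-level sort; dropping requiredClustering() would turn the same ordering into a request for a global sort.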