To give you a broader idea of the current use case: I have a few transformations (sorts and column creations) oriented towards a simple goal. My data is timestamped, and if two lines are identical, the time difference between them has to be more than X days for the line to be kept. So there are a few shifts involved, but only very local ones: -1 or +1.
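
To make that concrete, here is a minimal plain-Python sketch of the idea, not the actual pipeline. The names `shift` and `drop_close_duplicates`, the (key, timestamp) row layout, and X = 3 days are all assumptions made up for illustration; `shift` only imitates the "move a column by one row" behaviour of the Pandas method on a list. It also assumes a row is compared with the row immediately before it in sorted order.

```python
from datetime import datetime, timedelta


def shift(values, periods=1, fill=None):
    """Shift a list by `periods` positions, padding the gap with `fill`."""
    n = len(values)
    if periods == 0:
        return list(values)
    if abs(periods) >= n:
        return [fill] * n
    if periods > 0:
        # Positive shift: values move down, padding goes at the top.
        return [fill] * periods + list(values[:n - periods])
    # Negative shift: values move up, padding goes at the bottom.
    return list(values[-periods:]) + [fill] * (-periods)


def drop_close_duplicates(rows, min_gap):
    """Drop a row when the previous sorted row has the same key
    and is at most `min_gap` older (assumed semantics)."""
    rows = sorted(rows)  # sort by (key, timestamp)
    prev_keys = shift([key for key, _ in rows], 1)
    prev_times = shift([ts for _, ts in rows], 1)
    kept = []
    for (key, ts), prev_key, prev_ts in zip(rows, prev_keys, prev_times):
        is_repeat = (
            prev_ts is not None
            and prev_key == key
            and ts - prev_ts <= min_gap
        )
        if not is_repeat:
            kept.append((key, ts))
    return kept


# Hypothetical sample data: three "a" lines and one "b" line.
rows = [
    ("a", datetime(2015, 4, 1)),
    ("a", datetime(2015, 4, 2)),   # 1 day after the previous "a": dropped
    ("a", datetime(2015, 4, 10)),  # 8 days after the previous "a": kept
    ("b", datetime(2015, 4, 2)),   # different key: kept
]
kept = drop_close_duplicates(rows, timedelta(days=3))
```

On a single machine this is trivial because "the previous row" is well defined after the sort; the hard part in Spark is that the previous row may sit in another partition.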
FYI regarding JIRA, I created one - https://issues.apache.org/jira/browse/SPARK-7247 - associated to this discussion.

@rxin: considering that, in my use case, the data is sorted beforehand, there might be a better way, but I guess some shuffle would be needed anyway...

On Wed, Apr 29, 2015 at 10:34 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> In general there's a tension between ordered data and the set-oriented data
> model underlying DataFrames. You can force a total ordering on the data,
> but it may come at a high cost with respect to performance.
>
> It would be good to get a sense of the use case you're trying to support,
> but I can imagine achieving a similar result by applying a
> datetime.timedelta (in Python terms) to a time attribute (your "axis") and
> then performing a join between the base table and this derived table to
> merge the data back together. This type of join could then be optimized if
> the use case is frequent enough to warrant it.
>
> - Evan
>
> On Wed, Apr 29, 2015 at 1:25 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> In this case it's fine to discuss whether this would fit in Spark
>> DataFrames' high level direction before putting it in JIRA. Otherwise we
>> might end up creating a lot of tickets just for querying whether something
>> might be a good idea.
>>
>> About this specific feature -- I'm not sure what it means in general given
>> we don't have axis in Spark DataFrames. But I think it'd probably be good
>> to be able to shift a column by one so we can support the end time / begin
>> time case, although it'd require two passes over the data.
>>
>> On Wed, Apr 29, 2015 at 1:08 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>> > I can't comment on the direction of the DataFrame API (that's more for
>> > Reynold or Michael I guess), but I just wanted to point out that the JIRA
>> > would be the recommended way to create a central place for discussing a
>> > feature add like that.
>> >
>> > Nick
>> >
>> > On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>> >
>> > > Hi Nicholas,
>> > > yes I've already checked, and I've just created the
>> > > https://issues.apache.org/jira/browse/SPARK-7247
>> > > I'm not even sure why this would be a good feature to add except the fact
>> > > that some of the data scientists I'm working with are using it, and it
>> > > would therefore be useful for me to translate Pandas code to Spark...
>> > >
>> > > Isn't the goal of Spark Dataframe to allow all the features of Pandas/R
>> > > Dataframe using Spark?
>> > >
>> > > Regards,
>> > >
>> > > Olivier.
>> > >
>> > > On Wed, Apr 29, 2015 at 9:09 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> > >
>> > >> You can check JIRA for any existing plans. If there isn't any, then feel
>> > >> free to create a JIRA and make the case there for why this would be a good
>> > >> feature to add.
>> > >>
>> > >> Nick
>> > >>
>> > >> On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>> > >>
>> > >>> Hi,
>> > >>> Is there any plan to add the "shift" method from Pandas to Spark
>> > >>> Dataframe, not that I think it's an easy task...
>> > >>>
>> > >>> c.f.
>> > >>> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
>> > >>>
>> > >>> Regards,
>> > >>>
>> > >>> Olivier.