> To save the effort, or invest it in higher-priority issues, we plan to:
> 1. We will stop providing "additional algorithms", unless they are explicitly required.

This sounds reasonable; we can also evaluate on a case-by-case basis how widely applicable some are.

> 2. For existing additional algorithms in our code base, we will stop improving them.

OK, I'm a little afraid of bit-rot here, but we can see how things go.

Cheers,
Micah

On Tue, Oct 22, 2019 at 7:09 PM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Micah,
>
> Thank you for reading through my previous email.
>
> > Is the conversation about rejecting the changes in Flink something you can link to? I found [1] which seems to allow for Arrow in what seem like reasonable places, just not inside the core planner (and even that is a possibility with a proper PoC). However, I don't think the algorithms proposed here are directly related to those discussions.
>
> There is a short discussion [1] in the ML. Please note that our proposal is not officially "rejected". It is just ignored silently (in fact, this makes no difference to us). We have had some conferences/discussions with the Flink committers and founders, and it seems they like the ideas, but no progress has been made so far, because the change is too large and too risky. The other issue you indicated [2] represents another (earlier) attempt to incorporate Arrow into Flink. However, that issue has made no progress either.
>
> > I don't agree with this conclusion. Apache Drill, where most of the Java code came from, has been around for a longer period of time. Also, even without Arrow being around, columnar vs. row-based DB engines is a design decision that has nothing to do with existing open source projects. Does Flink use another open source library for its row representation?
>
> I think you mean that row vs. columnar representation and open source project selection are two independent issues. I agree with you. Flink has its own implementation of a row store, although I think they should have used Arrow directly (had it been available earlier), as columnar storage is the mainstream.
>
> > I think this circles back around to my original points:
> > 1. Which users are we expecting to use the algorithms package that aren't directly related to data transport in Java (i.e. additional algorithms)? In many cases the algorithms seem like they would be query-engine specific. I haven't seen much evidence that there are users of the Java code base that need all these algorithms.
> > 2. Contributions to any project consume resources and people's time. If there is only going to be one user of the code, it might not belong in Arrow "proper" due to these hurdles.
>
> I agree with you that contributing code consumes a lot of effort, and we should only provide general algorithms.
>
> To save the effort, or invest it in higher-priority issues, we plan to:
> 1. We will stop providing "additional algorithms", unless they are explicitly required.
> 2. For existing additional algorithms in our code base, we will stop improving them.
>
> Thanks again for your effort in reviewing the algorithms and for all the good review comments.
>
> Best,
> Liya Fan
>
> [1] http://mail-archives.apache.org/mod_mbox/flink-dev/201907.mbox/browser
> [2] https://issues.apache.org/jira/browse/FLINK-10929
>
> On Sun, Oct 20, 2019 at 12:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Hi Liya Fan,
>>
>> Is the conversation about rejecting the changes in Flink something you can link to?
>> I found [1] which seems to allow for Arrow in what seem like reasonable places, just not inside the core planner (and even that is a possibility with a proper PoC). However, I don't think the algorithms proposed here are directly related to those discussions.
>>
>>> I think the lesson learned is that we should provide some features proactively (at least the general features), and make them good enough. Apache Flink was started around 2015, and Arrow's Java project was started in 2016. If Arrow had been available earlier, maybe Flink would have chosen it in the first place.
>>
>> I don't agree with this conclusion. Apache Drill, where most of the Java code came from, has been around for a longer period of time. Also, even without Arrow being around, columnar vs. row-based DB engines is a design decision that has nothing to do with existing open source projects. Does Flink use another open source library for its row representation?
>>
>>> When a user needs an algorithm, it may already be too late. AFAIK, most users will choose to implement one by themselves, rather than opening a JIRA in the community. It takes a long time to provide a PR, review the code, merge the code, and wait for the next release.
>>
>> I think this circles back around to my original points:
>> 1. Which users are we expecting to use the algorithms package that aren't directly related to data transport in Java (i.e. additional algorithms)? In many cases the algorithms seem like they would be query-engine specific. I haven't seen much evidence that there are users of the Java code base that need all these algorithms.
>> 2. Contributions to any project consume resources and people's time. If there is only going to be one user of the code, it might not belong in Arrow "proper" due to these hurdles.
>>
>> Thanks,
>> Micah
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10929
>>
>> On Mon, Oct 14, 2019 at 10:38 PM Fan Liya <liya.fa...@gmail.com> wrote:
>>
>>> Hi Micah,
>>>
>>> Thanks a lot for your valuable comments. I mostly agree with your points.
>>>
>>> Please note that in addition to algorithms directly related to Arrow features, there are algorithms indirectly related to Arrow features, and they should be given a lower priority. An example is IPC -> dictionary encoding -> vector sort. I agree that these features should adhere to the requirements of Arrow specification features.
>>>
>>> I also want to discuss why developers/users are not using some of the algorithms (IMO):
>>>
>>> 1. For users to use the algorithms, they must be available and the users must be aware of them. Our first algorithm was published only 3 months ago, in 0.14, which has very limited functionality. In 0.15, we have more functionality, but 0.15 has been published for no more than 10 days.
>>>
>>> 2. For users to use the algorithms, they must be good enough. In particular:
>>> 1) They must be performant. So far for us, performance improvement has started, and there is still much room for further improvement.
>>> 2) They must be functionally complete. We have not reached this goal yet. For example, we do not support sorting all vector types.
>>> 3) The algorithms should be easy to use.
>>>
>>> 3. Some SQL engines rely on native code implementations. For example, Dremio has Gandiva, which is based on LLVM. For such scenarios, we do not recommend that they use our algorithms.
>>> 4. Some SQL engines rely on Java implementations, but they do not rely on Arrow (e.g. Drill, Flink, etc.). I think this issue (convincing them to use Arrow) should be given a higher priority. If these engines relied on Arrow, it would be more likely that they would use Arrow algorithms (provided that the algorithms are good enough).
>>>
>>> Concerning the last point, I want to share our experience when introducing Arrow to another project. (I am not sure if it is appropriate to discuss it here. Maybe you can give us some advice.)
>>>
>>> Apache Flink's runtime is based on Java but not on Arrow. We have a private Flink branch in our team, which is based on Arrow. Compared with the open source edition, our Arrow-based edition provides higher performance. It improves performance by 30% for the TPC-H 1TB benchmark (please see Section 4 of [1]). We wanted to contribute the changes to the Flink community and convince them to use Arrow in their core engine. However, at least for now, they have not accepted the proposal.
>>>
>>> The main reason is that the changes are too big, which is too risky: many underlying data structures, algorithms and the framework must be changed. They admit that using Arrow is better, and they are aware of the performance improvements, but they are just unwilling to take the risk.
>>>
>>> I think the lesson learned is that we should provide some features proactively (at least the general features), and make them good enough. Apache Flink was started around 2015, and Arrow's Java project was started in 2016. If Arrow had been available earlier, maybe Flink would have chosen it in the first place.
>>>
>>> When a user needs an algorithm, it may already be too late. AFAIK, most users will choose to implement one by themselves, rather than opening a JIRA in the community. It takes a long time to provide a PR, review the code, merge the code, and wait for the next release.
>>>
>>> Therefore, I think what we should do is try all means to make Arrow better: by providing general functionality, by making it performant, by making it functionally complete, and by making it easier to use. By making Arrow better, I believe more users will choose Arrow. When trust is established, more users will switch to Arrow.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> [1] https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb
>>> [2] https://issues.apache.org/jira/browse/FLINK-13053
>>>
>>> On Mon, Oct 14, 2019 at 5:46 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Liya Fan,
>>>>
>>>>> I think the algorithms would be better termed "micro-algorithms". They are termed "micro" in the sense that they do not directly compose a query engine, because they only provide primitive functionality (e.g. vector sort). Instead, they can be used as building blocks for query engines. The major benefit of the micro-algorithms is their generality: they can be used in a wide range of common scenarios, and they can be used in more than one query engine. In addition, there are other common scenarios, like vector data compression/decompression (e.g. dictionary encoding and RLE encoding, as we have already supported/discussed), IPC communication, data analysis, data mining, etc.
>>>>
>>>> I agree the algorithms can be generally useful.
>>>> But I still have concerns about who is going to use them.
>>>>
>>>> I think there are two categories the algorithms fall into:
>>>> 1. Algorithms directly related to Arrow specification features. For these, I agree some of the functionality will be needed as a reference implementation. At least for existing functionality I think there is already sufficient coverage, and in some cases (i.e. dictionary encoding) there is already duplicate coverage.
>>>>
>>>> 2. Other algorithms - I think these fall into "data analysis, data mining, etc.", and for these I think it goes back to the question of whether developers/users would use the given algorithms to build their own one-off analysis, or use already existing tools like Apache Spark or a SQL engine that already incorporates the algorithms.
>>>>
>>>> I'm a little disappointed that more maintainers/developers haven't given their input on this topic. I hope some will help with the work involved in reviewing the algorithms if they find them valuable.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fan_li...@aliyun.com> wrote:
>>>>
>>>>> Hi Micah and Praveen,
>>>>>
>>>>> Thanks a lot for your valuable feedback.
>>>>>
>>>>> My thoughts on the problems:
>>>>>
>>>>> 1. About the audience of the algorithms:
>>>>>
>>>>> I think the algorithms would be better termed "micro-algorithms". They are termed "micro" in the sense that they do not directly compose a query engine, because they only provide primitive functionality (e.g. vector sort). Instead, they can be used as building blocks for query engines. The major benefit of the micro-algorithms is their generality: they can be used in a wide range of common scenarios, and they can be used in more than one query engine. In addition, there are other common scenarios, like vector data compression/decompression (e.g. dictionary encoding and RLE encoding, as we have already supported/discussed), IPC communication, data analysis, data mining, etc.
>>>>>
>>>>> 2. About performance improvements:
>>>>>
>>>>> Code generation and template types are powerful tools. In addition, the JIT is also a powerful tool, as it can inline megamorphic virtual functions in many scenarios, if the algorithm is implemented appropriately. IMO, code generation is applicable to almost all scenarios for achieving good performance, if we are willing to pay the price in code readability. I will try to detail the principles for choosing between these tools for performance improvements later.
>>>>>
>>>>> Best,
>>>>> Liya Fan
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> From: Praveen Kumar <prav...@dremio.com>
>>>>> Date: Friday, October 4, 2019, 19:20
>>>>> To: Micah Kornfield <emkornfi...@gmail.com>
>>>>> Cc: Fan Liya <liya.fa...@gmail.com>; dev <dev@arrow.apache.org>
>>>>> Subject: Re: [DISCUSS][Java] Design of the algorithm module
>>>>>
>>>>> Hi Micah,
>>>>>
>>>>> I agree with 1. I think what an end user would really want is a query/data processing engine. I am not sure how easy/relevant the algorithms will be in the absence of the engine. For example, most of these operators would need to be pipelined and would need to handle memory, distribution, etc. So bundling this along with the engine makes a lot more sense; the interfaces required might be a bit different for that, too.
>>>>>
>>>>> Thx.
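
To make the point above about code generation, template types, and JIT inlining concrete, here is a toy Java sketch of the "template" flavor of the idea. This is not Arrow code, and every class name in it is invented for illustration: the sort loop is written against an abstract comparator, and each call site supplies a single concrete final comparator class, so the JIT sees one receiver type at the hot compare() call and can devirtualize and inline it.

    // Toy sketch, not Arrow code: illustrates keeping a hot virtual call
    // monomorphic so the JIT can devirtualize and inline it.
    abstract class IndexComparator {
      abstract int compare(int leftIndex, int rightIndex);
    }

    // One concrete, final comparator per data representation; a call site
    // that only ever sees this type stays monomorphic.
    final class IntArrayComparator extends IndexComparator {
      private final int[] values;

      IntArrayComparator(int[] values) {
        this.values = values;
      }

      @Override
      int compare(int leftIndex, int rightIndex) {
        return Integer.compare(values[leftIndex], values[rightIndex]);
      }
    }

    final class IndexSorter {
      // Generic insertion sort over an index array; the comparator call is
      // the virtual call we want the JIT to inline.
      static void insertionSort(int[] indices, IndexComparator comparator) {
        for (int i = 1; i < indices.length; i++) {
          for (int j = i; j > 0 && comparator.compare(indices[j - 1], indices[j]) > 0; j--) {
            int tmp = indices[j - 1];
            indices[j - 1] = indices[j];
            indices[j] = tmp;
          }
        }
      }
    }

A (byte)code-generation variant would instead emit a specialized copy of the sort loop per comparator at runtime, trading readability for guaranteed monomorphism; which approach is preferable depends largely on how many comparator types are live at a given call site.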

>>>>> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Liya Fan,
>>>>>>
>>>>>> Thanks again for writing this up. I think it provides a road-map for the intended features. I commented on the document, but I wanted to raise a few high-level concerns here as well to get more feedback from the community.
>>>>>>
>>>>>> 1. It isn't clear to me who the users of this will be. My perception is that in the Java ecosystem there aren't use-cases for the algorithms outside of specific compute engines. I'm not super involved in open-source Java these days, so I would love to hear others' opinions. For instance, I'm not sure if Dremio would switch to using these algorithms instead of the ones they've already open-sourced [1], and Apache Spark, I believe, is only using Arrow for interfacing with Python (they similarly have their own compute pipeline). I think you mentioned in the past that these are being used internally in an engine that your company is working on, but if that is the only consumer it makes me wonder if the algorithm development might be better served as part of that engine.
>>>>>>
>>>>>> 2. If we do move forward with this, we also need a plan for how to optimize the algorithms to avoid virtual calls. There are two high-level approaches: template-based and (byte)code-generation based. Neither is applicable in all situations, but it would be good to come to consensus on when (and when not) to use each.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> [1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Micah,
>>>>>>>
>>>>>>> Thanks for your effort and precious time. Looking forward to receiving more valuable feedback from you.
>>>>>>>
>>>>>>> Best,
>>>>>>> Liya Fan
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Liya Fan,
>>>>>>>>
>>>>>>>> I started reviewing but haven't gotten all the way through it. I will try to leave more comments over the next few days.
>>>>>>>>
>>>>>>>> Thanks again for the write-up; I think it will help frame a productive conversation.
>>>>>>>>
>>>>>>>> -Micah
>>>>>>>>
>>>>>>>> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Micah,
>>>>>>>>>
>>>>>>>>> Thanks for your kind reminder. Comments are enabled now.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Liya Fan
>>>>>>>>>
>>>>>>>>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Liya Fan,
>>>>>>>>>>
>>>>>>>>>> Thank you for this write-up; it doesn't look like comments are enabled on the document. Could you allow for them?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> We have prepared a document for discussing the requirements, design and implementation issues for the algorithm module of Java:
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> So far, we have finished the initial draft for the sort, search and dictionary encoding algorithms. Discussions for more algorithms may be added in the future. This document will keep evolving to reflect the latest discussion results in the community and the latest code changes.
>>>>>>>>>>>
>>>>>>>>>>> Please give your valuable feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Liya Fan
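
As a pointer for readers who want to try the module discussed in this thread, the sketch below shows roughly how a fixed-width vector could be sorted and then binary-searched with the Java algorithm module. It is a best-effort illustration against the org.apache.arrow.algorithm API as it stood around releases 0.14/0.15; the exact class and method names used here (DefaultVectorComparators, FixedWidthInPlaceVectorSorter, VectorSearcher) and their signatures should be checked against the design document and the current code before relying on them.

    // Rough sketch of sorting and searching an IntVector with the Java
    // algorithm module; class and method names are best-effort and may
    // differ from the released API.
    import org.apache.arrow.algorithm.search.VectorSearcher;
    import org.apache.arrow.algorithm.sort.DefaultVectorComparators;
    import org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter;
    import org.apache.arrow.algorithm.sort.VectorValueComparator;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public class AlgorithmModuleSketch {
      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("values", allocator);
             IntVector key = new IntVector("key", allocator)) {
          // Populate an unsorted vector.
          vector.allocateNew(4);
          vector.set(0, 30);
          vector.set(1, 10);
          vector.set(2, 40);
          vector.set(3, 20);
          vector.setValueCount(4);

          // Sort the vector in place using the default comparator for its type.
          VectorValueComparator<IntVector> comparator =
              DefaultVectorComparators.createDefaultComparator(vector);
          new FixedWidthInPlaceVectorSorter<IntVector>().sortInPlace(vector, comparator);

          // Binary-search the sorted vector for the value held in a one-element key vector.
          key.allocateNew(1);
          key.set(0, 20);
          key.setValueCount(1);
          int index = VectorSearcher.binarySearch(vector, comparator, key, 0);
          System.out.println("value 20 found at index " + index);
        }
      }
    }

Dictionary encoding, the third algorithm family mentioned in the thread, follows a similar build-then-apply pattern (construct a dictionary vector, then encode/decode against it), but its encoder classes are not shown here.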