> To save the effort, or invest it in higher-priority issues, we plan to:
> 1. We will stop providing "additional algorithms", unless they are explicitly required.

This sounds reasonable; we can also evaluate on a case-by-case basis how widely applicable some are.

> 2. For existing additional algorithms in our code base, we will stop improving them.

OK, I'm a little afraid of bit-rot here, but we can see how things go.

Cheers,
Micah

On Tue, Oct 22, 2019 at 7:09 PM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Micah,
>
> Thank you for reading through my previous email.
>
> > Is the conversation about rejecting the changes in Flink something you can link to? I found [1] which seems to allow for Arrow in what seem like reasonable places, just not inside the core planner (and even that is a possibility with a proper PoC). However, I don't think the algorithms proposed here are directly related to those discussions.
>
> There is a short discussion [1] in the ML. Please note that our proposal is not officially "rejected". It is just ignored silently (in fact, this makes no difference to us). We have had some conferences/discussions with the Flink committers and founders, and it seems they like the ideas, but no progress has been made so far, because the change is too large and too risky. The other issue you indicated [2] represents another (earlier) attempt to incorporate Arrow into Flink. However, that issue has made no progress either.
>
> > I don't agree with this conclusion. Apache Drill, where most of the Java code came from, has been around for a longer period of time. Also, even without Arrow being around, columnar vs. row-based DB engines is a design decision that has nothing to do with existing open source projects. Does Flink use another open source library for its row representation?
>
> I think you mean that row vs. columnar representation and open source project selection are two independent issues. I agree with you. Flink has its own implementation of a row store, although I think they should have used Arrow directly (had it been available earlier), as columnar storage is the mainstream.
>
> > I think this circles back around to my original points:
> > 1. Which users are we expecting to use the algorithms package that aren't directly related to data transport in Java (i.e. additional algorithms)? In many cases the algorithms seem like they would be query-engine specific. I haven't seen much evidence that there are users of the Java code base that need all these algorithms.
> > 2. Contributions to any project consume resources and people's time. If there is only going to be one user of the code, it might not belong in Arrow "proper" due to these hurdles.
>
> I agree with you that contributing code consumes a lot of effort, and we should only provide general algorithms.
>
> To save the effort, or invest it in higher-priority issues, we plan to:
> 1. We will stop providing "additional algorithms", unless they are explicitly required.
> 2. For existing additional algorithms in our code base, we will stop improving them.
>
> Thanks again for your effort in reviewing the algorithms and for all the good review comments.
>
> Best,
> Liya Fan
>
> [1] http://mail-archives.apache.org/mod_mbox/flink-dev/201907.mbox/browser
> [2] https://issues.apache.org/jira/browse/FLINK-10929
>
> On Sun, Oct 20, 2019 at 12:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Hi Liya Fan,
>>
>> Is the conversation about rejecting the changes in Flink something you can link to?
>> I found [1] which seems to allow for Arrow in what seem like reasonable places, just not inside the core planner (and even that is a possibility with a proper PoC). However, I don't think the algorithms proposed here are directly related to those discussions.
>>
>>> I think the lesson learned is that we should provide some features proactively (at least the general features), and make them good enough. Apache Flink was started around 2015, and Arrow's Java project was started in 2016. If Arrow had been available earlier, maybe Flink would have chosen it in the first place.
>>
>> I don't agree with this conclusion. Apache Drill, where most of the Java code came from, has been around for a longer period of time. Also, even without Arrow being around, columnar vs. row-based DB engines is a design decision that has nothing to do with existing open source projects. Does Flink use another open source library for its row representation?
>>
>>> When a user needs an algorithm, it may already be too late. AFAIK, most users will choose to implement one by themselves, rather than opening a JIRA in the community. It takes a long time to provide a PR, review the code, merge the code, and wait for the next release.
>>
>> I think this circles back around to my original points:
>> 1. Which users are we expecting to use the algorithms package that aren't directly related to data transport in Java (i.e. additional algorithms)? In many cases the algorithms seem like they would be query-engine specific. I haven't seen much evidence that there are users of the Java code base that need all these algorithms.
>> 2. Contributions to any project consume resources and people's time. If there is only going to be one user of the code, it might not belong in Arrow "proper" due to these hurdles.
>>
>> Thanks,
>> Micah
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10929
>>
>> On Mon, Oct 14, 2019 at 10:38 PM Fan Liya <liya.fa...@gmail.com> wrote:
>>
>>> Hi Micah,
>>>
>>> Thanks a lot for your valuable comments. I mostly agree with your points.
>>>
>>> Please note that in addition to algorithms directly related to Arrow features, there are algorithms indirectly related to Arrow features, and they should be given a lower priority. An example is IPC -> dictionary encoding -> vector sort. I agree that these features should adhere to the requirements of Arrow specification features.
>>>
>>> I also want to discuss why developers/users are not using some of the algorithms (IMO):
>>>
>>> 1. For users to use the algorithms, they must be available and the users must be aware of them. Our first algorithm was published only 3 months ago, in 0.14, which has very limited functionality. In 0.15, we have more functionality, but 0.15 has been published for no more than 10 days.
>>>
>>> 2. For users to use the algorithms, they must be good enough. In particular:
>>> 1) They must be performant. So far for us, performance improvement has started, and there is still much room for further improvement.
>>> 2) They must be functionally complete. We have not reached this goal yet. For example, we do not support sorting all vector types.
>>> 3) The algorithms should be easy to use.
>>>
>>> 3. Some SQL engines rely on native code implementations. For example, Dremio has Gandiva, which is based on LLVM. For such scenarios, we do not recommend that they use our algorithms.
>>> 4. Some SQL engines rely on Java implementations, but they do not rely on Arrow (e.g. Drill, Flink, etc.). I think this issue (convincing them to use Arrow) should be given a higher priority. If these engines relied on Arrow, it would be more likely that they would use Arrow algorithms (provided that the algorithms are good enough).
>>>
>>> Concerning the last point, I want to share our experience when introducing Arrow to another project. (I am not sure if it is appropriate to discuss it here. Maybe you can give us some advice.)
>>>
>>> Apache Flink's runtime is based on Java but not on Arrow. We have a private Flink branch in our team, which is based on Arrow. Compared with the open source edition, our Arrow-based edition provides higher performance. It improves performance by 30% for the TPC-H 1TB benchmark (please see Section 4 of [1]). We wanted to contribute the changes to the Flink community and convince them to use Arrow in their core engine. However, at least for now, they have not accepted the proposal.
>>>
>>> The main reason is that the changes are too big, which is too risky: many underlying data structures, algorithms and the framework must be changed. They admit that using Arrow is better, and they are aware of the performance improvements, but they are just unwilling to take the risk.
>>>
>>> I think the lesson learned is that we should provide some features proactively (at least the general features), and make them good enough. Apache Flink was started around 2015, and Arrow's Java project was started in 2016. If Arrow had been available earlier, maybe Flink would have chosen it in the first place.
>>>
>>> When a user needs an algorithm, it may already be too late. AFAIK, most users will choose to implement one by themselves, rather than opening a JIRA in the community. It takes a long time to provide a PR, review the code, merge the code, and wait for the next release.
>>>
>>> Therefore, I think what we should do is try all means to make Arrow better: by providing general functionality, by making it performant, by making it functionally complete, and by making it easier to use. By making Arrow better, I believe more users will choose Arrow. When trust is established, more users will switch to Arrow.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> [1] https://docs.google.com/document/d/1cUHb-_Pbe4NMU3Igwt4tytEmI66jQxev00IL99e2wFY/edit#heading=h.50xdeg1htedb
>>> [2] https://issues.apache.org/jira/browse/FLINK-13053
>>>
>>> On Mon, Oct 14, 2019 at 5:46 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Liya Fan,
>>>>
>>>>> I think the algorithms would be better termed "micro-algorithms". They are termed "micro" in the sense that they do not directly compose a query engine, because they only provide primitive functionality (e.g. vector sort). Instead, they can be used as building blocks for query engines. The major benefit of the micro-algorithms is their generality: they can be used in a wide range of common scenarios, and they can be used in more than one query engine. In addition, there are other common scenarios, like vector data compression/decompression (e.g. dictionary encoding and RLE encoding, as we have already supported/discussed), IPC communication, data analysis, data mining, etc.
>>>>
>>>> I agree the algorithms can be generally useful.
>>>> But I still have concerns about who is going to use them.
>>>>
>>>> I think there are two categories the algorithms fall into:
>>>> 1. Algorithms directly related to Arrow specification features. For these, I agree some of the functionality will be needed as a reference implementation. At least for existing functionality I think there is already sufficient coverage, and in some cases (i.e. dictionary encoding) there is already duplicate coverage.
>>>>
>>>> 2. Other algorithms - I think these fall into "data analysis, data mining, etc.", and for these I think it goes back to the question of whether developers/users would use the given algorithms to build their own one-off analysis, or use already existing tools like Apache Spark or a SQL engine that already incorporates the algorithms.
>>>>
>>>> I'm a little disappointed that more maintainers/developers haven't given their input on this topic. I hope some will help with the work involved in reviewing the algorithms if they find them valuable.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Fri, Oct 4, 2019 at 11:59 PM fan_li_ya <fan_li...@aliyun.com> wrote:
>>>>
>>>>> Hi Micah and Praveen,
>>>>>
>>>>> Thanks a lot for your valuable feedback.
>>>>>
>>>>> My thoughts on the problems:
>>>>>
>>>>> 1. About the audience of the algorithms:
>>>>>
>>>>> I think the algorithms would be better termed "micro-algorithms". They are termed "micro" in the sense that they do not directly compose a query engine, because they only provide primitive functionality (e.g. vector sort). Instead, they can be used as building blocks for query engines. The major benefit of the micro-algorithms is their generality: they can be used in a wide range of common scenarios, and they can be used in more than one query engine. In addition, there are other common scenarios, like vector data compression/decompression (e.g. dictionary encoding and RLE encoding, as we have already supported/discussed), IPC communication, data analysis, data mining, etc.
>>>>>
>>>>> 2. About performance improvements:
>>>>>
>>>>> Code generation and template types are powerful tools. In addition, the JIT is also a powerful tool, as it can inline megamorphic virtual functions in many scenarios, if the algorithm is implemented appropriately. IMO, code generation is applicable to almost all scenarios for achieving good performance, if we are willing to pay the price in code readability. I will try to detail the principles for choosing between these tools for performance improvements later.
>>>>>
>>>>> Best,
>>>>> Liya Fan
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> From: Praveen Kumar <prav...@dremio.com>
>>>>> Date: Friday, October 4, 2019, 19:20
>>>>> To: Micah Kornfield <emkornfi...@gmail.com>
>>>>> Cc: Fan Liya <liya.fa...@gmail.com>; dev <dev@arrow.apache.org>
>>>>> Subject: Re: [DISCUSS][Java] Design of the algorithm module
>>>>>
>>>>> Hi Micah,
>>>>>
>>>>> I agree with 1. I think what an end user would really want is a query/data processing engine. I am not sure how easy/relevant the algorithms will be in the absence of the engine. For example, most of these operators would need to be pipelined and would need to handle memory, distribution, etc. So bundling this along with the engine makes a lot more sense; the interfaces required might be a bit different for that, too.
>>>>>
>>>>> Thx.
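
To make the point above about code generation, template types, and JIT inlining concrete, here is a toy Java sketch of the "template" flavor of the idea. This is not Arrow code, and every class name in it is invented for illustration: the sort loop is written against an abstract comparator, and each call site supplies a single concrete final comparator class, so the JIT sees one receiver type at the hot compare() call and can devirtualize and inline it.

    // Toy sketch, not Arrow code: illustrates keeping a hot virtual call
    // monomorphic so the JIT can devirtualize and inline it.
    abstract class IndexComparator {
      abstract int compare(int leftIndex, int rightIndex);
    }

    // One concrete, final comparator per data representation; a call site
    // that only ever sees this type stays monomorphic.
    final class IntArrayComparator extends IndexComparator {
      private final int[] values;

      IntArrayComparator(int[] values) {
        this.values = values;
      }

      @Override
      int compare(int leftIndex, int rightIndex) {
        return Integer.compare(values[leftIndex], values[rightIndex]);
      }
    }

    final class IndexSorter {
      // Generic insertion sort over an index array; the comparator call is
      // the virtual call we want the JIT to inline.
      static void insertionSort(int[] indices, IndexComparator comparator) {
        for (int i = 1; i < indices.length; i++) {
          for (int j = i; j > 0 && comparator.compare(indices[j - 1], indices[j]) > 0; j--) {
            int tmp = indices[j - 1];
            indices[j - 1] = indices[j];
            indices[j] = tmp;
          }
        }
      }
    }

A (byte)code-generation variant would instead emit a specialized copy of the sort loop per comparator at runtime, trading readability for guaranteed monomorphism; which approach is preferable depends largely on how many comparator types are live at a given call site.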

>>>>> On Thu, Oct 3, 2019 at 10:27 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Liya Fan,
>>>>>>
>>>>>> Thanks again for writing this up. I think it provides a road-map for the intended features. I commented on the document, but I wanted to raise a few high-level concerns here as well to get more feedback from the community.
>>>>>>
>>>>>> 1. It isn't clear to me who the users of this will be. My perception is that in the Java ecosystem there aren't use-cases for the algorithms outside of specific compute engines. I'm not super involved in open-source Java these days, so I would love to hear others' opinions. For instance, I'm not sure if Dremio would switch to using these algorithms instead of the ones they've already open-sourced [1], and Apache Spark, I believe, is only using Arrow for interfacing with Python (they similarly have their own compute pipeline). I think you mentioned in the past that these are being used internally in an engine that your company is working on, but if that is the only consumer it makes me wonder if the algorithm development might be better served as part of that engine.
>>>>>>
>>>>>> 2. If we do move forward with this, we also need a plan for how to optimize the algorithms to avoid virtual calls. There are two high-level approaches: template-based and (byte)code-generation based. Neither is applicable in all situations, but it would be good to come to consensus on when (and when not) to use each.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> [1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/sort/external
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 6:48 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Micah,
>>>>>>>
>>>>>>> Thanks for your effort and precious time. Looking forward to receiving more valuable feedback from you.
>>>>>>>
>>>>>>> Best,
>>>>>>> Liya Fan
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 2:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Liya Fan,
>>>>>>>>
>>>>>>>> I started reviewing but haven't gotten all the way through it. I will try to leave more comments over the next few days.
>>>>>>>>
>>>>>>>> Thanks again for the write-up; I think it will help frame a productive conversation.
>>>>>>>>
>>>>>>>> -Micah
>>>>>>>>
>>>>>>>> On Tue, Sep 17, 2019 at 1:47 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Micah,
>>>>>>>>>
>>>>>>>>> Thanks for your kind reminder. Comments are enabled now.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Liya Fan
>>>>>>>>>
>>>>>>>>> On Tue, Sep 17, 2019 at 12:45 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Liya Fan,
>>>>>>>>>>
>>>>>>>>>> Thank you for this write-up; it doesn't look like comments are enabled on the document. Could you allow for them?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 14, 2019 at 6:57 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear all,
>>>>>>>>>>>
>>>>>>>>>>> We have prepared a document for discussing the requirements, design and implementation issues for the algorithm module of Java:
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/17nqHWS7gs0vARfeDAcUEbhKMOYHnCtA46TOY_Nls69s/edit?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>> So far, we have finished the initial draft for the sort, search and dictionary encoding algorithms. Discussions for more algorithms may be added in the future. This document will keep evolving to reflect the latest discussion results in the community and the latest code changes.
>>>>>>>>>>>
>>>>>>>>>>> Please give your valuable feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Liya Fan
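
As a pointer for readers who want to try the module discussed in this thread, the sketch below shows roughly how a fixed-width vector could be sorted and then binary-searched with the Java algorithm module. It is a best-effort illustration against the org.apache.arrow.algorithm API as it stood around releases 0.14/0.15; the exact class and method names used here (DefaultVectorComparators, FixedWidthInPlaceVectorSorter, VectorSearcher) and their signatures should be checked against the design document and the current code before relying on them.

    // Rough sketch of sorting and searching an IntVector with the Java
    // algorithm module; class and method names are best-effort and may
    // differ from the released API.
    import org.apache.arrow.algorithm.search.VectorSearcher;
    import org.apache.arrow.algorithm.sort.DefaultVectorComparators;
    import org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter;
    import org.apache.arrow.algorithm.sort.VectorValueComparator;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public class AlgorithmModuleSketch {
      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("values", allocator);
             IntVector key = new IntVector("key", allocator)) {
          // Populate an unsorted vector.
          vector.allocateNew(4);
          vector.set(0, 30);
          vector.set(1, 10);
          vector.set(2, 40);
          vector.set(3, 20);
          vector.setValueCount(4);

          // Sort the vector in place using the default comparator for its type.
          VectorValueComparator<IntVector> comparator =
              DefaultVectorComparators.createDefaultComparator(vector);
          new FixedWidthInPlaceVectorSorter<IntVector>().sortInPlace(vector, comparator);

          // Binary-search the sorted vector for the value held in a one-element key vector.
          key.allocateNew(1);
          key.set(0, 20);
          key.setValueCount(1);
          int index = VectorSearcher.binarySearch(vector, comparator, key, 0);
          System.out.println("value 20 found at index " + index);
        }
      }
    }

Dictionary encoding, the third algorithm family mentioned in the thread, follows a similar build-then-apply pattern (construct a dictionary vector, then encode/decode against it), but its encoder classes are not shown here.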