On Fri, Jun 8, 2018 at 9:18 AM, Tim Armstrong <tarmstr...@cloudera.com> wrote:
> > Meanwhile we found Impala is a very good MPP SQL query engine, so we
> > integrated them together.
>
> Palo didn't integrate with Impala, it forked Impala's codebase and embedded
> it in its own repository. I don't remember any attempts from the Palo team
> to engage with the Impala community or attempt to work with us to
> contribute any improvements.
>
> It looks like Palo is still pulling in new code from Impala. E.g. this
> commit includes a bunch of code I wrote as part of IMPALA-3200:
> https://github.com/baidu/palo/commit/2419384e8a211f10e7636afc6d3423700ba22b5a#diff-1c501d9a8b5c3d1d1cce48d5e1fb0edf
>
> The code isn't owned by any individual, I contributed it to Apache and it's
> free for anyone to do what they want to do with it, but pulling in
> improvements from other projects without any attempt to attribute it or
> contribute improvements back seems contrary to the Apache way.

+1. Also briefly browsing the code I found suspicious commits like this one:
https://github.com/baidu/palo/commit/6486be64c319fe0beb8c6b4430c1662de54f182e
... in which a GPL license copyright by Oracle was "fixed" to be an Apache
license copyright Baidu.

So if this project does enter incubation I think we should be extra careful
to audit the origins of all of the source code.

-Todd

> On Fri, Jun 8, 2018 at 9:12 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> > On Thu, Jun 7, 2018 at 11:55 PM, Li,De(BDG) <l...@baidu.com> wrote:
> >
> > > Hi, Jim
> > >
> > > Thank you for your response.
> > > Actually, we start Palo in several years ago, and that time we developed
> > > the storage engine based on Mesa technology.
> > > Meanwhile we found Impala is a very good MPP SQL query engine, so we
> > > integrated them together.
> >
> > From what I can tell of the Palo source, it's not so much an integration as
> > a copied-and-modified codebase, right?
> > i.e. Palo does not use Impala as a dependency, but rather shares a lot of
> > code from the Impala project that has since diverged.
> >
> > > With this integration, the goal of Palo is to implement a single,
> > > full-featured, mysql protocol compatible data warehousing.
> >
> > That sounds pretty similar to the goals of the Impala project. Impala isn't
> > MySQL-compatible at the moment but that seems more like a particular
> > feature that could be added rather than a distinct identity of the project.
> > Otherwise, Impala's goal is to be a full featured data warehouse engine as
> > well.
> >
> > Generally Apache has no rules against multiple projects fulfilling similar
> > goals or use cases, even when those projects might compete. However I think
> > it would be relatively unusual to incubate a project that appears to be
> > derived from a fork of an existing project, at least without first
> > considering whether the additional feature set could be contributed back to
> > the existing community.
> >
> > -Todd
> >
> > > On 2018/6/8 at 1:55 PM, "Jim Apple" <jbap...@apache.org> wrote:
> > >
> > > >Hello! As a contributor to Impala, I’d be interested in hearing thoughts
> > > >from the Palo community about integration between Impala and Palo.
> > > >
> > > >For instance, are there any apparent design goals of Impala that the Palo
> > > >community thinks are fundamentally incompatible with Palo?
> > > >
> > > >Thanks,
> > > >Jim
> > > >
> > > >On 2018/06/08 04:45:32, "Li,De(BDG)" <l...@baidu.com> wrote:
> > > >> Hi all,
> > > >>
> > > >> I am Reed, as a developer worked with the team for Palo (a MPP-based
> > > >> interactive SQL data warehousing).
> > > >> https://github.com/baidu/palo/wiki/Palo-Overview
> > > >>
> > > >> We propose to contribute Palo as an Apache Incubator project, and
> > > >> we are still looking for possible Champion if anyone would like to
> > > >> volunteer. Thanks a lot.
> > > >>
> > > >> Best Regards,
> > > >> Reed
> > > >>
> > > >> ===================
> > > >> The draft of the proposal as below:
> > > >>
> > > >> #Apache Palo
> > > >>
> > > >> ##Abstract
> > > >>
> > > >> Palo is a MPP-based interactive SQL data warehousing for reporting and
> > > >> analysis.
> > > >>
> > > >> ##Proposal
> > > >>
> > > >> We propose to contribute the Palo codebase and associated artifacts
> > > >> (e.g. documentation, web-site content etc.) to the Apache Software
> > > >> Foundation with the intent of forming a productive, meritocratic and
> > > >> open community around Palo’s continued development, according to the
> > > >> ‘Apache Way’.
> > > >>
> > > >> Baidu owns several trademarks regarding Palo, and proposes to transfer
> > > >> ownership of those trademarks in full to the ASF.
> > > >>
> > > >> ###Overview of Palo
> > > >>
> > > >> Palo’s implementation consists of two daemons: Frontend (FE) and
> > > >> Backend (BE).
> > > >>
> > > >> **Frontend daemon** consists of query coordinator and catalog manager.
> > > >> Query coordinator is responsible for receiving users’ sql queries,
> > > >> compiling queries and managing queries execution. Catalog manager is
> > > >> responsible for managing metadata such as databases, tables, partitions,
> > > >> replicas and etc. Several frontend daemons could be deployed to
> > > >> guarantee fault-tolerance, and load balancing.
> > > >>
> > > >> **Backend daemon** stores the data and executes the query fragments.
> > > >> Many backend daemons could also be deployed to provide scalability and
> > > >> fault-tolerance.
> > > >>
> > > >> A typical Palo cluster generally composes of several frontend daemons
> > > >> and dozens to hundreds of backend daemons.
> > > >>
> > > >> Users can use MySQL client tools to connect any frontend daemon to
> > > >> submit SQL query. Frontend receives the query and compiles it into query
> > > >> plans executable by the Backend.
> > > >> Then Frontend sends the query plan fragments to Backend. Backend will
> > > >> build a query execution DAG. Data is fetched and pipelined into the
> > > >> DAG. The final result response is sent to client via Frontend. The
> > > >> distribution of query fragment execution takes minimizing data movement
> > > >> and maximizing scan locality as the main goal.
> > > >>
> > > >> ##Background
> > > >>
> > > >> At Baidu, Prior to Palo, different tools were deployed to solve diverse
> > > >> requirements in many ways. And when a use case requires the simultaneous
> > > >> availability of capabilities that cannot all be provided by a single
> > > >> tool, users were forced to build hybrid architectures that stitch
> > > >> multiple tools together, but we believe that they shouldn’t need to
> > > >> accept such inherent complexity. A storage system built to provide great
> > > >> performance across a broad range of workloads provides a more elegant
> > > >> solution to the problems that hybrid architectures aim to solve. Palo is
> > > >> the solution.
> > > >>
> > > >> Palo is designed to be a simple and single tightly coupled system, not
> > > >> depending on other systems. Palo provides high concurrent low latency
> > > >> point query performance, but also provides high throughput queries of
> > > >> ad-hoc analysis. Palo provides bulk-batch data loading, but also
> > > >> provides near real-time mini-batch data loading. Palo also provides high
> > > >> availability, reliability, fault tolerance, and scalability.
> > > >>
> > > >> ##Rationale
> > > >>
> > > >> Palo mainly integrates the technology of Google Mesa and Apache Impala.
> > > >>
> > > >> Mesa is a highly scalable analytic data storage system that stores
> > > >> critical measurement data related to Google's Internet advertising
> > > >> business.
> > > >> Mesa is designed to satisfy complex and challenging set of
> > > >> users’ and systems’ requirements, including near real-time data
> > > >> ingestion and query ability, as well as high availability, reliability,
> > > >> fault tolerance, and scalability for large data and query volumes.
> > > >>
> > > >> Impala is a modern, open-source MPP SQL engine architected from the
> > > >> ground up for the Hadoop data processing environment. At present, by
> > > >> virtue of its superior performance and rich functionality, Impala has
> > > >> been comparable to many commercial MPP database query engine. Mesa can
> > > >> satisfy the needs of many of our storage requirements, however Mesa
> > > >> itself does not provide a SQL query engine; Impala is a very good MPP
> > > >> SQL query engine, but the lack of a perfect distributed storage engine.
> > > >> So in the end we chose the combination of these two technologies.
> > > >>
> > > >> Learning from Mesa’s data model, we developed a distributed storage
> > > >> engine. Unlike Mesa, this storage engine does not rely on any
> > > >> distributed file system. Then we deeply integrate this storage engine
> > > >> with Impala query engine. Query compiling, query execution coordination
> > > >> and catalog management of storage engine are integrated to be frontend
> > > >> daemon; query execution and data storage are integrated to be backend
> > > >> daemon. With this integration, we implemented a single, full-featured,
> > > >> high performance state the art of MPP database, as well as maintaining
> > > >> the simplicity.
> > > >>
> > > >> ##Current Status
> > > >>
> > > >> Palo has been an open source project on GitHub
> > > >> (https://github.com/baidu/palo).
> > > >>
> > > >> ###Meritocracy
> > > >>
> > > >> Palo has been deployed in production at Baidu and is applying more than
> > > >> 200 lines of business.
> > > >> It has demonstrated great performance benefits
> > > >> and has proved to be a better way for reporting and analysis based big
> > > >> data. Still We look forward to growing a rich user and developer
> > > >> community.
> > > >>
> > > >> ###Community
> > > >>
> > > >> Palo seeks to develop developer and user communities during incubation.
> > > >>
> > > >> ###Core Developers
> > > >>
> > > >> * Ruyue Ma (https://github.com/maruyue, maru...@baidu.com)
> > > >> * Chun Zhao (https://github.com/imay, buaa.zh...@gmail.com)
> > > >> * Mingyu Chen (https://github.com/morningman, chenmin...@baidu.com)
> > > >> * De Li (https://github.com/lide-reed, mailtol...@sina.com)
> > > >> * Hao Chen (https://github.com/chenhao7253886, chenha...@baidu.com)
> > > >> * Chaoyong Li (https://github.com/cyongli, lichaoy...@baidu.com)
> > > >> * Bin Lin (https://github.com/lingbin, lingbi...@gmail.com)
> > > >>
> > > >> ###Alignment
> > > >>
> > > >> Palo is related to several other Apache projects:
> > > >>
> > > >> * Palo can also read data stored in Apache Hadoop clusters powered by
> > > >>   the HDFS filesystem.
> > > >> * Palo is closely integrated with Impala, which is also being proposed
> > > >>   to the Incubator.
> > > >> * Palo uses Apache Thrift as its RPC and serialization framework of
> > > >>   choice.
> > > >>
> > > >> ##Known Risks
> > > >>
> > > >> ###Orphaned Products
> > > >>
> > > >> The core developers of Palo team plan to work full time on this
> > > >> project. There is very little risk of Palo getting orphaned since at
> > > >> least one large company (Baidu) is extensively using it in their
> > > >> production. For example, currently there are more than 200 use cases
> > > >> using Palo in production.
> > > >> Furthermore, since Palo was open sourced at
> > > >> the beginning of October 2017, it has received more than 660 stars and
> > > >> been forked nearly 170 times. We plan to extend and diversify this
> > > >> community further through Apache.
> > > >>
> > > >> ###Inexperience with Open Source
> > > >>
> > > >> The core developers are all active users and followers of open source.
> > > >> They are already committers and contributors to the Palo Github project.
> > > >> All have been involved with the source code that has been released under
> > > >> an open source license, and several of them also have experience
> > > >> developing code in an open source environment. Though the core set of
> > > >> Developers do not have Apache Open Source experience, there are plans to
> > > >> onboard individuals with Apache open source experience on to the project.
> > > >>
> > > >> ###Homogenous Developers
> > > >>
> > > >> The most of core developers are from Baidu, but after Palo was open
> > > >> sourced, Palo received a lot of bug fixes and enhancements from other
> > > >> developers not working at Baidu.
> > > >>
> > > >> ###Reliance on Salaried Developers
> > > >>
> > > >> Baidu invested in Palo as the OLAP solution and some of its key
> > > >> engineers are working full time on the project. In addition, since there
> > > >> is a growing Big Data need for scalable OLAP solutions, we look forward
> > > >> to other Apache developers and researchers to contribute to the project.
> > > >> Also key to addressing the risk associated with relying on Salaried
> > > >> developers from a single entity is to increase the diversity of the
> > > >> contributors and actively lobby for Domain experts in the BI space to
> > > >> contribute. Apache Palo intends to do this.
> > > >>
> > > >> ###An Excessive Fascination with the Apache Brand
> > > >>
> > > >> Palo is proposing to enter incubation at Apache in order to help
> > > >> efforts to diversify the committer-base, not so much to capitalize on
> > > >> the Apache brand. The Palo project is in production use already inside
> > > >> Baidu, but is not expected to be an Baidu product for external
> > > >> customers. As such, the Palo project is not seeking to use the Apache
> > > >> brand as a marketing tool.
> > > >>
> > > >> ##Documentation
> > > >>
> > > >> Information about Palo can be found at https://github.com/baidu/palo.
> > > >> The following links provide more information about Palo in open source:
> > > >>
> > > >> * Palo wiki site: https://github.com/baidu/palo/wiki
> > > >> * Codebase at Github: https://github.com/baidu/palo
> > > >> * Issue Tracking: https://github.com/baidu/palo/issues
> > > >> * Overview: https://github.com/baidu/palo/wiki/Palo-Overview
> > > >> * FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ
> > > >>
> > > >> ##Initial Source
> > > >>
> > > >> Palo has been under development since 2017 by a team of engineers at
> > > >> Baidu Inc. It is currently hosted on Github.com under an Apache license
> > > >> at https://github.com/baidu/palo.
> > > >>
> > > >> ##External Dependencies
> > > >>
> > > >> Palo has the following external dependencies.
> > > >>
> > > >> * Google gflags (BSD)
> > > >> * Google glog (BSD)
> > > >> * Apache Thrift (Apache Software License v2.0)
> > > >> * Apache Commons (Apache Software License v2.0)
> > > >> * Boost (Boost Software License)
> > > >> * OpenLdap (OpenLDAP Software License)
> > > >> * rapidjson (Tencent)
> > > >> * Google RE2 (BSD-style)
> > > >> * lz4 (BSD)
> > > >> * snappy (BSD)
> > > >> * cyrus-sasl (CMU License)
> > > >> * Twitter Bootstrap (Apache Software License v2.0)
> > > >> * d3 (BSD)
> > > >> * LLVM (BSD-like)
> > > >>
> > > >> Build and test dependencies:
> > > >>
> > > >> * ant (Apache Software License v2.0)
> > > >> * Apache Maven (Apache Software License v2.0)
> > > >> * cmake (BSD)
> > > >> * clang (BSD)
> > > >> * Google gtest (Apache Software License v2.0)
> > > >>
> > > >> ##Required Resources
> > > >>
> > > >> ###Mailing List
> > > >>
> > > >> There are currently no mailing lists. The usual mailing lists are
> > > >> expected to be set up when entering incubation:
> > > >>
> > > >> priv...@palo.incubator.apache.org
> > > >> d...@palo.incubator.apache.org
> > > >> comm...@palo.incubator.apache.org
> > > >>
> > > >> ###Subversion Directory
> > > >>
> > > >> Upon entering incubation: https://github.com/baidu/palo.
> > > >> After incubation, we want to move the existing repo from
> > > >> https://github.com/baidu/palo to Apache infrastructure.
> > > >>
> > > >> ###Issue Tracking
> > > >>
> > > >> Palo currently uses GitHub to track issues. Would like to continue to
> > > >> do so while we discuss migration possibilities with the ASF Infra
> > > >> committee.
> > > >>
> > > >> ###Other Resources
> > > >>
> > > >> The existing code already has unit tests so we will make use of
> > > >> existing Apache continuous testing infrastructure. The resulting load
> > > >> should not be very large.
> > > >>
> > > >> ##Initial Committers
> > > >>
> > > >> * Ruyue Ma (https://github.com/maruyue, maru...@baidu.com)
> > > >> * Chun Zhao (https://github.com/imay, buaa.zh...@gmail.com)
> > > >> * Mingyu Chen (https://github.com/morningman, chenmin...@baidu.com)
> > > >> * De Li (https://github.com/lide-reed, mailtol...@sina.com)
> > > >> * Hao Chen (https://github.com/chenhao7253886, chenha...@baidu.com)
> > > >> * Chaoyong Li (https://github.com/cyongli, lichaoy...@baidu.com)
> > > >> * Bin Lin (https://github.com/lingbin, lingbi...@gmail.com)
> > > >>
> > > >> ##Affiliations
> > > >>
> > > >> The initial committers are employees of Baidu Inc. The nominated
> > > >> mentors are employees of TODO.
> > > >>
> > > >> ##Sponsors
> > > >>
> > > >> ###Champion
> > > >>
> > > >> TODO
> > > >>
> > > >> ###Nominated Mentors
> > > >>
> > > >> * sijie guo, guosi...@gmail.com
> > > >> * Luke Han, luke...@apache.org
> > > >> * Zheng Shao, zs...@apache.org
> > > >>
> > > >> ###Sponsoring Entity
> > > >>
> > > >> We are requesting the Incubator to sponsor this project.
> > > >>
> > > >
> > > >---------------------------------------------------------------------
> > > >To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > >For additional commands, e-mail: general-h...@incubator.apache.org
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera