Re: Crail, Albis and Arrow

Wes McKinney Fri, 07 Sep 2018 07:57:44 -0700

On Fri, Sep 7, 2018 at 8:03 AM Animesh Trivedi
<[email protected]> wrote:
>
> To all,
>
> the blog is updated to:
> 1) point out that this blog is from a user of the Crail project, not an
> endorsement from the Crail project.
> 2) clarify that Arrow is not a storage format but an IPC format. And we
> have evaluated the performance of the Java libraries on HDFS, which has
> headroom for further performance optimizations.
>
> Please have a look and let me know if some further clarification is needed.
>
> Coming back to Wes's comments (hi again!): how should we proceed if I want
> to benchmark Arrow's performance on Crail/HDFS in Java? I would be happy to
> have your input in this process, collaborating on this investigation and
> writing up our findings as a follow-up Crail blog post about "Arrow on Crail
> delivering 100 Gbps". I suppose in the process I/we will look closely at
> the I/O paths of the Arrow/Java libraries. Does this sound like an
> interesting line of work to you?


Yes -- since I'm not a Java developer the best place to discuss this
further would be on the dev@arrow mailing list.

Thanks!
Wes

>
> Cheers,
> --
> Animesh
>
> On Thu, Sep 6, 2018 at 5:30 PM Wes McKinney <[email protected]> wrote:
>
> > hi Animesh,
> >
> > On Thu, Sep 6, 2018 at 12:23 AM Animesh Trivedi
> > <[email protected]> wrote:
> > >
> > > Hi Wes,
> > >
> > > Nice to connect to you too. We are happy to have your input on Albis and
> > > Arrow. Specifically:
> > >
> > > - We understand that Arrow is not a file format, but we chose to evaluate
> > > it in a mix with storage formats as Arrow is designed for in-memory
> > > columnar storage. The "in-memory" aspect of it is closer to flash/NVMe
> > > than disks in terms of performance. And personally I was curious to try
> > > out Arrow :) We coded a simple benchmark (how fast one can materialize
> > > values) because anything more complicated like relational queries would
> > > bring complexity from the underlying SQL engine.
> >
> > Right, but what you did in your benchmarks was neither in-memory nor
> > memory-mapping IIUC -- you are accessing the memory through
> > synchronous Hadoop protobuf RPCs which deeply conflates the results
> > (even if the HDFS nodes are running atop NVMe). Additionally, the
> > Arrow Java library does not even yet support memory mapping (we do in
> > C++), so the only way to fairly evaluate that code right now is to run
> > on RAM-resident data.
> >
> > - Wes
> >
> > >
> > > - Yes, I will make it clear that the performance of Arrow that is
> > > evaluated in the blog is for the less beaten on-heap Java path.
> > >
> > > Now coming to the interesting bit: Arrow storage performance tuning (on
> > > HDFS or Crail) is something that I can help to investigate. This is a
> > > good starting point. I will update you all on the Crail and Arrow mailing
> > > lists. Beyond performance, the multi-file storage model is where I am most
> > > interested. It will help us to explore how different file types (column
> > > groups, metadata) can be mapped to the different storage types (NVMe,
> > > DRAM, 3DXP) that Crail supports. I think this is an interesting avenue to
> > > explore.
> > >
> > > Wes and Julian - thanks for the discussion.
> > >
> > > Cheers,
> > > --
> > > Animesh
> > >
> > > On Wed, Sep 5, 2018 at 8:57 PM Julian Hyde <[email protected]> wrote:
> > >
> > > > Animesh,
> > > >
> > > > Thanks for your thoughtful response.
> > > >
> > > > I think we’re now on the same page about the opportunities for
> > > > collaboration. And I saw that Wes posted to this thread too. I hope you
> > > > find ways to make Arrow and Crail work well together.
> > > >
> > > > Julian
> > > >
> > > >
> > > > > On Sep 5, 2018, at 3:49 AM, Animesh Trivedi <[email protected]> wrote:
> > > > >
> > > > > Hi Julian,
> > > > >
> > > > > Thanks for posting your thoughts.
> > > > >
> > > > > > [As a Crail committer]: We agree that the notion of "we" creates
> > > > > > confusion. The Crail blog follows the trend in community projects,
> > > > > > where a blog post falls into one of two categories. In the first type,
> > > > > > a developer talks about recent improvements, features, performance
> > > > > > evaluation, etc. In the second type, "a user" presents how they used
> > > > > > the system for their use case. The Albis blog post falls into the
> > > > > > second category. We can (and should, for future reference) definitely
> > > > > > categorize and mark it clearly that way. And we would encourage anyone
> > > > > > in the community who tries Crail to reach out to us to present their
> > > > > > story on the Crail blog. Crail is committed to providing the best
> > > > > > possible performance to all its users, be it Albis, Arrow, ORC, or
> > > > > > Parquet.
> > > > >
> > > > > > [As a developer of Albis and user of Crail]: I understand your
> > > > > > sentiment regarding the format wars, and it is not the aim of Albis to
> > > > > > establish yet another file format. Albis started as a prototype to
> > > > > > quickly "explore" various design choices for storing relational data
> > > > > > for a variety of scenarios with high-performance storage/networking
> > > > > > devices - the kind of devices Crail targets. This is something that I
> > > > > > cannot easily do with Arrow, ORC, or Parquet with HDFS (or something
> > > > > > similar) within a reasonable effort and time-frame, as they all have
> > > > > > already chosen certain design points and trade-offs. Crail and Albis
> > > > > > are not tied to each other (nor preferred over other choices), though
> > > > > > since both come from the same set of developers, I can see why the
> > > > > > confusion arises. Having said this, I will be happy to contribute the
> > > > > > findings from Albis back to the Arrow community, and would appreciate
> > > > > > any help with that. I had a brief discussion with Julien Le Dem at the
> > > > > > last DataWorks Summit in San Jose about Albis as well. I have not done
> > > > > > a thorough investigation of Arrow over Crail, but perhaps that is
> > > > > > something that can be picked up now as a starting point.
> > > > >
> > > > > I hope this clarifies the confusion. We will fix the blog post.
> > > > >
> > > > > Thanks,
> > > > > --
> > > > > Animesh
> > > > >
> > > > > > On Tue, Sep 4, 2018 at 9:59 PM Julian Hyde <[email protected]> wrote:
> > > > >
> > > > > >> I just read the blog post [1] about Crail and file formats. (I have to
> > > > > >> declare my interests up front: I have been a huge supporter of Apache
> > > > > >> Arrow, and I am a PMC member. I’m speaking here as an Arrow contributor
> > > > > >> and enthusiast, not as a mentor of Crail.)
> > > > >>
> > > > > >> I am a bit troubled about the endorsement of Albis in a Crail blog post.
> > > > > >> For example, “we have developed a new file format called Albis”. Since
> > > > > >> the blog post is not signed, I take it that “we” means the authors of
> > > > > >> the paper [2] mentioned in the blog post. But I hope that “we” does not
> > > > > >> mean “we as Crail committers and PMC members”.
> > > > >>
> > > > > >> I know that there are different forces at play if you work for a
> > > > > >> corporation, or are a researcher, or are an idealistic open source
> > > > > >> contributor. As a researcher, you need to invent new stuff and prove
> > > > > >> that it is better than everything that has been done before.
> > > > >>
> > > > > >> But I’ve been through the file format wars (ORC vs Parquet), driven in
> > > > > >> large part by two competing vendors. It was sickening, and a huge waste
> > > > > >> of effort. Please, please don’t let this happen again. If you want to
> > > > > >> make Crail successful, you should make it absolutely clear to the Arrow,
> > > > > >> ORC and Parquet communities that you will help to make Crail work as
> > > > > >> well as it possibly can.
> > > > >>
> > > > > >> Also, on paper Albis looks very similar to Arrow, and the performance
> > > > > >> gap is fairly narrow. If you have found insights that would improve
> > > > > >> Arrow, I encourage you to share them and make Arrow better. It may be
> > > > > >> good research practice to accentuate the differences between the two,
> > > > > >> but it’s good open source practice to find consensus between
> > > > > >> technologies, and merge communities. There is a lot of work to be done,
> > > > > >> and too few people to do it.
> > > > >>
> > > > > >> Lastly, I know I seem to be giving mixed messages here. I do believe
> > > > > >> that content about Crail will help drive engagement and build community
> > > > > >> (controversial content even more so). I am delighted that the Crail
> > > > > >> team is writing blog posts and posting them to Twitter. But be careful
> > > > > >> not to alienate communities that could help Crail gain widespread
> > > > > >> adoption.
> > > > >>
> > > > >> Julian
> > > > >>
> > > > > >> [1] http://crail.incubator.apache.org/blog/2018/08/sql-p1.html
> > > > > >>
> > > > > >> [2] https://www.usenix.org/conference/atc18/presentation/trivedi
> > > >
> > > >
> >
