Hi Wes - absolutely! I will start a discussion on the Arrow mailing list.

Cheers,
--
Animesh
On Fri, Sep 7, 2018 at 4:57 PM Wes McKinney wrote:

> On Fri, Sep 7, 2018 at 8:03 AM Animesh Trivedi wrote:
> >
> > To all,
> >
> > the blog is updated to:
> > 1) point out that this blog is from a user of the Crail project, not
> > an endorsement from the Crail project.
> > 2) clarify that Arrow is not a storage format but an IPC format. And
> > we have evaluated the performance of the Java libraries on HDFS,
> > which has headroom for further performance optimizations.
> >
> > Please have a look and let me know if some further clarification is
> > needed.
> >
> > Coming back to Wes's comments (hi again!): how should we proceed if I
> > want to benchmark Arrow's performance on Crail/HDFS in Java? I would
> > be happy to have your input in this process, collaborate on this
> > investigation, and write up our findings as a follow-up Crail blog
> > about "Arrow on Crail delivering 100 Gbps". I suppose in the process
> > I/we will look closely at the I/O paths of the Arrow/Java libraries.
> > Does this sound like an interesting line of work to you?
>
> Yes -- since I'm not a Java developer the best place to discuss this
> further would be on the dev@arrow mailing list.
>
> Thanks!
> Wes
>
> >
> > Cheers,
> > --
> > Animesh
> >
> > On Thu, Sep 6, 2018 at 5:30 PM Wes McKinney wrote:
> >
> > > hi Animesh,
> > >
> > > On Thu, Sep 6, 2018 at 12:23 AM Animesh Trivedi wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > Nice to connect to you too. We are happy to have your input on
> > > > Albis and Arrow. Specifically:
> > > >
> > > > - We understand that Arrow is not a file format, but we chose to
> > > > evaluate it in a mix with storage formats as Arrow is designed
> > > > for in-memory columnar storage. The "in-memory" aspect of it is
> > > > closer to flash/NVMe than disks in terms of performance.
> > > > And personally I was curious to try out Arrow :) We coded a
> > > > simple benchmark (how fast one can materialize values) because
> > > > anything more complicated like relational queries would bring
> > > > complexity from the underlying SQL engine.
> > >
> > > Right, but what you did in your benchmarks was neither in-memory
> > > nor memory-mapping IIUC -- you are accessing the memory through
> > > synchronous Hadoop protobuf RPCs which deeply conflates the results
> > > (even if the HDFS nodes are running atop NVMe). Additionally, the
> > > Arrow Java library does not even yet support memory mapping (we do
> > > in C++), so the only way to fairly evaluate that code right now is
> > > to run on RAM-resident data.
> > >
> > > - Wes
> > >
> > > >
> > > > - Yes, I will make it clear that the performance of Arrow that is
> > > > evaluated in the blog is for the less beaten on-heap Java path.
> > > >
> > > > Now coming to the interesting bit. Arrow storage performance
> > > > tuning (HDFS or Crail) is something that I can help to
> > > > investigate. This is a good starting point. I will update you all
> > > > on the Crail and Arrow mailing lists. Beyond performance, the
> > > > multi-file storage model is where I am most interested. It will
> > > > help us to explore how different file types (column groups,
> > > > metadata) can be mapped to the different storage types (NVMe,
> > > > DRAM, 3DXP) that Crail supports. I think this is an interesting
> > > > avenue to explore.
> > > >
> > > > Wes and Julian - thanks for the discussion.
> > > >
> > > > Cheers,
> > > > --
> > > > Animesh
> > > >
> > > > On Wed, Sep 5, 2018 at 8:57 PM Julian Hyde wrote:
> > > >
> > > > > Animesh,
> > > > >
> > > > > Thanks for your thoughtful response.
> > > > >
> > > > > I think we’re now on the same page about the opportunities for
> > > > > collaboration. And I saw that Wes posted to this thread too.
> > > > > I hope you find ways to make Arrow and Crail work well
> > > > > together.
> > > > >
> > > > > Julian
> > > > >
> > > > >
> > > > > > On Sep 5, 2018, at 3:49 AM, Animesh Trivedi wrote:
> > > > > >
> > > > > > Hi Julian,
> > > > > >
> > > > > > Thanks for posting your thoughts.
> > > > > >
> > > > > > [As a Crail committer]: We agree that the notion of "we"
> > > > > > creates confusion. The Crail blog follows the trend in
> > > > > > community projects, where a blog post falls into one of two
> > > > > > categories. In the first type, a developer talks about recent
> > > > > > improvements, features, performance evaluation, etc. In the
> > > > > > second type, "a user" presents how they used the system for
> > > > > > their use case. The Albis blog post falls into the second
> > > > > > category. We can (and should for future reference) definitely
> > > > > > categorize and mark it clearly that way. And we would
> > > > > > encourage anyone in the community who tries Crail to reach
> > > > > > out to us to present their story on the Crail blog. Crail is
> > > > > > committed to providing the best possible performance to all
> > > > > > its users, be it Albis, Arrow, ORC, or Parquet.
> > > > > >
> > > > > > [As a developer of Albis and user of Crail]: I understand
> > > > > > your sentiment regarding the format wars, and it is not the
> > > > > > aim of Albis to establish yet another file format. Albis
> > > > > > started as a prototype to quickly "explore" various design
> > > > > > choices for storing relational data for a variety of
> > > > > > scenarios with high-performance storage/networking devices -
> > > > > > the kind of devices Crail targets.
> > > > > > This is something that I cannot easily do with Arrow, ORC,
> > > > > > or Parquet with HDFS (or something similar) within a
> > > > > > reasonable effort and time-frame as they all have already
> > > > > > chosen certain design points and trade-offs. Crail and Albis
> > > > > > are not tied to each other (nor preferred over other
> > > > > > choices), though since they come from the same set of
> > > > > > developers, I can see why the confusion arises. Having said
> > > > > > this, I will be happy to contribute the findings from Albis
> > > > > > back to the Arrow community, and would appreciate any help
> > > > > > with that. I had a brief discussion with Julien Le Dem at the
> > > > > > last DataWorks Summit in San Jose about Albis as well. I have
> > > > > > not done a thorough investigation of Arrow over Crail, but
> > > > > > perhaps that is something that can be picked up now as a
> > > > > > starting point.
> > > > > >
> > > > > > I hope this clarifies the confusion. We will fix the blog
> > > > > > post.
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > Animesh
> > > > > >
> > > > > > On Tue, Sep 4, 2018 at 9:59 PM Julian Hyde wrote:
> > > > > >
> > > > > >> I just read the blog post [1] about Crail and file formats.
> > > > > >> (I have to declare my interests up front: I have been a huge
> > > > > >> supporter of Apache Arrow, and I am a PMC member. I’m
> > > > > >> speaking here as an Arrow contributor and enthusiast, not as
> > > > > >> a mentor of Crail.)
> > > > > >>
> > > > > >> I am a bit troubled about the endorsement of Albis in a Crail
> > > > > >> blog post. For example, “we have developed a new file format
> > > > > >> called Albis”. Since the blog post is not signed, I take it
> > > > > >> that “We” means the authors of the paper [2] mentioned in the
> > > > > >> blog post.
> > > > > >> But I hope that “we” does not mean “we as Crail committers
> > > > > >> and PMC members”.
> > > > > >>
> > > > > >> I know that there are different forces at play if you work
> > > > > >> for a corporation, or are a researcher, or are an idealistic
> > > > > >> open source. As a researcher, you need to invent new stuff
> > > > > >> and prove that it is better than everything that has been
> > > > > >> done before.
> > > > > >>
> > > > > >> But I’ve been through the file format wars -- ORC vs
> > > > > >> Parquet -- driven in large part by two competing vendors. It
> > > > > >> was sickening, and a huge waste of effort. Please, please
> > > > > >> don’t let this happen again. If you want to make Crail
> > > > > >> successful, you should make it absolutely clear to the Arrow,
> > > > > >> ORC and Parquet communities that you will help to make Crail
> > > > > >> work as well as it possibly can.
> > > > > >>
> > > > > >> Also, on paper Albis looks very similar to Arrow, and the
> > > > > >> performance gap is fairly narrow. If you have found insights
> > > > > >> that would improve Arrow, I encourage you to share them and
> > > > > >> make Arrow better. It may be good research practice to
> > > > > >> accentuate the differences between the two, but it’s good
> > > > > >> open source practice to find consensus between technologies,
> > > > > >> and merge communities. There is a lot of work to be done, and
> > > > > >> too few people to do it.
> > > > > >>
> > > > > >> Lastly, I know I seem to be giving mixed messages here. I do
> > > > > >> believe that content about Crail will help drive engagement
> > > > > >> and build community (controversial content even more so). I
> > > > > >> am delighted that the Crail team is writing blog posts and
> > > > > >> posting them to Twitter.
> > > > > >> But be careful not to alienate communities that could help
> > > > > >> Crail gain widespread adoption.
> > > > > >>
> > > > > >> Julian
> > > > > >>
> > > > > >> [1] http://crail.incubator.apache.org/blog/2018/08/sql-p1.html
> > > > > >>
> > > > > >> [2] https://www.usenix.org/conference/atc18/presentation/trivedi
