Hi Yibo,

Thanks a lot for taking the time to set up and run SkyhookDM. For the performance evaluation, you can use the script here [1]. It measures query execution at different row selectivities, which in turn determine how much data is moved around. Please let me know if you have any questions. Thanks again.

[1] https://github.com/JayjeetAtGithub/skyhook-nsdi/blob/master/bench.py

Best,
Jayjeet

On Mon, Jun 21, 2021 at 3:59 PM Yibo Cai <yibo....@arm.com> wrote:

> Hi Jayjeet,
>
> I've successfully validated the basic functions based on the links you provided, on both Arm64 and x86, with binaries built from your PR. Everything looks fine. From perf, I can see Arrow code running actively on the Ceph OSD nodes.
>
> Currently, I have deployed and tested on 4 VMs. For performance evaluation, I will turn to bare metal servers with NVMe drives. Do you have test cases for the benchmarking? Thanks.
>
> Yibo
>
> On 6/8/21 4:18 PM, Jayjeet Chakraborty wrote:
> > Hi Yibo,
> >
> > Thanks a lot for your interest in our work. Please refer to this [1] guide to deploy a complete environment on a cluster of nodes. Regarding your comment about a Ceph patch, the Arrow object class that we implement is actually a plugin and does not require the Ceph source tree for building or maintaining it. It only requires the rados-objclass-dev package as a dependency. We provide the CMake option ARROW_CLS to allow an optional build of the object class plugin. Please let us know if you have any questions or comments. We really appreciate you taking the time to look into our project. Thanks.
> >
> > Best,
> > Jayjeet
> >
> > [1] https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/adapters/arrow-rados-cls/docs/deploy.md
> >
> > On 2021/06/07 10:36:08, Yibo Cai <yibo....@arm.com> wrote:
> >> Hi Jayjeet,
> >>
> >> It is exciting to see a real-world computational storage solution built upon Arrow and Ceph. Amazing work!
> >>
> >> We are interested in this project (I'm from the Arm open source software team, focusing on storage and big data OSS), and would like to reproduce your work first, then evaluate performance on the Arm platform.
> >>
> >> I went through your Arrow PR, it looks great.
> >> IIUC, there should be a corresponding Ceph patch implementing the object class with Arrow.
> >>
> >> I wonder what the best approach is to deploy a complete environment for a quick evaluation. Any comments are welcome. Thanks.
> >>
> >> Yibo
> >>
> >> On 6/2/21 3:42 AM, Jayjeet Chakraborty wrote:
> >>> Dear Arrow Community,
> >>>
> >>> In our previous discussion, we planned on implementing a new Dataset API like InMemoryDataset to interact with objects containing IPC data stored in Ceph/RADOS <https://ceph.io/>. We had implemented this design and raised a PR <https://github.com/apache/arrow/pull/8647>. But when we started adding the dataset discovery functionality, we found ourselves reimplementing filesystem abstractions and their metadata management. We closed the original PR and raised a new PR <https://github.com/apache/arrow/pull/10431> where we redesigned our implementation to use the Ceph filesystem as our file I/O interface, since it provides fast metadata support via the Ceph metadata servers (MDS). We also decided to store data using one of the file formats supported by Arrow. One of our driving use cases favored Parquet.
> >>>
> >>> Since we perform the scan operation inside the storage layer using Ceph Object class <https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#:~:text=Ceph%20can%20be%20extended%20by,object%20classes%20within%20the%20tree.> methods, which need to be invoked directly on objects, we utilize the striping strategy information provided by CephFS to translate a filename in CephFS to an object ID in RADOS. To have this one-to-one mapping, we split Parquet files in a manner similar to how Spark splits Parquet files for HDFS and ensure that each fragment is backed by a single RADOS object.
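
The filename-to-object translation described above can be illustrated with a minimal sketch. It assumes the CephFS convention that a file's data objects are named with the file's inode number in hexadecimal, a dot, and an 8-digit hexadecimal object index, and that each file fits in a single object so the index is always 0. The function names are hypothetical and the code is not taken from the SkyhookDM sources.

```python
import os


def object_id_from_inode(inode: int, object_index: int = 0) -> str:
    """Build an object name of the form '<inode hex>.<index as 8 hex digits>'.

    Assumption: CephFS names data objects this way, and a file that fits
    within one object keeps all of its data at index 0.
    """
    if inode < 0 or object_index < 0:
        raise ValueError("inode and object_index must be non-negative")
    return f"{inode:x}.{object_index:08x}"


def object_id_for_file(path: str) -> str:
    """Look up a file's inode with stat() and translate it to an object name."""
    return object_id_from_inode(os.stat(path).st_ino)


if __name__ == "__main__":
    # Illustrative inode value only; a real one would come from os.stat().
    print(object_id_from_inode(0x10000000000))
```

With a one-to-one mapping like this, a client that knows only the file path can compute which object to invoke an object class method on, without asking the storage layer for a lookup.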
> >>>
> >>> We are planning a new PR in which we extend the FileFormat interface to create a RadosParquetFileFormat <https://github.com/uccross/skyhookdm-arrow/blob/arrow-master/cpp/src/arrow/dataset/file_rados_parquet.h#L129> interface that offloads Parquet file scan operations to the RADOS layer in Ceph. Since we now utilize a filesystem interface, we can just use the FileSystemDataset API and plug in our new format to offload scan operations. We have also added Python bindings for the new APIs that we implemented. In all, our patch consists of only around 3,000 LoC and introduces new dependencies only on Ceph’s librados and object class SDK (which can be disabled via CMake flags).
> >>>
> >>> We have added an architecture <https://github.com/uccross/skyhookdm-arrow/blob/rados-parquet-pr/cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md> document with our PR, which describes the overall architecture along with the life of a dataset scan using RadosParquet. Additionally, we recently wrote up a paper <https://arxiv.org/abs/2105.09894> describing our design and implementation along with some initial benchmarks. We plan to raise a PR <https://github.com/apache/arrow/pull/10431> to upstream our format to apache/arrow soon and hence look forward to your comments and thoughts on this new feature. Please let us know if you have any questions. Thank you.
> >>>
> >>> Best regards,
> >>>
> >>> Jayjeet Chakraborty
> >>>
> >>> On 2020/09/15 18:06:56, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >>>> gmock is already a dependency. We haven't upgraded gmock/gtest in a while, we might want to consider doing that (but this is orthogonal).
> >>>>
> >>>> On Tue, Sep 15, 2020 at 10:16 AM Antoine Pitrou <anto...@python.org> wrote:
> >>>>>
> >>>>> Hi Ivo,
> >>>>>
> >>>>> You can open a JIRA once you've got a PR ready. No need to do it before you think you're ready for submission.
> >>>>>
> >>>>> AFAIK, gmock is already a dependency.
> >>>>>
> >>>>> Regards
> >>>>>
> >>>>> Antoine.
> >>>>>
> >>>>> On 15/09/2020 at 18:49, Ivo Jimenez wrote:
> >>>>>> Hi again,
> >>>>>>
> >>>>>> We noticed in the contribution guidelines that there needs to be an issue in JIRA for every PR. Should we open one for the eventual PR for the work we're doing on implementing the dataset on Ceph's RADOS?
> >>>>>>
> >>>>>> Also, on a related note, we would like to mock the RADOS client so that we can integrate it in CI tests. Would it be OK to include gmock as a dependency?
> >>>>>>
> >>>>>> thanks!
> >>>>>>
> >>>>>> On 2020/09/02 22:05:51, Ivo Jimenez <ivo.jime...@gmail.com> wrote:
> >>>>>>> Hi Ben,
> >>>>>>>
> >>>>>>>>> Our main concern is that this new arrow::dataset::RadosFormat class will be deriving from the arrow::dataset::FileFormat class, which seems to raise a conceptual mismatch as there isn’t really a RADOS format
> >>>>>>>>
> >>>>>>>> IIUC RADOS doesn't interact with a filesystem directly, so RadosFileFormat would indeed be a conceptually problematic point of extension. If a RADOS file system is not viable then I think the ideal approach would be to directly implement the Fragment [1] and Dataset [2] interfaces, forgoing a FileFormat implementation altogether. Unfortunately the only example we have of this approach is InMemoryFragment, which simply wraps a vector of record batches.
> >>>>>>>>
> >>>>>>>
> >>>>>>> This is what we will go with, as this seems to be the quickest way for us to have a PoC and start experimenting with this.
> >>>>>>>
> >>>>>>> Thanks a lot for the invaluable feedback! 🙏
> >>>>>>>

--
*Jayjeet Chakraborty*
4th Year, B.Tech, Undergraduate
Department Of Computer Sc. And Engineering
National Institute Of Technology, Durgapur
PIN: 713205, West Bengal, India
M: (+91) 8436500886