Amazing to see Paul’s welcome in Chinese! I am also glad to hear about Wang
Liang’s use case for Drill, and you are welcome to contribute it as a Drill
storage plugin.

On Tue, Jul 9, 2019 at 1:00 AM Paul Rogers <[email protected]>
wrote:

> 王亮 你好,
>
>
> Very creative use of Drill! We usually think of Drill as a tool for "big
> data" distributed file systems such as HDFS, MFS and S3. IPFS seems to be
> for storing web content. I like how you've shown that IPFS is, in fact, a
> distributed file system, and made Drill work in this context.
>
> Perhaps data scientists might benefit from Minerva: instead of everyone
> downloading large data sets and doing queries locally, a data scientist
> could instead query the data where it lives on the web. Such a feature
> would be especially useful if the data changes over time.
>
> As Charles mentioned, it would be great if you could offer Minerva changes
> to the Drill project. Most extensions live within the Drill project itself,
> typically in the "contrib" module.
>
> The other choice would be for Minerva to be a separate project or repo
> that can be integrated with Drill. We have often talked about creating a
> true plugin architecture to support such a model, but gaps remain. Minerva
> might be a good reason to fix the gaps.
>
> Thanks,
> - Paul
>
>
>
>     On Saturday, July 6, 2019, 02:31:27 AM PDT, 王亮 <[email protected]>
> wrote:
>
>  Hi all,
>
> After reading that excellent book "Learning Apache Drill: Query and Analyze
> Distributed Data Sources with SQL", my classmate and I also wanted to write
> a Drill storage plugin. We found that most DFS and NFS are already supported by
> Drill, so we chose a relatively new and promising distributed file system,
> IPFS.
>
> So we built Minerva, a Drill storage plugin that connects IPFS's
> decentralized storage and Drill's flexible query engine. Any data file
> stored on IPFS can be easily accessed from Drill's query interface, just
> like a file stored on a local disk. The basic idea is very simple: run a
> Drill instance alongside the IPFS daemon, and you can connect to other users on
> IPFS who are also using Minerva. If one of those users happens to have stored
> the file you are trying to query, Drill can send the execution plan to
> that node, which executes the operations locally and returns the results.
> Of course, other users can benefit from your node as well, if you are
> sharing the data they want. If there are enough people running Minerva,
> data sharing and querying can be made distributed and more efficient!
>
> The query process is as follows:
> 0. The user inputs an SQL statement, referencing a file on IPFS by its CID;
> 1. The Foreman resolves the CIDs of the "pieces" of the data file, as well
> as the IPFS providers of these pieces, by querying the DHT of IPFS;
> 2. The Foreman distributes jobs to drillbits running on the providers;
> 3. Drillbits on the providers read data from the pieces of the file on their
> local disk, perform any necessary relational operations, and return results
> to the Foreman;
> 4. The Foreman returns the results to the user.
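
As an inline note, the steps above can be sketched as a small in-memory
simulation. This is only an illustration of the flow, not Minerva's or Drill's
actual code: the CIDs, peer names, tables and function names below are all
hypothetical, and the real DHT lookup and drillbit RPC are replaced by plain
dictionaries.

```python
from collections import defaultdict

# Hypothetical stand-ins for IPFS state; none of these names come from Minerva.
# File CID -> CIDs of its pieces (placeholder strings, not real CIDs).
PIECES = {"QmFile": ["QmPieceA", "QmPieceB", "QmPieceC"]}
# Piece CID -> peers providing it (what a DHT provider lookup would return).
PROVIDERS = {
    "QmPieceA": ["peer1"],
    "QmPieceB": ["peer2", "peer1"],
    "QmPieceC": ["peer2"],
}
# Rows each peer holds on its local disk, keyed by piece CID.
LOCAL_DATA = {
    "peer1": {
        "QmPieceA": [{"id": 1, "v": 10}, {"id": 2, "v": 25}],
        "QmPieceB": [{"id": 3, "v": 40}],
    },
    "peer2": {
        "QmPieceB": [{"id": 3, "v": 40}],
        "QmPieceC": [{"id": 4, "v": 5}, {"id": 5, "v": 30}],
    },
}


def resolve(file_cid):
    """Step 1: map each piece of the file to one provider (the first listed)."""
    return {piece: PROVIDERS[piece][0] for piece in PIECES[file_cid]}


def plan(assignment):
    """Step 2: group pieces by provider so each provider receives one job."""
    jobs = defaultdict(list)
    for piece, peer in assignment.items():
        jobs[peer].append(piece)
    return dict(jobs)


def run_on_provider(peer, pieces, predicate):
    """Step 3: a provider scans its local pieces and applies the filter."""
    return [
        row
        for piece in pieces
        for row in LOCAL_DATA[peer][piece]
        if predicate(row)
    ]


def query(file_cid, predicate):
    """Steps 0 to 4 from the Foreman's point of view."""
    jobs = plan(resolve(file_cid))
    results = []
    for peer, pieces in jobs.items():
        results.extend(run_on_provider(peer, pieces, predicate))
    # Step 4: merge the partial results and hand them back to the user.
    return sorted(results, key=lambda row: row["id"])


rows = query("QmFile", lambda row: row["v"] >= 25)
```

Here the filter runs where each piece is stored, so only matching rows travel
back to the Foreman. Picking the first listed provider for a piece is an
arbitrary choice made for the sketch.
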
>
> Thanks to the modular design of Drill, we could write this storage plugin
> rather "easily". The plugin currently supports basic query operations, both
> read and write, but only works with JSON and CSV files. It is not very stable
> yet, and performance is still poor, mainly because DHT queries on IPFS take
> too long. We plan to address these problems in the future.
>
> If you are interested, we have made a few slides that explain the ideas in
> detail:
> https://www.slideshare.net/BowenDing4/minerva-ipfs-storage-plugin-for-ipfs
>
> Any suggestion is welcome. ^_^
>
> Find the code on GitHub: https://github.com/bdchain/Minerva
>
> Best,
> Wang Liang
>
