Matt - that is fantastic. Having good, liberally licensed format converters probably takes care of 50% of the problem. The other 50% will be in figuring out the logical mapping.
Let me think a little bit and propose how we can best set up a collaboration platform. Any suggestions for this are welcome. I personally like Google stuff, Hangouts, Docs, and Github, of course.

On Saturday, September 5, 2015, Matthew Burgess <[email protected]> wrote:

> Edmon,
>
> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
> licensed, we have parsers/processors
> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
> for EDI / XML(StaX) / HL7 / YAML, etc. I have a plugin
> <https://github.com/mattyb149/load-text-from-file-plugin> (also Apache-2.0)
> using Tika to extract metadata, this could be refactored as a Drill plugin.
>
> The (semi-)structured-to-tabular conversion will be an issue that most Drill
> extenders will have to deal with, although with powerful functions like
> KVGEN() and FLATTEN() it should be less daunting. For graphs
> (highly-structured but non-tabular data sources), I'm also looking into a
> Gremlin <http://tinkerpop.incubator.apache.org/> plugin, which could
> connect Graph Databases with Drill. Again, the problem is representing
> non-tabular data in a SQL environment as you mentioned.
>
> Regards,
> Matt
>
> From: Edmon Begoli <[email protected]>
> Reply-To: <[email protected]>
> Date: Saturday, September 5, 2015 at 8:46 PM
> To: <[email protected]>
> Subject: Re: Data representation and conversation - translating nested
> hierarchies into a tabular/queriable format
>
> Matt - any contribution of your time is welcome! Thank you.
>
> These problems that we are wanting to look into are not easy problems; I
> would not expect quick solutions, but any good idea, contribution of time,
> or code will help us advance the state of the capabilities.
>
> I might create a branch or separate Github repo, so that we just use its
> wiki for documentation and collaboration, and then later for scratch pad
> development.
>
> Regarding existing tools you might have - *do you think you could bring
> this code under the Apache 2 license?*
> Knowing what you told me before, I think that contributing this code would
> help advance the state of Drill's format support tremendously.
>
> I see two major challenges related to what I am proposing:
>
> 1. (greater challenge) How to bring heterogeneously structured data
> logically and semantically into the tabular orientation of a typical SQL
> query processing engine.
> I think that some problems will not be completely implementable, so we'll
> need to either approximate or make some limiting/bounding design choices.
>
> 2. How to support these new formats through the Drill API. This is more of
> just an API study, design and programming effort. Nothing contradictory.
>
> Edmon
>
> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <[email protected]> wrote:
>
> > Challenge accepted! :) are we talking about things like XML, Jsonnet,
> > Yaml, etc.? And/or binary file formats that are (semi-)structured in
> > nature like XLSX?
> >
> > If we want to go more unstructured we could look at Apache Tika to at
> > least pull out metadata on things like image and video files, and I'm
> > tinkering with the idea of a UDF called topics() for human-generated text
> > using Apache OpenNLP, the problem being a well-trained model for the
> > target data.
> >
> > Edmon, I admire your ambition and would like to help out where/when I can.
> > Having said that, so far my amount of available time for Drill has been
> > embarrassingly lower than my amount of interest.
> >
> > For well-known file formats, I may be able to help with some of our
> > open-source tools for parsing such files.
> >
> > Regards,
> > Matt
> >
> > Sent from my iPhone
> >
> > > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <[email protected]> wrote:
> > >
> > > Anyone else from the Drill team wholeheartedly invited.
> > >
> > > Edmon
> > >
> > > > On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <[email protected]> wrote:
> > > >
> > > > Let's do it, Ted. I think it would add tremendous value to Drill as a
> > > > solution.
> > > >
> > > > I will start a Google doc and share with you so we can share ideas,
> > > > have Hangouts, design, etc. until we have something solid to put into
> > > > Drill proper.
> > > >
> > > > If you have any other suggestion for the mode of collaboration please
> > > > let me know.
> > > >
> > > > > On Saturday, September 5, 2015, Ted Dunning <[email protected]> wrote:
> > > > >
> > > > > > On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <[email protected]> wrote:
> > > > > >
> > > > > > *My question - has this been handled already in Drill and storage
> > > > > > formats?*
> > > > > >
> > > > > > If so, where?
> > > > > >
> > > > > > If not, what is your recommendation for handling this?
> > > > > >
> > > > > > Should it be in an independent library outside of Drill that
> > > > > > presents a flattened version (not sure if this is possible), or
> > > > > > maybe break the message into tables corresponding to header data,
> > > > > > items, footer.
> > > > >
> > > > > Drill does handle these kinds of data well, but currently the only
> > > > > file formats that it can consume for this kind of data are JSON and
> > > > > Parquet.
> > > > >
> > > > > It would be great to have more. I would love to work on this with you.
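
To make the "header data, items, footer" idea from the original question a bit more concrete, here is a minimal Python sketch of one possible structured-to-tabular mapping: one output row per item, with the header and footer fields repeated on every row. This is only an illustration of the concept discussed in this thread, not how Drill implements FLATTEN() or KVGEN(). The message layout, the field names, and the helpers flatten_scalars and to_rows are all hypothetical, and dropping nested lists other than the items list is exactly the kind of limiting design choice mentioned above.

```python
from typing import Any, Dict, List


def flatten_scalars(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into dotted column names.

    Assumption for this sketch: list values are skipped here, so any
    repeated structure other than the one chosen in to_rows is lost.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_scalars(value, prefix=f"{name}."))
        elif not isinstance(value, list):
            flat[name] = value
    return flat


def to_rows(message: Dict[str, Any], list_key: str = "items") -> List[Dict[str, Any]]:
    """Return one flat row per element of message[list_key].

    Every row repeats the non-list fields of the message (header, footer).
    """
    parent = flatten_scalars({k: v for k, v in message.items() if k != list_key})
    rows: List[Dict[str, Any]] = []
    for item in message.get(list_key, []):
        row = dict(parent)
        row.update(flatten_scalars(item, prefix=f"{list_key}."))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical message with a header, repeated items, and a footer.
    example = {
        "header": {"id": "A1", "sender": "X"},
        "items": [{"sku": "p1", "qty": 2}, {"sku": "p2", "qty": 1}],
        "footer": {"total": 3},
    }
    for row in to_rows(example):
        print(row)
```

The alternative raised in the question, separate header, items, and footer tables joined on a message id, avoids repeating the header on every row but pushes the join onto the person writing the query.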
