Re: Data representation and conversation - translating nested hierarchies into a tabular/queriable format

Edmon Begoli Sun, 06 Sep 2015 04:17:33 -0700

Matt - that is fantastic. Having good, liberally licensed format converters
probably takes care of 50% of the problem. The other 50% will be in
figuring out the logical mapping.
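
To make "logical mapping" concrete, here is a minimal sketch of one possible
mapping: repeating the parent (header) fields once per nested item, which is
roughly the shape FLATTEN() gives you in Drill. The record layout and the
names flatten_record, order, and rows are hypothetical and used only for
illustration; this is not Drill code.

```python
from typing import Any, Dict, Iterator


def flatten_record(record: Dict[str, Any], list_field: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per element of record[list_field].

    Scalar parent fields are copied into every row; nested item keys are
    prefixed with the list field name to avoid column collisions.
    """
    # Keep only scalar parent fields (hypothetical simplification).
    parent = {
        key: value
        for key, value in record.items()
        if key != list_field and not isinstance(value, (dict, list))
    }
    for item in record.get(list_field, []):
        row = dict(parent)
        for key, value in item.items():
            row[f"{list_field}.{key}"] = value
        yield row


# Hypothetical message with header data and repeated items.
order = {
    "order_id": 17,
    "customer": "ACME",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
rows = list(flatten_record(order, "items"))
```

Each element of rows is a plain dictionary with the same columns, so it can
be treated as a table row. Deeper nesting, or several sibling lists in one
record, would need additional design choices that this sketch ignores.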


Let me think a little bit and propose how we can best set up a
collaboration platform. Any suggestions for this are welcome.

I personally like Google stuff, Hangouts, docs, and Github, of course.

On Saturday, September 5, 2015, Matthew Burgess <[email protected]> wrote:

> Edmon,
>
> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
> licensed, we have parsers/processors
> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
> for EDI / XML(StaX) / HL7 / YAML, etc. I have a plugin
> <https://github.com/mattyb149/load-text-from-file-plugin>  (also
> Apache-2.0)
> using Tika to extract metadata, this could be refactored as a Drill plugin.
>
> The (semi-)structured-to-tabular conversion will be an issue that most
> Drill extenders will have to deal with, although with powerful functions
> like KVGEN() and FLATTEN() it should be less daunting. For graphs
> (highly-structured but non-tabular data sources), I'm also looking into a
> Gremlin <http://tinkerpop.incubator.apache.org/>  plugin, which could
> connect Graph Databases with Drill. Again, the problem is representing
> non-tabular data in a SQL environment as you mentioned.
>
> Regards,
> Matt
>
> From:  Edmon Begoli <[email protected]>
> Reply-To:  <[email protected]>
> Date:  Saturday, September 5, 2015 at 8:46 PM
> To:  <[email protected]>
> Subject:  Re: Data representation and conversation - translating nested
> hierarchies into a tabular/queriable format
>
> Matt - any contribution of your time is welcome! Thank you.
>
> These problems that we are wanting to look into are not easy problems; I
> would not expect quick solutions, but any good idea, contribution of time,
> or code will help us advance the state of the capabilities.
>
> I might create a branch or separate Github repo, so that we just use its
> wiki for documentation and collaboration, and then later for scratch pad
> development.
>
> Regarding existing tools you might have - *do you think you could bring
> this code under the Apache 2 license?*
> Knowing what you told me before, I think that contributing this code would
> help advance the state of Drill's format support tremendously.
>
> I see two major challenges related to what I am proposing:
>
> 1. (greater challenge) How to bring heterogeneously structured data
> logically and semantically into the tabular orientation of a typical SQL
> query processing engine.
> I think that some problems will not be completely implementable, so we'll
> need to either approximate or make some limiting/bounding design choices.
>
> 2. How to support these new formats through the Drill API. This is more of
> just an API study, design, and programming effort. Nothing contradictory.
>
> Edmon
>
>
>
>
> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <[email protected]> wrote:
>
> >  Challenge accepted! :) are we talking about things like XML, Jsonnet,
> >  Yaml, etc.? And/or binary file formats that are (semi-)structured in
> >  nature like XLSX?
> >
> >  If we want to go more unstructured we could look at Apache Tika to at
> >  least pull out metadata on things like image and video files, and I'm
> >  tinkering with the idea of a UDF called topics() for human-generated
> >  text using Apache OpenNLP, the problem being a well-trained model for
> >  the target data.
> >
> >  Edmon, I admire your ambition and would like to help out where/when I
> >  can. Having said that, so far my amount of available time for Drill has
> >  been embarrassingly lower than my amount of interest.
> >
> >  For well-known file formats, I may be able to help with some of our
> >  open-source tools for parsing such files.
> >
> >  Regards,
> >  Matt
> >
> >  Sent from my iPhone
> >
> >>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <[email protected]> wrote:
> >>  >
> >>  > Anyone else from the Drill team wholeheartedly invited.
> >>  >
> >>  > Edmon
> >>  >
> >>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <[email protected]> wrote:
> >>>  >>
> >>>  >> Let's do it, Ted. I think it would add tremendous value to Drill as a
> >>>  >> solution.
> >>>  >>
> >>>  >> I will start a Google doc and share with you so we can share ideas,
> >>>  >> have Hangouts, design, etc. until we have something solid to put into
> >>>  >> Drill proper.
> >>>  >>
> >>>  >> If you have any other suggestion for the mode of collaboration please
> >>>  >> let me know.
> >>>  >>
> >>>>  >>> On Saturday, September 5, 2015, Ted Dunning <[email protected]> wrote:
> >>>>  >>>
> >>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <[email protected]> wrote:
> >>>>>  >>>>
> >>>>>  >>>> *My question - has this been handled already in Drill and storage
> >>>>>  >>>> formats?*
> >>>>>  >>>>
> >>>>>  >>>> If so, where?
> >>>>>  >>>>
> >>>>>  >>>> If not, what is your recommendation for handling this?
> >>>>>  >>>>
> >>>>>  >>>> Should it be in an independent library outside of Drill that
> >>>>>  >>>> presents a flattened version (not sure if this is possible), or maybe
> >>>>>  >>>> break the message into tables corresponding to header data, items,
> >>>>>  >>>> footer.
> >>>>  >>>
> >>>>  >>> Drill does handle these kinds of data well, but currently the only
> >>>>  >>> file formats that it can consume for this kind of data are JSON and
> >>>>  >>> Parquet.
> >>>>  >>>
> >>>>  >>> It would be great to have more.  I would love to work on this with
> >>>>  >>> you.
> >>>  >>
> >
>
>
>
>
