Hi! This pull request fixes a problem with FLATTEN on nested Avro records (arrays of records). See the forwarded user-list thread below and the issue https://issues.apache.org/jira/browse/DRILL-4574 for details.
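For anyone skimming, here is a minimal plain-Python sketch of the semantics FLATTEN is expected to have on the sample data quoted in the forwarded thread. It is not Drill code: the function name `flatten` and the variable names are illustrative assumptions, and it only models "one output row per array element".

```python
# Plain-Python model of FLATTEN semantics; illustrative only, not Drill code.
def flatten(rows, column):
    """Yield one output row per element of the array stored in `column`."""
    for row in rows:
        for element in row[column]:
            yield element


# Two records shaped like the sample data in the thread:
# each has an "elements" array holding ten records with field1 = 0..9.
records = [{"elements": [{"field1": i} for i in range(10)]} for _ in range(2)]

flattened = list(flatten(records, "elements"))
print(len(flattened))  # 20
```

The thread reports two rows, both {"field1":9}, where this model gives twenty rows running from {"field1":0} to {"field1":9} twice.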
I would love to get some feedback!

Johannes

https://github.com/apache/drill/pull/459

---------- Forwarded message ----------
From: Johannes Schulte <[email protected]>
Date: Tue, Apr 12, 2016 at 11:33 PM
Subject: Re: Reading Avro Arrays
To: [email protected]

After some evenings of digging into the code, I more or less had a lucky
moment and was able to fix the problem. I wonder why nobody else ran into
this until now; for me it was a blocker to Drill adoption. I hope that
somebody with more knowledge of the codebase can review this and integrate
it soon.

On Sun, Apr 3, 2016 at 11:29 AM, Johannes Schulte <[email protected]> wrote:

> Alright, thanks! I created a pull request and am very open to any input.
>
> https://github.com/apache/drill/pull/459
>
> Cheers,
>
> Johannes
>
> On Sun, Apr 3, 2016 at 9:10 AM, Abdel Hakim Deneche <[email protected]>
> wrote:
>
>> Pull requests are fine. You still need a JIRA though.
>>
>> On Sun, Apr 3, 2016 at 8:03 AM, Johannes Schulte <[email protected]>
>> wrote:
>>
>> > I have now extended the AvroFormatTest suite with two unit tests that
>> > show that
>> >
>> > * Flattening of primitive arrays works as expected
>> > * Flattening of arrays of records does not work properly
>> >
>> > I spent some time trying to find the reason, but it's my first contact
>> > with the Drill codebase.
>> >
>> > Is the recommended way of making these unit tests available still to
>> > attach a patch to an issue, or is a pull request also an option?
>> >
>> > In the context of the recent Avro maturity discussion I would love to
>> > fix this error myself, but I would need some hints about what goes
>> > wrong there internally.
>> >
>> > Johannes
>> >
>> > On Fri, Mar 25, 2016 at 10:50 PM, Johannes Schulte <[email protected]>
>> > wrote:
>> >
>> > > Hi Stefan, hi Jacques, thanks for going after this. I almost gave up,
>> > > but now I think that was because I accessed the data over JDBC with
>> > > SQuirreL and got confused by the unknown-type column there.
>> > > Nonetheless, if the schema looks like this:
>> > >
>> > > {
>> > >   "type" : "record",
>> > >   "name" : "MainRecord",
>> > >   "namespace" : "drizz.WriteAvroTestFileForDrill$",
>> > >   "fields" : [ {
>> > >     "name" : "elements",
>> > >     "type" : {
>> > >       "type" : "array",
>> > >       "items" : {
>> > >         "type" : "record",
>> > >         "name" : "NestedRecord",
>> > >         "fields" : [ {
>> > >           "name" : "field1",
>> > >           "type" : "int"
>> > >         } ]
>> > >       },
>> > >       "java-class" : "java.util.List"
>> > >     }
>> > >   } ]
>> > > }
>> > >
>> > > and the contents look like this (according to the avro tojson command
>> > > line utility)
>> > >
>> > > {"elements":[{"field1":0},{"field1":1},{"field1":2},{"field1":3},{"field1":4},{"field1":5},{"field1":6},{"field1":7},{"field1":8},{"field1":9}]}
>> > > {"elements":[{"field1":0},{"field1":1},{"field1":2},{"field1":3},{"field1":4},{"field1":5},{"field1":6},{"field1":7},{"field1":8},{"field1":9}]}
>> > >
>> > > a query like
>> > >
>> > > select flatten(elements) from
>> > > dfs.`/Users/j.schulte/data/avro-drill/no-union/`;
>> > >
>> > > yields exactly two rows:
>> > > +---------------+
>> > > | EXPR$0        |
>> > > +---------------+
>> > > | {"field1":9}  |
>> > > | {"field1":9}  |
>> > > +---------------+
>> > >
>> > > as if only the last element of each array survived.
>> > >
>> > > Thanks for your help so far.
>> > >
>> > > On Fri, Mar 25, 2016 at 5:45 PM, Stefán Baxter <[email protected]>
>> > > wrote:
>> > >
>> > >> Johannes, Jacques is right.
>> > >>
>> > >> I only tested the flattening of maps and not the flattening of
>> > >> list-of-maps.
>> > >>
>> > >> -Stefan
>> > >>
>> > >> On Fri, Mar 25, 2016 at 4:12 PM, Jacques Nadeau <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > I think there is some incorrect information and confusion in this
>> > >> > thread. Could you please share a piece of sample data and a
>> > >> > specific query? The error message shown in your original email
>> > >> > suggests that you were trying to flatten a map rather than an
>> > >> > array of maps. Flatten is for arrays only. The arrays can have
>> > >> > scalars or complex objects in them.
>> > >> >
>> > >> > --
>> > >> > Jacques Nadeau
>> > >> > CTO and Co-Founder, Dremio
>> > >> >
>> > >> > On Fri, Mar 25, 2016 at 2:00 AM, Johannes Schulte
>> > >> > <[email protected]> wrote:
>> > >> >
>> > >> > > Hi Stefan,
>> > >> > >
>> > >> > > thanks for this information. So it seems that there is currently
>> > >> > > no way of accessing nested rich objects with Drill; I somehow
>> > >> > > got that wrong from the documentation...
>> > >> > >
>> > >> > > Cheers,
>> > >> > > Johannes
>> > >> > >
>> > >> > > On Thu, Mar 24, 2016 at 2:14 PM, Stefán Baxter
>> > >> > > <[email protected]> wrote:
>> > >> > >
>> > >> > > > FYI: flattening of embedded structures is not supported in
>> > >> > > > Parquet either.
>> > >> > > >
>> > >> > > > Regards,
>> > >> > > > -Stefan
>> > >> > > >
>> > >> > > > On Wed, Mar 23, 2016 at 8:51 PM, Johannes Schulte
>> > >> > > > <[email protected]> wrote:
>> > >> > > >
>> > >> > > > > Hi Stefan,
>> > >> > > > >
>> > >> > > > > thanks for your response and the link to your UDF
>> > >> > > > > repository, it's a good reference. I tried Drill 1.6; the
>> > >> > > > > data is an array of complex objects though. I will try to
>> > >> > > > > set up a Drill dev environment and see if I can modify the
>> > >> > > > > tests to fail.
>> > >> > > > >
>> > >> > > > > Johannes
>> > >> > > > >
>> > >> > > > > On Wed, Mar 23, 2016 at 8:13 PM, Stefán Baxter
>> > >> > > > > <[email protected]> wrote:
>> > >> > > > >
>> > >> > > > > > FYI, this seems to be working in 1.6, at least on the
>> > >> > > > > > Avro data that we have.
>> > >> > > > > >
>> > >> > > > > > On Wed, Mar 23, 2016 at 6:59 PM, Stefán Baxter
>> > >> > > > > > <[email protected]> wrote:
>> > >> > > > > >
>> > >> > > > > > > Hi again,
>> > >> > > > > > >
>> > >> > > > > > > What version of Drill are you using?
>> > >> > > > > > >
>> > >> > > > > > > Regards,
>> > >> > > > > > > - Stefán
>> > >> > > > > > >
>> > >> > > > > > > On Wed, Mar 23, 2016 at 4:49 PM, Stefán Baxter
>> > >> > > > > > > <[email protected]> wrote:
>> > >> > > > > > >
>> > >> > > > > > >> Hi Johannes,
>> > >> > > > > > >>
>> > >> > > > > > >> As great as Drill is, the Avro plugin has been a
>> > >> > > > > > >> source of frustration for us @activitystream.
>> > >> > > > > > >>
>> > >> > > > > > >> We have a small UDF library [1] (Apache-licensed)
>> > >> > > > > > >> which contains a function that can return an array
>> > >> > > > > > >> (List<String>) from Avro as a CSV list.
>> > >> > > > > > >>
>> > >> > > > > > >> You could use that to roll your own, or provide me
>> > >> > > > > > >> with a small sample and I can create a custom
>> > >> > > > > > >> flatten function for you.
>> > >> > > > > > >>
>> > >> > > > > > >> The best would be to wait for a fix, but this can
>> > >> > > > > > >> potentially get you out of a rough spot.
>> > >> > > > > > >>
>> > >> > > > > > >> [1] https://github.com/activitystream/asdrill
>> > >> > > > > > >>
>> > >> > > > > > >> Regards,
>> > >> > > > > > >> -Stefán
>> > >> > > > > > >>
>> > >> > > > > > >> On Wed, Mar 23, 2016 at 9:05 AM, Johannes Schulte
>> > >> > > > > > >> <[email protected]> wrote:
>> > >> > > > > > >>
>> > >> > > > > > >>> Hi,
>> > >> > > > > > >>>
>> > >> > > > > > >>> when trying to read simple Avro arrays with select
>> > >> > > > > > >>> flatten(array) from dfs... I get the exception
>> > >> > > > > > >>>
>> > >> > > > > > >>> SQL Query Error: SYSTEM ERROR: ClassCastException: Cannot cast
>> > >> > > > > > >>> org.apache.drill.exec.vector.complex.MapVector to
>> > >> > > > > > >>> org.apache.drill.exec.vector.complex.RepeatedValueVector
>> > >> > > > > > >>> ^
>> > >> > > > > > >>>
>> > >> > > > > > >>> The type of the array is said to be <UnknownType (2,002)>
>> > >> > > > > > >>>
>> > >> > > > > > >>> Is this the expected behaviour? The documentation
>> > >> > > > > > >>> mostly talks about JSON and Parquet complex types,
>> > >> > > > > > >>> and I wonder if the Avro storage plugin behaves
>> > >> > > > > > >>> differently.
>> > >> > > > > > >>>
>> > >> > > > > > >>> Thanks,
>> > >> > > > > > >>>
>> > >> > > > > > >>> Johannes
>>
>> --
>> Abdelhakim Deneche
>> Software Engineer
>> <http://www.mapr.com/>
