Hey Stefan,

It is possible that this is the case. A quick look at the code suggests that the Avro reader does not override the default behavior for estimating the approximate row count of files. I believe there is still a small issue in the code that handles tiny files. Are the files you are dealing with at least a few megabytes?
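If it helps to answer the file-size question, here is a minimal sketch that lists the Avro files under a directory that fall below a size threshold. It assumes the data sits on a locally mounted filesystem; the `.avro` suffix and the 2 MB default are placeholders for illustration, not values taken from Drill.

```python
import os


def small_files(root, min_bytes=2 * 1024 * 1024, suffix=".avro"):
    """Return sorted (path, size) pairs for files under root smaller than min_bytes.

    The threshold and suffix are illustrative assumptions, not Drill settings.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue  # skip anything that is not an Avro data file
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size < min_bytes:
                found.append((path, size))
    return sorted(found)
```

Calling `small_files("/path/to/transactions")` (a hypothetical path) returns the files smaller than the threshold, which should show quickly whether the directory is dominated by tiny files.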

Can you see how many minor fragments are listed under the scan operation in the query profile? If there are multiple fragments then the scan is parallelized.

- Jason

On Mon, Feb 29, 2016 at 1:58 PM, Stefán Baxter <[email protected]> wrote:

> Hi Jason,
>
> Is it possible that the Avro plugin does not use any parallelism and that
> all the target files are scanned sequentially by the same process? (1.5)
>
> - Stefán
>
> On Fri, Feb 26, 2016 at 8:04 PM, Stefán Baxter <[email protected]> wrote:
>
> > Thank you Jason.
> >
> > I do realize that this is an OS project and that everyone is doing their
> > best.
> >
> > There are just a few things I wish I had realized before switching over
> > from JSON to Avro that have caused us a lot of problems and taken a long
> > time.
> >
> > Your work is appreciated and I apologize for letting my frustration get
> > the better of me.
> >
> > - Stefán
> >
> > On Fri, Feb 26, 2016 at 8:00 PM, Jason Altekruse <[email protected]> wrote:
> >
> >> Stefan,
> >>
> >> I'm sorry that we have not been better about getting back to the issues
> >> you have filed against the Avro reader. We do appreciate all of the
> >> effort you have put into filing thorough bugs and being active in the
> >> discussions on the list. I have responded on the bug you filed on this
> >> issue [1] with a workaround and will be posting a patch shortly with a
> >> fix.
> >>
> >> - Jason <https://issues.apache.org/jira/browse/DRILL-4120>
> >>
> >> [1] - https://issues.apache.org/jira/browse/DRILL-4441
> >> <https://issues.apache.org/jira/browse/DRILL-4120>
> >>
> >> On Thu, Feb 25, 2016 at 12:29 PM, Stefán Baxter <[email protected]> wrote:
> >>
> >> > Hi,
> >> >
> >> > This query targets Avro files in the latest 1.5 release:
> >> >
> >> > 0: jdbc:drill:zk=local> select count(*) from
> >> > dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to =
> >> > 'Customer/4-2492847';
> >> > +---------+
> >> > | EXPR$0  |
> >> > +---------+
> >> > | 5788    |
> >> > +---------+
> >> >
> >> > 0: jdbc:drill:zk=local> select count(*) from
> >> > dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to IN
> >> > ('Customer/4-2492847');
> >> > +---------+
> >> > | EXPR$0  |
> >> > +---------+
> >> > | 0       |
> >> > +---------+
> >> >
> >> > It shows that the IN operator does not work with Avro (works with
> >> > Parquet).
> >> >
> >> > This finally tips us over. We have invested hundreds of hours moving
> >> > all streaming/fresh data from JSON to Avro but the Avro part of Drill
> >> > is broken in too many ways to recommend its use to anyone.
> >> >
> >> > Attempts to report Avro errors and shortcomings, like the missing
> >> > support for dirX, has had no results.
> >> >
> >> > I think it would be prudent to warn people on the Drill website that
> >> > the Avro support is experimental, at best
> >> >
> >> > - Stefán Baxter
