Yes, I think text files are OK, but I want to make sure that committers are reviewing patches for binary files, because there have been a number of incidents in the past where I had to roll back patches to remove such files.

On Tue, Jul 23, 2019, 10:37 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Wes,
> I haven't checked locally but that file at least for me renders as text file in GitHub (with an Apache header). If we want all test data in the testing package I can make sure to move it but I thought text files might be ok in the main repo?
>
> Thanks,
> Micah
>
> On Tuesday, July 23, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> I noticed that test data-related files are beginning to be checked in
>>
>> https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/resources/schema/test.avsc
>>
>> I wanted to make sure this doesn't turn into a slippery slope where we end up with several megabytes or more of test data files
>>
>> On Mon, Jul 22, 2019 at 11:39 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Wes,
>> > Are there currently files that need to be moved?
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Monday, July 22, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> Sort of tangentially related, but while we are on the topic:
>> >>
>> >> Please, if you would, avoid checking binary test data files into the main repository. Use https://github.com/apache/arrow-testing if you truly need to check in binary data -- something to look out for in code reviews
>> >>
>> >> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >
>> >> > Hi Jacques,
>> >> > Thanks for the clarifications. I think the distinction is useful.
>> >> >
>> >> > > If people want to write adapters for Arrow, I see that as useful but very different than writing native implementations and we should try to create a clear delineation between the two.
>> >> >
>> >> > What do you think about creating a "contrib" directory and moving the JDBC and AVRO adapters into it?
>> >> > We should also probably provide more description in pom.xml to make it clear for downstream consumers.
>> >> >
>> >> > We should probably come up with a name other than adapters for readers/writers ("converters"?) and use it in the directory structure for the existing Orc implementation?
>> >> >
>> >> > Thanks,
>> >> > Micah
>> >> >
>> >> >
>> >> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> >> >
>> >> > > As I read through your responses, I think it might be useful to talk about adapters versus native Arrow readers/writers. Adapters are something that adapt an existing API to produce and/or consume Arrow data. A native reader/writer is something that understands the format directly and does not have intermediate representations or APIs the data moves through beyond those that need to be used to complete work.
>> >> > >
>> >> > > If people want to write adapters for Arrow, I see that as useful but very different than writing native implementations and we should try to create a clear delineation between the two.
>> >> > >
>> >> > > Further comments inline.
>> >> > >
>> >> > >
>> >> > >> Could you expand on what level of detail you would like to see a design document?
>> >> > >>
>> >> > >
>> >> > > A couple paragraphs seems sufficient. These are the goals of the implementation. We target existing functionality X. It is an adapter. Or it is a native impl. These are the expected memory and processing characteristics, etc. I've never been one for huge amounts of design but I've seen a number of recent patches appear where there is no upfront discussion. Making sure that multiple people buy into a design is the best way to ensure long-term maintenance and use.
>> >> > >
>> >> > >
>> >> > >> I think this should be optional (the same arguments below about predicates apply so I won't repeat them).
>> >> > >>
>> >> > >
>> >> > > Per my comments above, maybe adapter versus native reader clarifies things. For example, I've been working on a native avro read implementation. It is little more than chicken scratch at this point but its goals, vision and design are very different than the adapter that is being produced atm.
>> >> > >
>> >> > >
>> >> > >> Can you clarify the intent of this objective? Is it mainly to tie in with the existing Java arrow memory book keeping? Performance? Something else?
>> >> > >>
>> >> > >
>> >> > > Arrow is designed to be off-heap. If you have large variable amounts of on-heap memory in an application, it starts to make it very hard to make decisions about off-heap versus on-heap memory since those divisions are by and large static in nature. It's fine for short lived applications but for long lived applications, if you're working with a large amount of data, you want to keep most of your memory in one pool. In the context of Arrow, this is going to naturally be off-heap memory.
>> >> > >
>> >> > >
>> >> > >> I'm afraid this might lead to a "perfect is the enemy of the good" situation. Starting off with a known good implementation of conversion to Arrow can allow us to both profile hot-spots and provide a comparison of implementations to verify correctness.
>> >> > >>
>> >> > >
>> >> > > I'm not clear what message we're sending as a community if we produce low performance components. The whole point of Arrow is to increase performance, not decrease it. I'm targeting good, not perfect.
>> >> > > At the same time, from my perspective, Arrow development should not be approached in the same way that general Java app development should be. If we hold a high standard, we'll have fewer total integrations initially but I think we'll solve more real world problems.
>> >> > >
>> >> > >> There is also the question of how widely adoptable we want Arrow libraries to be.
>> >> > >> It isn't surprising to me that Impala's Avro reader is an order of magnitude faster than the stock Java one. As far as I know Impala's is a C++ implementation that does JIT with LLVM. We could try to use it as a basis for converting to Arrow but I think this might limit adoption in some circumstances. Some organizations/people might be hesitant to adopt the technology due to:
>> >> > >> 1. Use of JNI.
>> >> > >> 2. Use of LLVM to do JIT.
>> >> > >>
>> >> > >> It seems that as long as we have a reasonably general interface to data-sources we should be able to optimize/refactor aggressively when needed.
>> >> > >>
>> >> > >
>> >> > > This is somewhat the crux of the problem. It goes a little bit to who our consuming audience is and what we're trying to deliver. I'll also say that trying to build a high-quality implementation on top of a low-quality implementation or library-based adapter is worse than starting from scratch. I believe this is especially true in Java where developers are trained to trust hotspot and that things will be good enough. That is great in a web app but not in systems software where we (and I expect others) will deploy Arrow.
>> >> > >
>> >> > >
>> >> > >> > 3. Propose a generalized "reader" interface as opposed to making each reader have a different way to package/integrate.
>> >> > >>
>> >> > >> This also seems like a good idea. Is this something you were thinking of doing or just a proposal that someone in the community should take up before we get too many more implementations?
>> >> > >>
>> >> > >
>> >> > > I don't have something in mind and didn't have a plan to build something, just want to make sure we start getting consistent early as opposed to once we have a bunch of readers/adapters.
>> >> > >