Hi Wes,

Are there currently files that need to be moved?

Thanks,
Micah

On Monday, July 22, 2019, Wes McKinney <wesmck...@gmail.com> wrote:

> Sort of tangentially related, but while we are on the topic:
>
> Please, if you would, avoid checking binary test data files into the
> main repository. Use https://github.com/apache/arrow-testing if you
> truly need to check in binary data -- something to look out for in
> code reviews
>
> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Hi Jacques,
> > Thanks for the clarifications. I think the distinction is useful.
> >
> > > If people want to write adapters for Arrow, I see that as useful
> > > but very different than writing native implementations and we
> > > should try to create a clear delineation between the two.
> >
> > What do you think about creating a "contrib" directory and moving
> > the JDBC and AVRO adapters into it? We should also probably provide
> > more description in pom.xml to make it clear for downstream
> > consumers.
> >
> > We should probably come up with a name other than adapters for
> > readers/writers ("converters"?) and use it in the directory
> > structure for the existing Orc implementation?
> >
> > Thanks,
> > Micah
> >
> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > > As I read through your responses, I think it might be useful to
> > > talk about adapters versus native Arrow readers/writers. Adapters
> > > are something that adapt an existing API to produce and/or consume
> > > Arrow data. A native reader/writer is something that understands
> > > the format directly and does not have intermediate representations
> > > or APIs the data moves through beyond those that need to be used
> > > to complete work.
> > >
> > > If people want to write adapters for Arrow, I see that as useful
> > > but very different than writing native implementations and we
> > > should try to create a clear delineation between the two.
> > >
> > > Further comments inline.
> > >
> > >> Could you expand on what level of detail you would like to see
> > >> in a design document?
> > >
> > > A couple of paragraphs seems sufficient. These are the goals of
> > > the implementation. We target existing functionality X. It is an
> > > adapter. Or it is a native impl. These are the expected memory and
> > > processing characteristics, etc. I've never been one for huge
> > > amounts of design but I've seen a number of recent patches appear
> > > where there is no upfront discussion. Making sure that multiple
> > > people buy into a design is the best way to ensure long-term
> > > maintenance and use.
> > >
> > >> I think this should be optional (the same arguments below about
> > >> predicates apply so I won't repeat them).
> > >
> > > Per my comments above, maybe adapter versus native reader
> > > clarifies things. For example, I've been working on a native avro
> > > read implementation. It is little more than chicken scratch at
> > > this point but its goals, vision and design are very different
> > > than the adapter that is being produced atm.
> > >
> > >> Can you clarify the intent of this objective? Is it mainly to
> > >> tie in with the existing Java arrow memory book keeping?
> > >> Performance? Something else?
> > >
> > > Arrow is designed to be off-heap. If you have large variable
> > > amounts of on-heap memory in an application, it starts to make it
> > > very hard to make decisions about off-heap versus on-heap memory
> > > since those divisions are by and large static in nature. It's fine
> > > for short lived applications but for long lived applications, if
> > > you're working with a large amount of data, you want to keep most
> > > of your memory in one pool. In the context of Arrow, this is going
> > > to naturally be off-heap memory.
> > >
> > >> I'm afraid this might lead to a "perfect is the enemy of the
> > >> good" situation.
> > >> Starting off with a known good implementation of conversion to
> > >> Arrow can allow us both to profile hot-spots and to provide a
> > >> comparison of implementations to verify correctness.
> > >
> > > I'm not clear what message we're sending as a community if we
> > > produce low performance components. The whole point of Arrow is to
> > > increase performance, not decrease it. I'm targeting good, not
> > > perfect. At the same time, from my perspective, Arrow development
> > > should not be approached in the same way that general Java app
> > > development should be. If we hold a high standard, we'll have
> > > fewer total integrations initially but I think we'll solve more
> > > real world problems.
> > >
> > >> There is also the question of how widely adoptable we want Arrow
> > >> libraries to be.
> > >> It isn't surprising to me that Impala's Avro reader is an order
> > >> of magnitude faster than the stock Java one. As far as I know
> > >> Impala's is a C++ implementation that does JIT with LLVM. We
> > >> could try to use it as a basis for converting to Arrow but I
> > >> think this might limit adoption in some circumstances. Some
> > >> organizations/people might be hesitant to adopt the technology
> > >> due to:
> > >> 1. Use of JNI.
> > >> 2. Use of LLVM to do JIT.
> > >>
> > >> It seems that as long as we have a reasonably general interface
> > >> to data-sources we should be able to optimize/refactor
> > >> aggressively when needed.
> > >
> > > This is somewhat the crux of the problem. It goes a little bit to
> > > who our consuming audience is and what we're trying to deliver.
> > > I'll also say that trying to build a high-quality implementation
> > > on top of a low-quality implementation or library-based adapter is
> > > worse than starting from scratch. I believe this is especially
> > > true in Java where developers are trained to trust hotspot and
> > > that things will be good enough.
> > > That is great in a web app but not in systems software where we
> > > (and I expect others) will deploy Arrow.
> > >
> > >> > 3. Propose a generalized "reader" interface as opposed to
> > >> > making each reader have a different way to package/integrate.
> > >>
> > >> This also seems like a good idea. Is this something you were
> > >> thinking of doing or just a proposal that someone in the
> > >> community should take up before we get too many more
> > >> implementations?
> > >
> > > I don't have something in mind and didn't have a plan to build
> > > something, just want to make sure we start getting consistent
> > > early as opposed to once we have a bunch of readers/adapters.