Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Wes McKinney Mon, 22 Jul 2019 12:52:26 -0700

Sort of tangentially related, but while we are on the topic:

Please, if you would, avoid checking binary test data files into the
main repository. Use https://github.com/apache/arrow-testing if you
truly need to check in binary data -- something to look out for in
code reviews.


On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Jacques,
> Thanks for the clarifications. I think the distinction is useful.
>
> > If people want to write adapters for Arrow, I see that as useful but very
> > different than writing native implementations and we should try to create a
> > clear delineation between the two.
>
>
> What do you think about creating a "contrib" directory and moving the JDBC
> and AVRO adapters into it? We should also probably provide more description
> in pom.xml to make it clear for downstream consumers.
>
> We should probably come up with a name other than adapters for
> readers/writers ("converters"?) and use it in the directory structure for
> the existing Orc implementation?
>
> Thanks,
> Micah
>
>
> On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > As I read through your responses, I think it might be useful to talk about
> > adapters versus native Arrow readers/writers. Adapters are something that
> > adapt an existing API to produce and/or consume Arrow data. A native
> > reader/writer is something that understands the format directly and does not
> > have intermediate representations or APIs the data moves through beyond
> > those that need to be used to complete work.
> >
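
To make the adapter/native distinction above concrete, here is a minimal sketch. Everything in it is hypothetical: the "format" is just a run of 4-byte big-endian ints, `existingLibraryReader` stands in for some pre-existing row-oriented API, and a plain `int[]` stands in for a column. A real reader would write into Arrow vectors rather than an on-heap array.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class AdapterVersusNative {
    /** Stand-in for an existing row-oriented library API (hypothetical). */
    static Iterator<Integer> existingLibraryReader(ByteBuffer encoded) {
        List<Integer> rows = new ArrayList<>();
        ByteBuffer view = encoded.duplicate(); // leave the caller's position untouched
        while (view.remaining() >= Integer.BYTES) {
            rows.add(view.getInt()); // every value is boxed into an intermediate object
        }
        return rows.iterator();
    }

    /** Adapter: goes through the existing API, then copies into the column. */
    static int[] adapterRead(ByteBuffer encoded, int count) {
        int[] column = new int[count];
        Iterator<Integer> rows = existingLibraryReader(encoded);
        for (int i = 0; i < count && rows.hasNext(); i++) {
            column[i] = rows.next();
        }
        return column;
    }

    /** Native: understands the encoding and decodes straight into the column. */
    static int[] nativeRead(ByteBuffer encoded, int count) {
        int[] column = new int[count];
        // Bulk-decode from the current position; no intermediate row objects.
        encoded.duplicate().asIntBuffer().get(column, 0, count);
        return column;
    }
}
```

Both methods produce the same column; the difference is the intermediate list of boxed rows that the adapter path has to build and then discard.
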
> > If people want to write adapters for Arrow, I see that as useful but very
> > different than writing native implementations and we should try to create a
> > clear delineation between the two.
> >
> > Further comments inline.
> >
> >
> >> Could you expand on what level of detail you would like to see a design
> >> document?
> >>
> >
> > A couple of paragraphs seems sufficient. These are the goals of the
> > implementation. We target existing functionality X. It is an adapter. Or it
> > is a native impl. These are the expected memory and processing
> > characteristics, etc.  I've never been one for a huge amount of design but
> > I've seen a number of recent patches appear where there is no upfront
> > discussion. Making sure that multiple people buy into a design is the best
> > way to ensure long-term maintenance and use.
> >
> >
> >> I think this should be optional (the same argument below about predicates
> >> applies so I won't repeat it).
> >>
> >
> > Per my comments above, maybe adapter versus native reader clarifies
> > things. For example, I've been working on a native Avro read
> > implementation. It is little more than chicken scratch at this point but
> > its goals, vision and design are very different than the adapter that is
> > being produced atm.
> >
> >
> >> Can you clarify the intent of this objective?  Is it mainly to tie in with
> >> the existing Java arrow memory book keeping?  Performance?  Something
> >> else?
> >>
> >
> > Arrow is designed to be off-heap. If you have large variable amounts of
> > on-heap memory in an application, it starts to make it very hard to make
> > decisions about off-heap versus on-heap memory since those divisions are by
> > and large static in nature. It's fine for short lived applications but for
> > long lived applications, if you're working with a large amount of data, you
> > want to keep most of your memory in one pool. In the context of Arrow, this
> > is going to naturally be off-heap memory.
> >
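
As an illustration of the "one pool" point above, here is a rough sketch of accounting for off-heap allocations against a single fixed budget. It uses only `java.nio.ByteBuffer.allocateDirect` from the JDK; the `OffHeapBudget` class and its methods are made up for this example and are not Arrow's allocator API.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker that caps direct (off-heap) allocations at a fixed limit. */
final class OffHeapBudget {
    private final long limitBytes;
    private final AtomicLong allocated = new AtomicLong();

    OffHeapBudget(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserves size bytes from the budget and returns a direct buffer. */
    ByteBuffer allocate(int size) {
        long after = allocated.addAndGet(size);
        if (after > limitBytes) {
            allocated.addAndGet(-size); // roll back the failed reservation
            throw new IllegalStateException(
                "off-heap budget exceeded: " + after + " > " + limitBytes);
        }
        return ByteBuffer.allocateDirect(size);
    }

    /**
     * Returns the buffer's capacity to the budget. This is accounting only:
     * the JVM frees the direct memory itself once the buffer is unreachable.
     */
    void release(ByteBuffer buffer) {
        allocated.addAndGet(-buffer.capacity());
    }

    long allocatedBytes() {
        return allocated.get();
    }
}
```

With all data buffers drawn from one tracked pool like this, the on-heap side only has to be sized for bookkeeping objects, which is the static split described above.
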
> >
> >> I'm afraid this might lead to a "perfect is the enemy of the good"
> >> situation.  Starting off with a known good implementation of conversion to
> >> Arrow can allow us both to profile hot-spots and to provide a comparison of
> >> implementations to verify correctness.
> >>
> >
> > I'm not clear what message we're sending as a community if we produce low
> > performance components. The whole point of Arrow is to increase performance, not
> > decrease it. I'm targeting good, not perfect. At the same time, from my
> > perspective, Arrow development should not be approached in the same way
> > that general Java app development should be. If we hold a high standard,
> > we'll have fewer total integrations initially but I think we'll solve more
> > real world problems.
> >
> >> There is also the question of how widely adoptable we want Arrow libraries
> >> to be.
> >> It isn't surprising to me that Impala's Avro reader is an order of
> >> magnitude faster than the stock Java one.  As far as I know Impala's is a
> >> C++ implementation that does JIT with LLVM.  We could try to use it as a
> >> basis for converting to Arrow but I think this might limit adoption in some
> >> circumstances.  Some organizations/people might be hesitant to adopt the
> >> technology due to:
> >> 1.  Use of JNI.
> >> 2.  Use of LLVM to do JIT.
> >>
> >> It seems that as long as we have a reasonably general interface to
> >> data-sources we should be able to optimize/refactor aggressively when
> >> needed.
> >>
> >
> > This is somewhat the crux of the problem. It goes a little bit to who our
> > consuming audience is and what we're trying to deliver. I'll also say that
> > trying to build a high-quality implementation on top of a low-quality
> > implementation or library-based adapter is worse than starting from
> > scratch. I believe this is especially true in Java where developers are
> > trained to trust HotSpot and that things will be good enough. That is great
> > in a web app but not in systems software where we (and I expect others)
> > will deploy Arrow.
> >
> >
> >> >    3. Propose a generalized "reader" interface as opposed to making each
> >> >    reader have a different way to package/integrate.
> >>
> >> This also seems like a good idea.  Is this something you were thinking of
> >> doing or just a proposal that someone in the community should take up
> >> before we get too many more implementations?
> >>
> >
> > I don't have something in mind and didn't have a plan to build something,
> > just want to make sure we start getting consistent early as opposed to once
> > we have a bunch of readers/adapters.
> >
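
For the generalized "reader" interface in item 3, one possible shape is sketched below. All names here (`BatchReader`, `ListBatchReader`, `ReaderDemo`) are hypothetical, and the batch type is a plain `List<Integer>` so the example runs without Arrow on the classpath; it is only meant to show a consumer that depends on the shared contract rather than on any one format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical common contract that each format-specific reader would implement. */
interface BatchReader<B> extends AutoCloseable {
    /** Loads the next batch; returns false once the source is exhausted. */
    boolean loadNextBatch();

    /** Returns the most recently loaded batch. */
    B currentBatch();

    @Override
    void close();
}

/** Toy implementation over an in-memory list, standing in for a real format reader. */
final class ListBatchReader implements BatchReader<List<Integer>> {
    private final List<Integer> values;
    private final int batchSize;
    private int position = 0;
    private List<Integer> current = Collections.emptyList();

    ListBatchReader(List<Integer> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.values = values;
        this.batchSize = batchSize;
    }

    @Override
    public boolean loadNextBatch() {
        if (position >= values.size()) {
            current = Collections.emptyList();
            return false;
        }
        int end = Math.min(position + batchSize, values.size());
        current = new ArrayList<>(values.subList(position, end));
        position = end;
        return true;
    }

    @Override
    public List<Integer> currentBatch() {
        return current;
    }

    @Override
    public void close() {
        current = Collections.emptyList(); // nothing else to release in this toy reader
    }
}

final class ReaderDemo {
    /** Format-agnostic consumer: it only relies on the BatchReader contract. */
    static int sumAll(BatchReader<List<Integer>> reader) {
        int total = 0;
        try (BatchReader<List<Integer>> r = reader) {
            while (r.loadNextBatch()) {
                for (int value : r.currentBatch()) {
                    total += value;
                }
            }
        }
        return total;
    }
}
```

Under this shape an adapter and a native reader for the same format could both implement the interface, so callers would not need to change when one replaces the other.
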
