Re: [DISCUSS][JAVA] Designs & goals for readers/writers

Micah Kornfield Mon, 22 Jul 2019 21:40:53 -0700

Hi Wes,
Are there currently files that need to be moved?

Thanks,
Micah


On Monday, July 22, 2019, Wes McKinney <[email protected]> wrote:

> Sort of tangentially related, but while we are on the topic:
>
> Please, if you would, avoid checking binary test data files into the
> main repository. Use https://github.com/apache/arrow-testing if you
> truly need to check in binary data -- something to look out for in
> code reviews
>
> On Mon, Jul 22, 2019 at 10:38 AM Micah Kornfield <[email protected]>
> wrote:
> >
> > Hi Jacques,
> > Thanks for the clarifications. I think the distinction is useful.
> >
> > > If people want to write adapters for Arrow, I see that as useful but
> > > very different than writing native implementations and we should try
> > > to create a clear delineation between the two.
> >
> >
> > What do you think about creating a "contrib" directory and moving the
> > JDBC and Avro adapters into it? We should also probably provide more
> > description in pom.xml to make it clear for downstream consumers.
> >
> > We should probably come up with a name other than adapters for
> > readers/writers ("converters"?) and use it in the directory structure
> > for the existing ORC implementation?
> >
> > Thanks,
> > Micah
> >
> >
> > On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <[email protected]>
> wrote:
> >
> > > As I read through your responses, I think it might be useful to talk
> > > about adapters versus native Arrow readers/writers. An adapter is
> > > something that adapts an existing API to produce and/or consume Arrow
> > > data. A native reader/writer is something that understands the format
> > > directly and does not have intermediate representations or APIs the
> > > data moves through beyond those that need to be used to complete the
> > > work.
> > >
> > > If people want to write adapters for Arrow, I see that as useful but
> > > very different than writing native implementations and we should try
> > > to create a clear delineation between the two.
> > >
> > > Further comments inline.
> > >
> > >
> > >> Could you expand on what level of detail you would like to see in a
> > >> design document?
> > >>
> > >
> > > A couple of paragraphs seems sufficient. These are the goals of the
> > > implementation. We target existing functionality X. It is an adapter.
> > > Or it is a native impl. These are the expected memory and processing
> > > characteristics, etc. I've never been one for a huge amount of design
> > > but I've seen a number of recent patches appear where there is no
> > > upfront discussion. Making sure that multiple people buy into a design
> > > is the best way to ensure long-term maintenance and use.
> > >
> > >
> > >> I think this should be optional (the same arguments below about
> > >> predicates apply so I won't repeat them).
> > >>
> > >
> > > Per my comments above, maybe adapter versus native reader clarifies
> > > things. For example, I've been working on a native Avro read
> > > implementation. It is little more than chicken scratch at this point
> > > but its goals, vision and design are very different than the adapter
> > > that is being produced at the moment.
> > >
> > >
> > >> Can you clarify the intent of this objective? Is it mainly to tie in
> > >> with the existing Java Arrow memory bookkeeping? Performance?
> > >> Something else?
> > >>
> > >
> > > Arrow is designed to be off-heap. If you have large variable amounts of
> > > on-heap memory in an application, it starts to make it very hard to
> > > make decisions about off-heap versus on-heap memory since those
> > > divisions are by and large static in nature. It's fine for short-lived
> > > applications but for long-lived applications, if you're working with a
> > > large amount of data, you want to keep most of your memory in one pool.
> > > In the context of Arrow, this is going to naturally be off-heap memory.
> > >
> > >
> > >> I'm afraid this might lead to a "perfect is the enemy of the good"
> > >> situation. Starting off with a known good implementation of
> > >> conversion to Arrow can allow us both to profile hot-spots and to
> > >> provide a comparison of implementations to verify correctness.
> > >>
> > >
> > > I'm not clear what message we're sending as a community if we produce
> > > low-performance components. The whole point of Arrow is to increase
> > > performance, not decrease it. I'm targeting good, not perfect. At the
> > > same time, from my perspective, Arrow development should not be
> > > approached in the same way that general Java app development should
> > > be. If we hold a high standard, we'll have fewer total integrations
> > > initially but I think we'll solve more real-world problems.
> > >
> > >> There is also the question of how widely adoptable we want Arrow
> > >> libraries to be.
> > >> It isn't surprising to me that Impala's Avro reader is an order of
> > >> magnitude faster than the stock Java one. As far as I know Impala's
> > >> is a C++ implementation that does JIT with LLVM. We could try to use
> > >> it as a basis for converting to Arrow but I think this might limit
> > >> adoption in some circumstances. Some organizations/people might be
> > >> hesitant to adopt the technology due to:
> > >> 1. Use of JNI.
> > >> 2. Use of LLVM to do JIT.
> > >>
> > >> It seems that as long as we have a reasonably general interface to
> > >> data-sources we should be able to optimize/refactor aggressively when
> > >> needed.
> > >>
> > >
> > > This is somewhat the crux of the problem. It goes a little bit to who
> > > our consuming audience is and what we're trying to deliver. I'll also
> > > say that trying to build a high-quality implementation on top of a
> > > low-quality implementation or library-based adapter is worse than
> > > starting from scratch. I believe this is especially true in Java where
> > > developers are trained to trust HotSpot and that things will be good
> > > enough. That is great in a web app but not in systems software where
> > > we (and I expect others) will deploy Arrow.
> > >
> > >
> > >> >    3. Propose a generalized "reader" interface as opposed to making
> > >> >    each reader have a different way to package/integrate.
> > >>
> > >> This also seems like a good idea. Is this something you were thinking
> > >> of doing or just a proposal that someone in the community should take
> > >> up before we get too many more implementations?
> > >>
> > >
> > > I don't have something in mind and didn't have a plan to build
> > > something, just want to make sure we start getting consistent early as
> > > opposed to once we have a bunch of readers/adapters.
> > >
>
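
To make the adapter-versus-native distinction and the generalized "reader" idea from the quoted thread a bit more concrete, here is a rough Java sketch. Nothing in it is an Arrow API or something proposed in the thread: the interface `IntBatchReader`, both implementations, and the toy "format" (a run of little-endian int32 values) are hypothetical stand-ins, and a direct `ByteBuffer` only plays the role of an off-heap buffer. The adapter pulls values through an existing on-heap API (an `Iterator<Integer>`), while the native reader decodes the bytes directly into the target buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical common interface: fill a caller-supplied buffer with up to
// maxValues 32-bit ints and return how many were written (0 at end of data).
interface IntBatchReader {
    int readBatch(ByteBuffer target, int maxValues);
}

// "Adapter": wraps an existing row-oriented, on-heap API. Every value passes
// through a boxed Integer (an intermediate representation) on its way out.
final class IteratorAdapterReader implements IntBatchReader {
    private final Iterator<Integer> rows;

    IteratorAdapterReader(Iterator<Integer> rows) {
        this.rows = rows;
    }

    @Override
    public int readBatch(ByteBuffer target, int maxValues) {
        int n = 0;
        while (n < maxValues && rows.hasNext()) {
            target.putInt(n * Integer.BYTES, rows.next());
            n++;
        }
        return n;
    }
}

// "Native": understands the encoded bytes itself (assumed here to be
// little-endian int32 values) and decodes straight into the target buffer.
final class NativeInt32Reader implements IntBatchReader {
    private final ByteBuffer encoded;

    NativeInt32Reader(byte[] encodedBytes) {
        this.encoded = ByteBuffer.wrap(encodedBytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    @Override
    public int readBatch(ByteBuffer target, int maxValues) {
        int n = 0;
        while (n < maxValues && encoded.remaining() >= Integer.BYTES) {
            target.putInt(n * Integer.BYTES, encoded.getInt());
            n++;
        }
        return n;
    }
}

final class ReaderSketch {
    // Consumer code depends only on the interface, so an adapter could later
    // be swapped for a native reader without touching the caller.
    static long sum(IntBatchReader reader) {
        // Direct buffer stands in for off-heap memory; reused for every batch.
        ByteBuffer batch = ByteBuffer.allocateDirect(4 * Integer.BYTES)
                                     .order(ByteOrder.LITTLE_ENDIAN);
        long total = 0;
        int n;
        while ((n = reader.readBatch(batch, 4)) > 0) {
            for (int i = 0; i < n; i++) {
                total += batch.getInt(i * Integer.BYTES);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] encoded = {1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0};
        System.out.println(sum(new NativeInt32Reader(encoded)));
        System.out.println(sum(new IteratorAdapterReader(Arrays.asList(1, 2, 3).iterator())));
    }
}
```

Both readers produce the same values through the same interface. The difference the thread cares about is what happens in between: the adapter inherits the allocation and boxing behaviour of the API it wraps, whereas the native reader controls the whole path from encoded bytes to the output buffer.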
