Hi Jacques,
Thanks for the clarifications. I think the distinction is useful.

> If people want to write adapters for Arrow, I see that as useful but very
> different than writing native implementations and we should try to create a
> clear delineation between the two.


What do you think about creating a "contrib" directory and moving the JDBC
and Avro adapters into it? We should probably also provide a fuller
description in each adapter's pom.xml to make its purpose clear to
downstream consumers.
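Concretely, something along these lines in each adapter's pom.xml (the
wording below is only an illustration, not a concrete proposal):

    <name>Arrow JDBC Adapter</name>
    <description>
      Adapter that converts data obtained through an existing API
      (java.sql.ResultSet) into Arrow vectors; it wraps that API rather
      than reading a storage format natively.
    </description>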

Should we also come up with a name other than "adapters" for native
readers/writers ("converters"?) and use it in the directory structure for
the existing ORC implementation?
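For example, a layout roughly like this (the names here are purely
illustrative and obviously up for debate):

    java/
      contrib/        <- adapters that wrap existing APIs (jdbc, avro, ...)
      converters/     <- native readers/writers, e.g. the existing orc code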

Thanks,
Micah


On Sun, Jul 21, 2019 at 6:09 PM Jacques Nadeau <jacq...@apache.org> wrote:

> As I read through your responses, I think it might be useful to talk about
> adapters versus native Arrow readers/writers. An adapter is something that
> adapts an existing API to produce and/or consume Arrow data. A native
> reader/writer is something that understands the format directly and does not
> move the data through intermediate representations or APIs beyond those that
> need to be used to complete the work.
>
> If people want to write adapters for Arrow, I see that as useful but very
> different than writing native implementations and we should try to create a
> clear delineation between the two.
>
> Further comments inline.
>
>
>> Could you expand on what level of detail you would like to see in a design
>> document?
>>
>
> A couple of paragraphs seems sufficient: the goals of the implementation;
> the existing functionality X we target; whether it is an adapter or a native
> impl; the expected memory and processing characteristics; etc.  I've never
> been one for a huge amount of design, but I've seen a number of recent
> patches appear where there is no upfront discussion. Making sure that
> multiple people buy into a design is the best way to ensure long-term
> maintenance and use.
>
>
>> I think this should be optional (the same arguments below about predicates
>> apply, so I won't repeat them).
>>
>
> Per my comments above, maybe adapter versus native reader clarifies
> things. For example, I've been working on a native Avro reader
> implementation. It is little more than chicken scratch at this point, but
> its goals, vision and design are very different from the adapter that is
> being produced at the moment.
>
>
>> Can you clarify the intent of this objective?  Is it mainly to tie in with
>> the existing Java Arrow memory bookkeeping?  Performance?  Something
>> else?
>>
>
> Arrow is designed to be off-heap. If you have large, variable amounts of
> on-heap memory in an application, it becomes very hard to make decisions
> about off-heap versus on-heap memory, since those divisions are by and large
> static in nature. That is fine for short-lived applications, but for
> long-lived applications working with a large amount of data, you want to
> keep most of your memory in one pool. In the context of Arrow, this is
> naturally going to be off-heap memory.
>
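For concreteness, a minimal sketch of what keeping data in Arrow's single
off-heap pool looks like with the existing Java allocator (RootAllocator and
IntVector are the real classes; the sizes and usage here are only
illustrative):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    public class OffHeapExample {
      public static void main(String[] args) {
        // The vector's buffers live in the allocator's off-heap pool, so
        // JVM heap sizing stays independent of the data volume.
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("example", allocator)) {
          vector.allocateNew(1024);   // off-heap buffers, not the Java heap
          vector.setSafe(0, 42);
          vector.setValueCount(1);
        }                             // memory is released deterministically
      }
    }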
>
>> I'm afraid this might lead to a "perfect is the enemy of the good"
>> situation.  Starting off with a known-good implementation of conversion to
>> Arrow allows us both to profile hot spots and to provide a comparison of
>> implementations to verify correctness.
>>
>
> I'm not clear what message we're sending as a community if we produce
> low-performance components. The whole point of Arrow is to increase
> performance, not decrease it. I'm targeting good, not perfect. At the same
> time, from my perspective, Arrow development should not be approached the
> same way general Java app development is. If we hold a high standard, we'll
> have fewer total integrations initially, but I think we'll solve more
> real-world problems.
>
>> There is also the question of how widely adoptable we want Arrow libraries
>> to be.
>> It isn't surprising to me that Impala's Avro reader is an order of
>> magnitude faster than the stock Java one.  As far as I know, Impala's is a
>> C++ implementation that does JIT with LLVM.  We could try to use it as a
>> basis for converting to Arrow, but I think this might limit adoption in
>> some circumstances.  Some organizations/people might be hesitant to adopt
>> the technology due to:
>> 1.  Use of JNI.
>> 2.  Use of LLVM to do JIT.
>>
>> It seems that as long as we have a reasonably general interface to
>> data sources, we should be able to optimize/refactor aggressively when
>> needed.
>>
>
> This is somewhat the crux of the problem. It goes a little bit to who our
> consuming audience is and what we're trying to deliver. I'll also say that
> trying to build a high-quality implementation on top of a low-quality
> implementation or a library-based adapter is worse than starting from
> scratch. I believe this is especially true in Java, where developers are
> trained to trust HotSpot and assume that things will be good enough. That is
> great in a web app but not in systems software where we (and I expect
> others) will deploy Arrow.
>
>
>> >    3. Propose a generalized "reader" interface as opposed to making each
>> >    reader have a different way to package/integrate.
>>
>> This also seems like a good idea.  Is this something you were thinking of
>> doing or just a proposal that someone in the community should take up
>> before we get too many more implementations?
>>
>
> I don't have something specific in mind and didn't have a plan to build
> something; I just want to make sure we start getting consistent early, as
> opposed to once we already have a bunch of readers/adapters.
>
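To make the "generalized reader interface" idea a bit more concrete, here is
a very rough sketch of the shape it could take (the interface and method
names are purely hypothetical, not an existing Arrow API; Schema and
VectorSchemaRoot are the existing Java classes):

    import java.io.IOException;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.types.pojo.Schema;

    // Hypothetical common surface that both adapters and native readers
    // could implement, so packaging/integration stays consistent.
    public interface ArrowBatchReader extends AutoCloseable {
      // Arrow schema of the data being read.
      Schema schema();

      // Load the next batch into the reader's VectorSchemaRoot;
      // returns false once the source is exhausted.
      boolean loadNextBatch() throws IOException;

      // Root holding the vectors for the most recently loaded batch.
      VectorSchemaRoot getVectorSchemaRoot();
    }

An adapter (e.g. one wrapping a java.sql.ResultSet) and a native reader
could then sit behind the same surface.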
